Reducing Noise Factors When Evaluating GenAI

The uncomfortable truth: measuring GenAI is harder than adopting it

Google’s DORA 2025 research shows 90% of software engineers already use some form of GenAI coding assistant. As this becomes standard practice, many organizations assume that once these tools are rolled out, productivity metrics will naturally reflect their gains.

The pattern has become familiar: teams enable tools like Cursor or GitHub Copilot, PRs appear to move faster, developers report feeling more productive, dashboards show increased velocity, and leadership presents the results as: “GenAI increased engineering productivity by X%.”

That is where attribution becomes difficult.

Did GenAI drive the improvement, or did it coincide with the completion of a major refactor? Did cycle time drop because of AI assistance, or because more senior engineers happened to staff the project?

Without deliberate effort to separate signal from coincidence, GenAI evaluations tend to fail in one of two ways. Some organizations declare success too early, relying on shallow indicators. Others pull back entirely because results feel inconsistent or contradictory.

The goal of this article is to outline how engineering leaders can reduce this attribution noise and measure GenAI’s impact in a way that stands up to real, high-stakes decisions.

What does “noise” actually mean in GenAI evaluation?

Before noise can be reduced, it needs to be defined.

In the context of GenAI productivity impact, noise refers to the variables that blur the relationship between AI adoption and engineering outcomes. They’re the alternative explanations for any change you observe, the variables that make causation impossible to establish without careful isolation.

In practice, this includes:

  • Different team compositions (senior-heavy vs. junior-heavy).
  • Variations in review culture and code ownership.
  • Shifts in work type (feature development vs. maintenance vs. migration).
  • Differences in tech stacks and domains.
  • Changes in delivery pressure or roadmap priorities.

A common mistake is assuming these factors naturally “average out” at scale.

GenAI adoption rarely happens uniformly. It starts with specific teams, specific repositories, and specific types of work. Even in organizations where most engineers have access to AI tools, only a subset of commits and pull requests are meaningfully influenced by them.

Usage varies over time, by team, and by work type. Partial adoption, combined with uneven application across testing, refactoring, and feature development, means that an organization-wide average collapses fundamentally different workflows into a single comparison group. When these heterogeneous patterns are averaged together, attribution breaks down and genuine effects are obscured.
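To see why, consider a minimal sketch with invented numbers (the data and column names are purely illustrative, not Milestone data): when GenAI usage happens to be concentrated in heavier work, assisted PRs can look slower in the pooled average even though they are faster within every work type.

```python
import pandas as pd

# Invented PR-level data: GenAI usage is concentrated in heavier
# migration work, which distorts the pooled comparison.
prs = pd.DataFrame({
    "work_type":    ["feature"] * 4 + ["migration"] * 4,
    "genai":        [True, False, False, False, True, True, True, False],
    "review_hours": [3, 4, 5, 4, 20, 22, 21, 30],
})

# Pooled average: GenAI-assisted PRs appear slower (~16.5h vs ~10.8h)...
print(prs.groupby("genai")["review_hours"].mean())

# ...yet within each work type, GenAI-assisted PRs are faster.
print(prs.groupby(["work_type", "genai"])["review_hours"].mean())
```

The gap does not come from the tool; it comes from what the tool was applied to. That is the attribution problem in miniature.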

This is why reducing noise is not about suppressing data; it’s about contextualizing it.

And among all the factors, one stands out as both highly impactful and widely misunderstood: PR size.

PR size is not the problem; unexamined PR growth is

Review depth & PR size monitoring across an organization – Milestone Dashboard

Across most organizations, GenAI-assisted pull requests often appear to take longer to merge than comparable non-GenAI PRs when viewed in aggregate. At first glance, this seems contradictory. If GenAI accelerates code generation, cycle time, coding time, and review time should all improve.

The contradiction disappears once PR size is treated as a structural signal rather than a performance metric.

GenAI lowers the marginal cost of producing code. When the cost curve changes, behavior changes: GenAI alters what engineers choose to bundle into a single unit of work.

As a result, engineers tend to package more work into each pull request: additional boilerplate, broader refactors, expanded test coverage, and opportunistic cleanup that would previously have been deferred. This behavior is rational and often well-intentioned, but it materially changes how work flows through the system.

Larger and denser PRs behave differently by design. They increase reviewers’ cognitive load, lengthen review cycles, and increase the likelihood of rework. In effect, GenAI shifts the bottleneck downstream, from code production to review, validation, and coordination.

This shift is easy to miss in aggregate metrics. A growing volume of small, fast-merging PRs can make overall cycle time appear healthier, even as a smaller number of large PRs quietly absorb most of the risk, review effort, and delay. When these fundamentally different workflows are averaged together, the result is misleading attribution rather than insight.
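A like-for-like view is straightforward to sketch once PR-level data is available. Assuming hypothetical columns such as `lines_changed`, `genai_assisted`, `cycle_hours`, and `review_hours` (illustrative names, not a Milestone schema), the comparison is made within size bands rather than across the whole population:

```python
import pandas as pd

def compare_within_size_bands(prs: pd.DataFrame) -> pd.DataFrame:
    """Median cycle and review time per (size band, GenAI flag).

    Expects illustrative PR-level columns: lines_changed, genai_assisted,
    cycle_hours, review_hours. Band edges are arbitrary examples.
    """
    bands = [0, 50, 200, 600, float("inf")]
    labels = ["XS (<50)", "S (50-199)", "M (200-599)", "L (600+)"]
    banded = prs.assign(
        size_band=pd.cut(prs["lines_changed"], bins=bands, labels=labels)
    )
    return (
        banded.groupby(["size_band", "genai_assisted"], observed=True)[
            ["cycle_hours", "review_hours"]
        ]
        .median()
        .round(1)
    )
```

Medians within a band are harder to distort than a single organization-wide mean, because a handful of very large PRs can no longer dominate the result.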

This is why PR size must be treated as context, not as a verdict.

GenAI should not be evaluated by how much code it produces, but by whether it improves efficiency, security, and long-term maintainability.

What the Milestone dashboard makes visible (and why that matters)

Number of PRs per PR size group – Milestone Dashboard

Once PR size is treated as context rather than a performance signal, the evaluation problem becomes measurable.

The Milestone dashboard is designed to explicitly surface this context. Instead of aggregating all GenAI activity into a single average, it decomposes engineering work along dimensions that materially affect delivery outcomes.

In practice, this allows leaders to observe:

  • How pull request volume is distributed across PR size groups.
  • How cycle time, coding time, and review time evolve within each size band.
  • How GenAI-assisted and non-GenAI PRs behave when compared on a like-for-like basis.
  • How these patterns vary by team and over time.

This changes the nature of the questions leaders can ask. Rather than debating whether “GenAI is slowing us down,” they can see where friction is emerging and why. PR size inflation becomes visible instead of inferred. Review delays can be examined within comparable scopes rather than averaged across unrelated workflows.
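One way to make PR-size inflation directly observable rather than inferred is to track the share of oversized PRs per team over time. A rough sketch, again with illustrative column names (`merged_at`, `team`, `lines_changed`) and an arbitrary threshold:

```python
import pandas as pd

def large_pr_share(prs: pd.DataFrame, threshold: int = 600) -> pd.DataFrame:
    """Fraction of merged PRs above `threshold` changed lines, per team per month."""
    enriched = prs.assign(
        month=pd.to_datetime(prs["merged_at"]).dt.to_period("M"),
        is_large=prs["lines_changed"] > threshold,
    )
    return (
        enriched.groupby(["team", "month"])["is_large"]
        .mean()               # share of large PRs in each team-month
        .unstack("month")     # rows: teams, columns: months
        .round(2)
    )
```

A rising share of large PRs after rollout is exactly the kind of structural shift that aggregate cycle-time charts tend to hide.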

Crucially, this avoids one of the most common evaluation errors: drawing conclusions from aggregate metrics that mix fundamentally different workflows.

From noisy metrics to clear decisions

Productivity Impact Assessment in Milestone Platform

Once PR size is used as context rather than a proxy for performance, the conversation shifts from reaction to intent.

Leaders can determine:

  • Where GenAI usage is healthy and well-contained.
  • Where PR discipline or review practices need adjustment.
  • Where tooling accelerates work rather than amplifying bloat.
  • How to evolve review workflows to match AI-augmented development.

This is the difference between reacting to metrics and understanding systems. GenAI does not simply change how fast code is written. It changes how work is packaged, reviewed, and shipped. Milestone does not hide that complexity; it makes it observable, so engineering leaders can act with confidence rather than intuition.

Interpreting GenAI impact across tech stacks and teams

Different programming languages, frameworks, and repository structures interact with GenAI in fundamentally different ways. A boilerplate-heavy codebase may see rapid throughput gains, whereas highly customized, legacy codebases often incur higher review and integration costs. Comparing teams without accounting for these differences introduces another layer of noise.

By breaking metrics down by repository, language, and team, leaders can observe patterns that are otherwise invisible. These patterns are not indicators of team quality. Without this context, teams are often misclassified as inefficient when they are, in fact, operating under different structural constraints, such as language choice, codebase maturity, or architectural complexity.
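A hedged sketch of that breakdown, assuming each PR record carries illustrative `language` (or `repo` or `team`) and `genai_assisted` fields: compute the GenAI gap per stratum instead of a single organization-wide number.

```python
import pandas as pd

def genai_gap_by_stratum(prs: pd.DataFrame, stratum: str = "language") -> pd.Series:
    """Difference in median cycle time (GenAI minus non-GenAI) per stratum.

    Positive values mean assisted PRs take longer in that stratum.
    Column names are illustrative placeholders, not a fixed schema.
    """
    medians = prs.groupby([stratum, "genai_assisted"])["cycle_hours"].median()
    wide = medians.unstack("genai_assisted")   # columns: False, True
    return (wide[True] - wide[False]).sort_values()
```

Calling it with `stratum="team"` or `stratum="repo"` gives the same view along the other dimensions described above, so structural differences stop masquerading as performance differences.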

What disciplined GenAI evaluation unlocks for engineering leaders

When GenAI impact is measured with proper context, leaders can explain why metrics move and not just that they moved. Cycle time increases can be traced to PR size rather than assumed inefficiency. Review delays can be traced to workflow changes rather than to individual performance. Efficiency gains can be demonstrated where they genuinely exist.
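Where a more formal attribution is needed, the same idea can be expressed as a regression that holds PR size and team constant. A minimal sketch using statsmodels (the model and column names are assumptions for illustration, not a prescribed methodology), applied to a PR-level DataFrame like the `prs` used in the earlier sketches:

```python
import numpy as np
import statsmodels.formula.api as smf

def genai_effect_controlling_for_size(prs):
    """OLS sketch: does GenAI assistance still predict longer cycle times
    once PR size and team are held constant? Illustrative columns:
    cycle_hours, genai_assisted (bool), lines_changed, team."""
    model = smf.ols(
        "np.log1p(cycle_hours) ~ genai_assisted + np.log1p(lines_changed) + C(team)",
        data=prs,
    ).fit()
    return model.params.filter(like="genai_assisted")
```

If the coefficient on assistance shrinks toward zero once size enters the model, the apparent slowdown was really a size effect, which is precisely the distinction described above.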

This enables ROI narratives that stand up to executive scrutiny because they are grounded in engineering reality rather than enthusiasm for new tools.

Disciplined evaluation also ensures safer, more targeted GenAI rollouts. Not every team, repository, or workflow benefits from GenAI in the same way. Contextual measurement reveals where adoption is healthy, where it requires guardrails, and where expectations need recalibration.

Equally important, a transparent and well-structured evaluation reframes GenAI as a productivity amplifier rather than a replacement. Engineers, data scientists, and analysts can see that judgment, review discipline, and design decisions remain central, and leadership can reinforce that message with evidence.

Closing: from experimentation to production

Most organizations are already past the experimentation phase. GenAI tools are in daily use, and adoption continues to rise. The real challenge now is operational maturity, and reducing noise is the necessary first step.

Until teams can reliably separate signal from noise, every downstream decision, from governance and policy to tooling and investment, rests on shaky ground.

GenAI doesn’t fail because it lacks potential. It fails when we measure it as a demo rather than as production engineering.

Milestone’s engineering intelligence platform was built to support this shift. It enables teams to evaluate GenAI’s impact with statistical defensibility, using the same standards of rigor they apply to production systems, with a sustained focus on productivity and code quality. By narrowing noise, it allows leaders to make decisions grounded in evidence rather than intuition.
