March 20, 2026

TL;DR

  • DORA metrics were designed for a world where humans wrote all the code. That world no longer exists. When AI generates 30-70% of committed code, the assumptions underneath Deployment Frequency, Lead Time, and Change Failure Rate quietly collapse.
  • Not all DORA metrics break equally. MTTR holds up well. Change Failure Rate has partial value. Deployment Frequency and Lead Time become misleading without additional context.
  • Code churn has doubled since AI coding went mainstream. GitClear research shows churn rising from 3.3% to 5.7-7.1% — a signal DORA was never designed to capture.
  • The fix is evolution, not abandonment. Keep what works, add what’s missing: AI code share, code durability, complexity-adjusted throughput, and innovation rate.

DORA Metrics: A Quick Recap

In 2018, Dr. Nicole Forsgren, Jez Humble, and Gene Kim published Accelerate, giving engineering leadership something it had never had: a shared, research-backed language for measuring software delivery performance. The DORA metrics — named for the DevOps Research and Assessment team — became the standard. They showed up in board decks, vendor pitches, and engineering reviews worldwide.

That contribution deserves real respect. Before DORA, engineering productivity conversations were a mess of gut feelings, lines-of-code counts, and story point debates. DORA cut through the noise with four metrics grounded in research across thousands of organizations.

Here’s the framework at a glance:

| Metric | What It Measures | The Assumption |
| --- | --- | --- |
| Deployment Frequency | How often code ships to production | More deploys = healthier, more capable team |
| Lead Time for Changes | Time from commit to production | Shorter = better flow through the pipeline |
| Mean Time to Recovery | How quickly incidents are resolved | Faster recovery = more resilient systems |
| Change Failure Rate | Percentage of deploys causing failures | Lower rate = higher code quality |

These metrics worked — and worked well — in a world where humans wrote the code, humans reviewed the code, and the primary bottleneck was getting that human-written code safely into production.

That world started changing in late 2022. By 2025, it had changed fundamentally. And by 2026, teams relying solely on DORA metrics are measuring the wrong things.

How AI Breaks Each Metric

The problem is not that DORA metrics produce wrong numbers. They still measure exactly what they always measured. The problem is that what they measure no longer means what it used to mean.

Deployment Frequency

The assumption: More frequent deployments indicate a healthier, more capable engineering team. High-performing teams deploy on demand, multiple times per day. Low performers deploy between once per month and once every six months.

How AI breaks it: AI-assisted workflows can dramatically increase deployment frequency without a corresponding increase in meaningful output. When Copilot, Cursor, or similar tools generate boilerplate, test scaffolding, and configuration changes, deploy counts inflate — sometimes dramatically.

Consider a concrete example. A team of eight engineers ships five deploys per week — a solid pace. They adopt AI coding tools. Within two months, they’re shipping twenty deploys per week. By DORA standards, they’ve jumped from “medium” to “elite” performance.

But look closer at those fifteen additional weekly deploys:

  • Four are AI-generated test files that mirror existing test patterns
  • Three are configuration changes and dependency updates AI suggested
  • Five are boilerplate service scaffolding that AI produced in minutes
  • Two are small refactors AI flagged and auto-completed
  • One is an actual new feature

Deployment frequency quadrupled. Meaningful output barely moved.

This is not a hypothetical. Teams across the industry are reporting inflated deploy counts post-AI adoption. The metric is not lying — code is shipping more frequently. But the assumption that higher frequency reflects a healthier team no longer holds when a significant portion of that frequency is driven by AI-generated throughput rather than human engineering judgment.

The question to ask instead: “What percentage of our deploys ship customer-facing features versus infrastructure changes versus AI-generated boilerplate?”
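That question becomes answerable if each deploy is tagged with a category at ship time and the feature-bearing share is tracked over time. A minimal Python sketch of the idea, applied to the fifteen additional weekly deploys from the example above (the category labels are illustrative, not a standard taxonomy):

```python
from collections import Counter

def deploy_breakdown(deploys):
    """Share of deploys per category, to separate feature work
    from scaffolding, config churn, and boilerplate."""
    counts = Counter(d["category"] for d in deploys)
    total = len(deploys)
    return {cat: n / total for cat, n in counts.items()}

# The fifteen additional weekly deploys from the example above:
extra = (
    [{"category": "ai_test_scaffolding"}] * 4
    + [{"category": "config_and_deps"}] * 3
    + [{"category": "ai_boilerplate"}] * 5
    + [{"category": "ai_refactor"}] * 2
    + [{"category": "feature"}] * 1
)

shares = deploy_breakdown(extra)
print(f"feature share of additional deploys: {shares['feature']:.0%}")  # 7%
```

Even a coarse tagging scheme like this exposes the gap the raw deploy count hides: frequency quadrupled, but only one in fifteen new deploys carried a feature.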

Lead Time for Changes

The assumption: Shorter time from first commit to running in production means better flow through the delivery pipeline. High performers have lead times under one day. Low performers take between one and six months.

How AI breaks it: AI can produce a complete pull request — code, tests, documentation — in minutes that previously took a developer days. Lead time from commit to production plummets. On paper, every team using AI coding tools looks like an elite performer.

But the bottleneck has shifted. Before AI, the constraint was usually writing the code. Developers understood what they wrote because they wrote it. Reviews were faster because the reviewer could reason about the author’s intent by reading the code.

Now the constraint is review. An AI-generated PR may be syntactically correct, pass all automated checks, and still require significant review time because:

  • The reviewer didn’t write the code and must build a mental model from scratch
  • AI-generated code often works but follows patterns the team doesn’t use elsewhere
  • Subtle architectural mismatches are harder to spot in code you didn’t author
  • The volume of AI-generated PRs can overwhelm review capacity

Lead time may drop from three days to three hours — while the time a PR spends waiting for meaningful human review actually increases. The metric now measures “how fast AI generates code” more than “how efficiently our delivery pipeline moves work to production.”

Several teams have reported a revealing pattern: lead time improved by 60-70% after AI adoption, but the percentage of that lead time spent in code review grew from 20% to over 50%. The total pipeline is faster, but the human bottleneck is more concentrated than ever.

The question to ask instead: “Where is our actual bottleneck now — writing, reviewing, or deploying? And how is the shift to AI-generated code changing review quality and depth?”
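One way to locate the bottleneck is to measure review wait as a share of total lead time per pull request. A sketch under assumed timestamp fields (the field names are illustrative; real data would come from your Git host's API):

```python
from datetime import datetime, timedelta

def review_share(pr):
    """Fraction of total lead time (first commit -> production)
    spent between review request and review completion."""
    lead = pr["deployed_at"] - pr["first_commit_at"]
    review = pr["review_done_at"] - pr["review_requested_at"]
    return review / lead

t0 = datetime(2026, 3, 2, 9, 0)
pr = {
    "first_commit_at": t0,                          # AI drafts the PR in minutes
    "review_requested_at": t0 + timedelta(minutes=20),
    "review_done_at": t0 + timedelta(hours=2),      # human review is the long pole
    "deployed_at": t0 + timedelta(hours=3),
}
print(f"review share of lead time: {review_share(pr):.0%}")  # 56%
```

A three-hour lead time looks elite by DORA standards, yet more than half of it is review wait, which is exactly the concentration the aggregate number conceals.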

Mean Time to Recovery (MTTR)

The assumption: Faster incident recovery indicates more resilient systems and more capable operations teams.

How AI breaks it: Here’s the good news: MTTR is the DORA metric least affected by AI-assisted coding. Incident response remains a fundamentally human activity. When production breaks at 2 AM, it is still an on-call engineer triaging alerts, reading logs, forming hypotheses, and deploying fixes. AI coding tools help at the margins — generating a hotfix faster, suggesting a rollback command — but the core loop is human judgment under pressure.

MTTR remains a valid and useful metric in 2026.

The caveat: While MTTR itself holds up, there is a related concern. If AI-generated code introduces more subtle, harder-to-diagnose bugs — and early evidence suggests it can — then MTTR might stay flat while incident frequency quietly rises. A team recovering from incidents in thirty minutes is impressive. A team recovering from incidents in thirty minutes but having three times as many incidents has a different problem entirely.

MTTR needs a companion metric: incident frequency, ideally broken down by whether the triggering code was human-written or AI-generated. Recovery speed without incidence rate tells an incomplete story.
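A minimal way to pair the two signals, assuming each incident record is tagged with the origin of the triggering change (that tagging is the hard part in practice, and an assumption here):

```python
from collections import defaultdict

def incident_profile(incidents):
    """Pair recovery speed with incidence count, split by whether the
    triggering change was human-written or AI-generated."""
    by_origin = defaultdict(list)
    for inc in incidents:
        by_origin[inc["origin"]].append(inc["minutes_to_recover"])
    return {
        origin: {"count": len(times), "mttr": sum(times) / len(times)}
        for origin, times in by_origin.items()
    }

incidents = [
    {"origin": "human", "minutes_to_recover": 30},
    {"origin": "ai", "minutes_to_recover": 28},
    {"origin": "ai", "minutes_to_recover": 35},
    {"origin": "ai", "minutes_to_recover": 27},
]
profile = incident_profile(incidents)
print(profile)  # identical MTTR, but three times as many AI-triggered incidents
```

In this toy data both origins recover in thirty minutes on average, so MTTR alone would report no change, while the count column tells the real story.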

Change Failure Rate

The assumption: A lower percentage of deployments causing production failures indicates higher code quality and better engineering practices.

How AI breaks it: This is where the gap between what DORA measures and what actually matters becomes most apparent.

AI-generated code is remarkably good at passing tests. This makes sense — modern AI coding tools have been trained on millions of test files and understand test patterns deeply. When AI writes code and the accompanying tests, both tend to follow well-established patterns. Change failure rate — which measures production-visible failures — may hold steady or even improve.

But beneath that stable surface, something else is happening.

GitClear’s research across millions of lines of code provides the clearest signal. Their analysis of AI-assisted development patterns shows code churn — the percentage of code that is rewritten or reverted within two weeks of being committed — roughly doubled from a baseline of approximately 3.3% to between 5.7% and 7.1% as AI coding tools gained widespread adoption across the industry.1

Let that sink in. Code is being rewritten or reverted at double the historical rate, but change failure rate — the DORA metric meant to catch quality problems — does not capture this. Change failure rate only sees production failures: outages, error spikes, rollbacks triggered by monitoring. It misses code that technically works but is fragile, duplicative, or architecturally inconsistent enough that another developer rewrites it within weeks.

This is the quality gap DORA was never designed to detect. Code can pass every test, never cause a production incident, and still represent engineering waste if it does not survive contact with the broader codebase.

The question to ask instead: “What percentage of our AI-generated code survives 30 days without being reverted or substantially rewritten? And how does that compare to our human-written code?”
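That survival question can be computed directly from commit history. A sketch, assuming each commit record carries a flag for whether its lines were later reverted or substantially rewritten (deriving that flag requires diff analysis along the lines of GitClear's method; it is not a git built-in):

```python
from datetime import date, timedelta

def survival_rate(commits, horizon_days=30, today=None):
    """Share of commits at least `horizon_days` old that have not
    been reverted or substantially rewritten since."""
    today = today or date.today()
    eligible = [c for c in commits
                if (today - c["committed"]).days >= horizon_days]
    if not eligible:
        return None  # nothing old enough to judge yet
    surviving = sum(1 for c in eligible if not c["rewritten"])
    return surviving / len(eligible)

today = date(2026, 3, 20)
ai_commits = [
    {"committed": today - timedelta(days=45), "rewritten": True},
    {"committed": today - timedelta(days=40), "rewritten": False},
    {"committed": today - timedelta(days=35), "rewritten": True},
    {"committed": today - timedelta(days=33), "rewritten": False},
    {"committed": today - timedelta(days=10), "rewritten": False},  # too recent, excluded
]
print(f"30-day survival: {survival_rate(ai_commits, today=today):.0%}")  # 50%
```

Running the same computation over human-authored and AI-authored commit sets gives the side-by-side comparison the question asks for.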

What DORA Doesn’t Measure (And Now Needs To)

The gaps in DORA are not flaws in the original research. They are artifacts of a world that no longer exists. When all code was human-written, several things could be safely assumed. Those assumptions now need to be measured explicitly:

AI Code Share. What percentage of committed code is AI-generated versus human-written? DORA is entirely silent on this question because in 2018 it did not need to be asked. In 2026, a team’s AI code share — and how it trends over time — is foundational context for interpreting every other metric. A team where 20% of code is AI-generated and a team where 70% is AI-generated are operating in fundamentally different modes, even if their DORA numbers are identical.
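A line-weighted version of AI code share might look like the sketch below, assuming per-commit AI-line attribution from editor telemetry or commit trailers (that attribution source is an assumption; git does not track authorship by tool natively):

```python
def ai_code_share(commits):
    """Line-weighted share of committed code that is AI-generated.
    Each commit carries (ai_lines, total_lines) counts."""
    ai = sum(c["ai_lines"] for c in commits)
    total = sum(c["total_lines"] for c in commits)
    return ai / total if total else 0.0

commits = [
    {"ai_lines": 120, "total_lines": 150},
    {"ai_lines": 0, "total_lines": 80},
    {"ai_lines": 60, "total_lines": 70},
]
print(f"AI code share: {ai_code_share(commits):.0%}")  # 60%
```

Weighting by lines rather than by commit count matters: a handful of large AI-generated commits can dominate the codebase even when most commits are human-written.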

Code Durability. Does code stick, or does it churn? Change failure rate catches code that breaks production. It does not catch code that quietly gets rewritten, refactored, or replaced within weeks. Code durability — the percentage of code surviving 14 or 30 days without substantial modification — is the quality signal that matters when AI increases code volume dramatically.

Complexity-Adjusted Output. Is AI doing the easy work or the hard work? DORA treats all deployments as equal. A deploy that ships a new authentication system counts the same as a deploy that updates a README. When AI handles the simple tasks — and it handles them well — raw throughput metrics inflate without reflecting meaningful engineering progress. Output needs to be weighted by complexity to remain interpretable.
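Complexity weighting can be sketched against the deployment-frequency example from earlier (the weights here are illustrative; a team would calibrate its own):

```python
# Illustrative complexity weights -- calibrate these per team.
WEIGHTS = {"trivial": 0.2, "routine": 1.0, "hard": 3.0}

def complexity_adjusted_throughput(deploys):
    """Sum deploy weights by complexity instead of counting each as 1."""
    return sum(WEIGHTS[d["complexity"]] for d in deploys)

before = [{"complexity": "routine"}] * 5                 # 5 deploys/week pre-AI
after = ([{"complexity": "trivial"}] * 14                # AI tests, config, boilerplate
         + [{"complexity": "routine"}] * 5
         + [{"complexity": "hard"}] * 1)                 # the one new feature
print(f"raw:      {len(before)} -> {len(after)} deploys/week")
print(f"weighted: {complexity_adjusted_throughput(before):.1f} -> "
      f"{complexity_adjusted_throughput(after):.1f}")    # 5.0 -> 10.8
```

Raw throughput quadruples; the weighted measure roughly doubles. Both are improvements, but only the second is in proportion to the engineering work actually done.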

Innovation Rate. Are we shipping new features or just maintaining existing ones? DORA does not distinguish between a deploy that launches a new product capability and one that patches a dependency. As AI takes over maintenance-oriented work (and it is particularly good at this), the ratio of innovation to maintenance in deploy counts shifts. Without tracking innovation rate explicitly, teams can look highly productive while their product stagnates.

What to Use Instead

DORA gave engineering leadership a shared language. What comes next should extend that language, not discard it.

The Developer AI Impact Framework is built on a simple principle: keep what works from DORA, evolve what is partially broken, replace what is fundamentally misleading, and add what is newly necessary.

Here is the short version:

Keep. MTTR remains a valid metric. It measures something real — incident recovery speed — that has not been distorted by AI-assisted development. Change Failure Rate retains partial value as a trailing indicator of production quality, though it needs supplementation.

Evolve. Add Code Turnover Rate alongside Change Failure Rate to capture the quality signals that production failure rates miss. Add complexity weighting to throughput metrics so that AI-generated boilerplate does not inflate output numbers. These are not replacements — they are necessary companions to existing metrics.

Replace. Deployment Frequency and Lead Time for Changes need fundamental rethinking when AI writes a significant share of the code. Raw counts and raw speed are no longer reliable proxies for team health. Complexity-Adjusted Throughput (CAT) replaces the raw volume measure with one that accounts for what kind of work is being done and how much of it survives.

Add. AI Adoption Rate, AI Code Share, Code Durability, and Innovation Rate are metrics designed for how software is actually built in 2026. They address questions DORA could not have anticipated: How much of our code is AI-generated? Is it durable? Are we innovating or just maintaining?

The full framework details how these metrics work together, how to implement them, and what benchmarks to target.

Read the full Developer AI Impact Framework →

Frequently Asked Questions

Are DORA metrics still useful in 2026?

Yes, but they are no longer sufficient on their own. DORA metrics still measure real things — deployment speed, recovery time, failure rates. The problem is that AI-assisted development has changed what those measurements mean. MTTR remains fully valid. Change Failure Rate has partial value. Deployment Frequency and Lead Time have become misleading without additional context about AI code share and code complexity. Think of DORA as a necessary but incomplete foundation.

What metrics should replace DORA for AI-native teams?

You do not need to replace DORA entirely — you need to extend it with AI-aware metrics. The Developer AI Impact Framework adds AI Code Share (what percentage of code is AI-generated), Code Turnover Rate (how much code survives beyond 30 days), Complexity-Adjusted Throughput (output weighted by difficulty), and Innovation Rate (new features versus maintenance). Together with the DORA metrics that still hold up, these form a complete picture of engineering performance in an AI-native environment.

Does AI coding make deployment frequency meaningless?

Not meaningless, but substantially less informative. Deployment frequency still tells you something — a team deploying zero times per month has a different problem than a team deploying fifty times. But when AI can generate and ship boilerplate code at high volume, the gap between deployment count and meaningful output widens. The fix is not to stop measuring deploy frequency but to pair it with complexity-adjusted throughput, which distinguishes between a deploy that ships a payment processing overhaul and one that adds an AI-generated utility function.

How has code churn changed since AI coding tools became mainstream?

Code churn has roughly doubled. GitClear’s analysis across millions of lines of code found that code churn — the rate at which recently written code is rewritten or reverted — increased from a historical baseline of approximately 3.3% to between 5.7% and 7.1% as AI coding tools gained widespread adoption.1 This means AI-generated code is being discarded or substantially rewritten at roughly twice the rate of human-written code historically. Notably, standard DORA metrics did not flag this quality shift because the rewritten code typically did not cause production failures — it was simply not good enough to keep.

Can you use DORA and AI-native metrics together?

Yes, and that is the recommended approach. The goal is not to discard DORA but to layer AI-aware metrics on top of the DORA foundation. Keep MTTR as-is. Supplement Change Failure Rate with Code Turnover Rate. Replace raw Deployment Frequency with Complexity-Adjusted Throughput. Add entirely new dimensions like AI Code Share and Innovation Rate. This gives you backward compatibility with existing DORA benchmarks while adding the visibility that AI-assisted development demands.

Additional references:

  • Forsgren, N., Humble, J., & Kim, G. (2018). Accelerate: The Science of Lean Software and DevOps. IT Revolution Press. The foundational research behind DORA metrics.
  • DORA Team. “DORA | Get Better at Getting Better.” dora.dev. The ongoing research program that maintains and evolves the DORA framework.
  • Larridin. “The Developer AI Impact Framework.” The framework referenced in this article for extending DORA metrics to account for AI-assisted development.

Related Resources

  • The Developer AI Impact Framework
  • Developer Productivity Benchmarks 2026
  • What Is Complexity-Adjusted Throughput?
  • Code Churn in the AI Era (coming soon)
  • DORA Metrics Explained: The Complete Guide (coming soon)

1. GitClear, “Coding on Copilot: 2023 Data Suggests Downward Pressure on Code Quality” (2024) and subsequent 2025 analysis. GitClear analyzed code contribution patterns across a large dataset and found that metrics associated with code churn — including moved code, copy/pasted code, and code updated shortly after creation — increased significantly concurrent with widespread AI coding tool adoption. The approximately 3.3% to 5.7-7.1% churn increase cited throughout this article is based on their published research.
