DORA metrics are four measures of software delivery performance. They were developed by the DevOps Research and Assessment (DORA) team -- a research program that, starting in 2014, surveyed thousands of engineering organizations to identify the capabilities and practices that predict high performance in software delivery.
The core finding was straightforward and powerful: organizations that excel at software delivery -- shipping code frequently, quickly, reliably, and with fast recovery from failures -- outperform their peers on every business metric that matters. Revenue growth, market share, profitability, and employee satisfaction all correlate with strong software delivery performance.
DORA gave engineering leaders something the industry had never had: a small set of research-backed metrics that could be applied across organizations, compared over time, and used to drive improvement. Before DORA, engineering productivity conversations were dominated by unreliable proxies -- lines of code, story points, velocity charts -- with no empirical basis for what actually predicted performance. DORA changed that.
The four metrics are:

- Deployment Frequency (DF)
- Lead Time for Changes (LT)
- Mean Time to Recovery (MTTR)
- Change Failure Rate (CFR)
The DORA research program was founded by Dr. Nicole Forsgren, Jez Humble, and Gene Kim. Beginning in 2014, the team conducted annual surveys of engineering organizations worldwide, collecting data on practices, capabilities, and outcomes. The research was rigorous -- applying statistical methods to identify predictive relationships, not just correlations, between engineering practices and organizational performance.
The annual State of DevOps Reports -- published in partnership with various sponsors over the years -- became essential reading for engineering leaders. Each report refined the findings, added capabilities, and expanded the evidence base.
The definitive synthesis of DORA's research came in 2018 with the publication of Accelerate: The Science of Lean Software and DevOps by Forsgren, Humble, and Kim. The book laid out the four metrics, the research methodology, and the 24 capabilities that drive high performance. It became the standard reference for engineering measurement and remains widely cited today.
Accelerate established the classification system that most engineering organizations still use:
| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Deployment Frequency | On demand (multiple per day) | Between once per day and once per week | Between once per week and once per month | Between once per month and once every 6 months |
| Lead Time for Changes | Less than one hour | Between one day and one week | Between one week and one month | Between one month and six months |
| Mean Time to Recovery | Less than one hour | Less than one day | Between one day and one week | Between one week and one month |
| Change Failure Rate | 0-15% | 16-30% | 31-45% | 46-60% |
These benchmarks have been updated over the years through subsequent State of DevOps Reports, but the basic framework remains consistent.
In 2018, DORA was acquired by Google Cloud. The research program continued as part of Google, with the DORA team publishing annual State of DevOps Reports, maintaining the DORA Quick Check tool, and expanding the research into areas like platform engineering, reliability, and -- more recently -- AI's impact on software delivery.
The Google-era reports introduced additional metrics and capabilities. The 2021 report added Reliability as a fifth outcome metric. Subsequent reports explored software supply chain security, platform engineering, and developer experience. But the four core metrics -- DF, LT, MTTR, CFR -- remained the foundation.
What it measures: How often your organization deploys code to production.
Why it matters: Deployment frequency is a proxy for batch size and organizational agility. Teams that deploy frequently are, by implication, working in small batches -- each deployment represents a small, contained change rather than a massive release. Small batches reduce risk, enable faster feedback, and make it easier to isolate the cause of failures.
How to measure it: Count the number of successful deployments to production over a given time period. Report as deployments per day, per week, or per month depending on your organization's cadence.
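A minimal sketch of that computation, assuming you can export one timestamp per successful production deploy from your CD system (the `deploys` list here is a hypothetical stand-in for that export):

```python
from datetime import datetime

# Hypothetical export: one ISO-8601 timestamp per successful production deploy.
deploys = [
    "2026-01-05T09:14:00", "2026-01-05T15:40:00",
    "2026-01-06T11:02:00", "2026-01-08T10:30:00",
]

timestamps = [datetime.fromisoformat(t) for t in deploys]
days_covered = (max(timestamps) - min(timestamps)).days + 1  # inclusive window

per_day = len(timestamps) / days_covered
print(f"{per_day:.2f} deploys/day ({per_day * 7:.1f} per week)")
```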
What good looks like (pre-AI):

- Elite: Multiple deployments per day
- High: Between daily and weekly
- Medium: Between weekly and monthly
- Low: Between monthly and every six months
Nuances: Deployment frequency is influenced by organizational size, regulatory environment, and architecture. A startup deploying a monolith multiple times per day is not the same as an enterprise deploying hundreds of microservices. Context matters, and comparing raw DF numbers across organizations of different scales requires care.
What it measures: The elapsed time from when a developer commits code to when that code is running in production.
Why it matters: Lead time reflects the efficiency of the entire delivery pipeline -- from code commit through code review, CI/CD, testing, staging, and production deployment. Short lead times indicate a healthy, automated pipeline with minimal manual bottlenecks. Long lead times indicate friction -- slow reviews, manual testing, change approval boards, or infrastructure constraints.
How to measure it: Track the timestamp of the first commit for a change and the timestamp of successful production deployment. The difference is lead time. Report as a median or percentile (p50, p90) rather than an average, since outliers can skew averages significantly.
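A sketch of that percentile calculation, assuming you can pair each change's first-commit and production-deploy timestamps (the `changes` list is hypothetical):

```python
from datetime import datetime
from statistics import median, quantiles

# Hypothetical records: (first commit time, production deploy time) per change.
changes = [
    ("2026-01-05T09:00:00", "2026-01-05T11:30:00"),
    ("2026-01-05T10:00:00", "2026-01-06T09:15:00"),
    ("2026-01-06T14:00:00", "2026-01-06T14:45:00"),
    ("2026-01-07T08:30:00", "2026-01-09T16:00:00"),
]

lead_hours = [
    (datetime.fromisoformat(deployed) - datetime.fromisoformat(committed)).total_seconds() / 3600
    for committed, deployed in changes
]

print(f"p50 lead time: {median(lead_hours):.1f}h")
# quantiles(n=10) returns the nine decile cut points; index 8 is the p90.
print(f"p90 lead time: {quantiles(lead_hours, n=10)[8]:.1f}h")
```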
What good looks like (pre-AI):

- Elite: Less than one hour
- High: Between one day and one week
- Medium: Between one week and one month
- Low: Between one month and six months
Nuances: Lead time can be decomposed into sub-components -- coding time, review time, CI time, deployment time -- to pinpoint bottlenecks. This decomposition is increasingly important in AI-native teams where coding time has collapsed but review time has expanded.
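A sketch of that decomposition for a single change, assuming you can pull stage-boundary timestamps from your VCS, review tool, CI system, and CD system (all event names here are hypothetical):

```python
from datetime import datetime

# Hypothetical stage-boundary timestamps for one change.
events = {
    "first_commit": "2026-01-05T09:00:00",
    "pr_opened":    "2026-01-05T09:05:00",
    "pr_approved":  "2026-01-05T16:40:00",
    "ci_passed":    "2026-01-05T17:10:00",
    "deployed":     "2026-01-05T17:30:00",
}
t = {name: datetime.fromisoformat(ts) for name, ts in events.items()}

stages = [
    ("coding", "first_commit", "pr_opened"),
    ("review", "pr_opened",    "pr_approved"),
    ("ci",     "pr_approved",  "ci_passed"),
    ("deploy", "ci_passed",    "deployed"),
]
for name, start, end in stages:
    hours = (t[end] - t[start]).total_seconds() / 3600
    print(f"{name:>6}: {hours:4.1f}h")
```

In this illustrative breakdown, review time dwarfs coding time -- exactly the shift that the single lead-time number hides.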
What it measures: How quickly your team restores service after a production incident.
Why it matters: Failures are inevitable. What distinguishes high-performing teams is not the absence of failure but the speed and effectiveness of recovery. Fast MTTR indicates robust incident response processes, good observability, well-designed rollback mechanisms, and a culture that prioritizes rapid resolution over blame.
How to measure it: For each production incident, record the time from detection to resolution. Report as a median or p90 across incidents over a given period. "Resolution" means the service is restored -- root cause analysis may continue after the clock stops.
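A minimal sketch, assuming an incident log with detection-to-restoration durations already reduced to minutes (the values are hypothetical):

```python
from statistics import median

# Hypothetical incident log: minutes from detection to service restoration.
recovery_minutes = sorted([12, 45, 38, 210, 25, 95])

p90_index = round(0.9 * (len(recovery_minutes) - 1))  # nearest-rank p90
print(f"median recovery: {median(recovery_minutes):.0f} min")
print(f"p90 recovery:    {recovery_minutes[p90_index]} min")
```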
What good looks like:

- Elite: Less than one hour
- High: Less than one day
- Medium: Between one day and one week
- Low: Between one week and one month
Nuances: MTTR is the DORA metric most resistant to AI distortion (see below). Recovery from incidents depends on human judgment, system architecture, observability tooling, and operational processes -- none of which are meaningfully inflated by AI code generation.
What it measures: The percentage of deployments that result in a failure requiring remediation -- a rollback, hotfix, patch, or incident.
Why it matters: Change failure rate reflects code quality, testing effectiveness, and the overall health of the delivery process. A low CFR means that changes are well-tested, well-reviewed, and consistently safe. A high CFR means that the team is shipping defects to production.
How to measure it: Count the number of deployments that caused a failure over a given period, divided by the total number of deployments. "Failure" includes anything that degrades service and requires intervention -- rollbacks, hotfixes, incident declarations.
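A minimal sketch, assuming each deploy is labeled with whether it later required remediation (the `deploy_outcomes` list is hypothetical):

```python
# Hypothetical deploy log: True where the deploy needed remediation
# (rollback, hotfix, or incident), False otherwise.
deploy_outcomes = [False, False, True, False, False, False, True, False]

failures = sum(deploy_outcomes)
cfr = failures / len(deploy_outcomes)
print(f"change failure rate: {cfr:.0%} ({failures}/{len(deploy_outcomes)} deploys)")
```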
What good looks like (pre-AI):

- Elite: 0-15%
- High: 16-30%
- Medium: 31-45%
- Low: 46-60%
Nuances: CFR captures failures that manifest immediately. It does not capture failures that emerge gradually -- code that passes tests and appears stable at deployment but causes subtle performance degradation, maintainability problems, or technical debt accumulation over the following weeks. This limitation becomes significant in the AI era.
Before examining DORA's limitations, it is worth being explicit about what the framework got right -- because it got a great deal right, and these contributions remain valuable.
DORA metrics are grounded in multi-year, multi-thousand-organization research. This is not a consultant's opinion framework or a vendor's marketing construct. The correlations between DORA performance and business outcomes were established through rigorous statistical methods, controlling for confounding variables. This empirical foundation is why DORA became the standard -- and why it deserves respect even as its limitations become apparent.
Before DORA, engineering leaders had no standardized way to discuss delivery performance. Every organization had its own metrics, its own definitions, its own benchmarks. DORA provided a common vocabulary -- and that common vocabulary enabled cross-organizational comparison, industry benchmarking, and a shared understanding of what "good" looks like. This contribution alone justifies DORA's place in the history of software engineering.
DORA metrics measure delivery outcomes, not developer activity. They do not count lines of code or commits or hours worked. They measure how frequently code ships, how quickly it reaches production, how reliably it performs, and how fast the team recovers when it does not. This outcome orientation was ahead of its time and remains a design principle worth preserving.
DORA's four metrics work as a system. You cannot optimize deployment frequency at the expense of change failure rate without the metrics catching the trade-off. You cannot reduce lead time by skipping testing without CFR revealing the consequences. This systems-level design prevents the single-metric gaming that plagued earlier frameworks.
DORA metrics were designed for a world where humans wrote all the code. That world began changing in late 2022 with the mainstream adoption of AI coding tools and has changed fundamentally by 2026. The issue is not that DORA produces wrong numbers. The metrics still measure exactly what they always measured. The problem is that what they measure no longer means what it used to mean.
For the detailed analysis of each metric's distortion, see Why DORA Metrics Break in the AI Era. Below is the summary view.
Mean Time to Recovery remains the most reliable DORA metric in the AI era. Recovery from production incidents depends on human judgment, system architecture, observability, and operational processes. AI code generation does not meaningfully inflate or distort these factors. A team's MTTR is still a genuine signal of operational maturity.
MTTR may eventually be affected as AI tools become more integrated into incident response -- automated rollbacks, AI-assisted diagnosis -- but as of 2026, the metric's integrity is largely intact.
Change Failure Rate retains value but has a growing blind spot. CFR captures failures that manifest immediately after deployment -- outages, crashes, error rate spikes. What it does not capture is the more insidious pattern of AI-generated code that passes tests, deploys successfully, and then degrades quietly over the following days and weeks.
GitClear's research documents this pattern: code churn -- the percentage of code rewritten or deleted within weeks of being committed -- has risen from a pre-AI baseline of 3.3% to 5.7-7.1%. This churn represents code that "worked" at deployment (it would not have increased CFR) but was silently replaced shortly afterward. CFR sees the crash. It does not see the quiet rewrite.
CFR is still worth tracking -- catastrophic failures still matter, and CFR captures them. But it must be paired with code turnover rate to capture the full quality picture.
Deployment frequency is the DORA metric most distorted by AI adoption. When AI tools can generate boilerplate, test scaffolding, configuration changes, and simple features in minutes, deployment counts inflate without a corresponding increase in meaningful output.
A concrete example: a team that deployed five times per week before AI adoption now deploys twenty times per week. By DORA standards, they have jumped from "high" to "elite" performance. But if those fifteen additional deploys are AI-generated boilerplate and configuration changes, the team has not become four times more capable. They have become better at shipping low-complexity work.
Deployment frequency still indicates pipeline health -- the CI/CD infrastructure supports frequent releases. But as a proxy for team capability or business value, it is increasingly unreliable without complexity weighting.
Lead time has a similar problem. When AI generates code in seconds, the coding phase of lead time collapses to near-zero. Lead time drops dramatically -- but the improvement reflects the speed of AI code generation, not the efficiency of the delivery pipeline.
Paradoxically, the component of lead time that matters most in AI-native teams -- review time -- is often increasing even as total lead time decreases. Code is generated faster, but reviewers face higher volumes of code they did not write. The bottleneck has shifted from creation to review, and lead time as a single number obscures this shift rather than revealing it.
The right response to DORA's limitations is not to abandon the framework. It is to extend it. DORA's principles -- outcome-oriented measurement, interconnected metrics, research-backed benchmarks -- remain sound. What needs to change are the specific metrics and the assumptions behind them.
MTTR and CFR both retain value. MTTR is largely unaffected by AI. CFR, while incomplete, still captures catastrophic failures. Keep them as part of the measurement system.
Code Turnover Rate fills the gap that CFR misses: code that deploys successfully but does not survive. Track the percentage of committed code that is rewritten or deleted within 14 and 30 days. Segment by AI-generated vs. human-written. A code turnover rate under 3% indicates durable code. Above 7% indicates significant engineering waste -- regardless of what CFR shows.
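A sketch of the calculation, assuming line-survival data of the kind git-blame-based tools can produce; the record format and field names are hypothetical:

```python
from datetime import date

# Hypothetical per-line survival records: when each line was committed,
# when (if ever) it was rewritten or deleted, and its AI attribution.
lines = [
    {"committed": date(2026, 1, 5), "removed": date(2026, 1, 12), "ai": True},
    {"committed": date(2026, 1, 5), "removed": None,              "ai": True},
    {"committed": date(2026, 1, 6), "removed": date(2026, 2, 20), "ai": False},
    {"committed": date(2026, 1, 7), "removed": None,              "ai": False},
]

def turnover(rows, window_days):
    """Share of committed lines rewritten or deleted within window_days."""
    churned = sum(
        1 for r in rows
        if r["removed"] and (r["removed"] - r["committed"]).days <= window_days
    )
    return churned / len(rows)

for window in (14, 30):
    for label, flag in (("AI", True), ("human", False)):
        subset = [r for r in lines if r["ai"] == flag]
        print(f"{window}-day turnover, {label}: {turnover(subset, window):.0%}")
```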
Rather than counting deployments or measuring time-to-deploy, measure the complexity-weighted value of what was deployed. Complexity-Adjusted Throughput (CAT) assigns difficulty weights to each PR -- Easy (1 point), Medium (3 points), Hard (8 points) -- and tracks weighted output per engineer per week. This metric is resistant to AI inflation by design: AI excels at Easy work, so inflating Easy PR counts does not meaningfully increase CAT scores.
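A minimal sketch of the scoring, assuming each merged PR carries a reviewer-assigned difficulty tier; the weights follow the Easy/Medium/Hard points above, while the PR records are hypothetical:

```python
from collections import defaultdict

# Difficulty weights from the CAT scheme described above.
WEIGHTS = {"easy": 1, "medium": 3, "hard": 8}

# Hypothetical merged-PR records for one week.
prs = [
    {"author": "ada",   "tier": "easy"},
    {"author": "ada",   "tier": "hard"},
    {"author": "grace", "tier": "medium"},
    {"author": "grace", "tier": "easy"},
    {"author": "grace", "tier": "easy"},
]

cat_scores = defaultdict(int)
for pr in prs:
    cat_scores[pr["author"]] += WEIGHTS[pr["tier"]]

for author, score in cat_scores.items():
    print(f"{author}: {score} CAT points this week")
```

Note how the weighting works in this toy data: three mostly-Easy PRs score lower than one Hard PR plus one Easy PR, which is what makes the metric resistant to inflating Easy PR counts.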
Every metric in the extended framework should be segmentable by AI attribution. Track AI code share -- the percentage of committed code that was AI-generated -- and use it as a lens on all other metrics. Code turnover rate for AI code vs. human code. CAT scores for AI-assisted vs. human-only PRs. Without this segmentation, every metric remains ambiguous.
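A sketch of that segmentation, assuming each change record carries an AI-attribution flag; the field names and the illustrative survival metric are hypothetical, and the same split applies to turnover, CAT, or CFR:

```python
# Hypothetical change records carrying an AI-attribution flag.
changes = [
    {"ai": True,  "survived_30d": False},
    {"ai": True,  "survived_30d": True},
    {"ai": False, "survived_30d": True},
    {"ai": False, "survived_30d": True},
]

# AI code share: the percentage of changes that were AI-generated.
ai_share = sum(c["ai"] for c in changes) / len(changes)
print(f"AI code share: {ai_share:.0%}")

# The same metric, viewed through the AI-attribution lens.
for label, flag in (("AI", True), ("human", False)):
    subset = [c for c in changes if c["ai"] == flag]
    survival = sum(c["survived_30d"] for c in subset) / len(subset)
    print(f"30-day survival, {label}: {survival:.0%}")
```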
Innovation Rate -- the ratio of time spent on new features versus bug fixes and maintenance -- captures whether AI adoption is freeing engineers for high-value feature work or generating additional maintenance burden. A team where AI increases velocity but innovation rate declines is running faster on a treadmill.
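A minimal sketch, assuming time-tracking or ticket-category data aggregated into hours per category; the categories and numbers are hypothetical:

```python
# Hypothetical weekly engineering hours by work category.
hours = {"new_features": 120, "bug_fixes": 45, "maintenance": 35}

innovation_rate = hours["new_features"] / sum(hours.values())
print(f"innovation rate: {innovation_rate:.0%}")  # share of time on new work
```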
If your organization does not currently track DORA metrics, start with the basics:

- Count successful production deployments per week (deployment frequency).
- Record commit-to-production time for each change and report the median (lead time).
- Log detection-to-restoration time for every production incident (MTTR).
- Track the share of deployments that required a rollback, hotfix, or incident response (change failure rate).
This baseline gives you a starting point and establishes the measurement discipline needed for more sophisticated metrics.
Treating DORA as a leaderboard. DORA metrics are diagnostic tools, not competition rankings. Teams that optimize for the metrics rather than for the outcomes the metrics are supposed to reflect will game the numbers -- splitting PRs to inflate DF, skipping tests to reduce LT, reclassifying incidents to improve CFR.
Ignoring context. A platform team and a product team have structurally different DORA profiles. A regulated financial services team and a consumer SaaS startup operate under different constraints. Raw DORA comparisons across teams without context are misleading.
Using DORA for individual performance evaluation. DORA metrics are team-level and organizational-level measures. Applying them to individual developers creates perverse incentives and distorted behavior.
Stopping at DORA. In 2026, DORA alone is insufficient for teams with significant AI adoption. If AI generates 30-70% of your committed code, you need the additional metrics described above to get an accurate picture of delivery performance.