DORA metrics are four measures of software delivery performance. They were developed by the DevOps Research and Assessment (DORA) team -- a research program that, starting in 2014, surveyed thousands of engineering organizations to identify the capabilities and practices that predict high performance in software delivery.
The core finding was straightforward and powerful: organizations that excel at software delivery -- shipping code frequently, quickly, reliably, and with fast recovery from failures -- outperform their peers on every business metric that matters. Revenue growth, market share, profitability, and employee satisfaction all correlate with strong software delivery performance.
DORA gave engineering leaders something the industry had never had: a small set of research-backed metrics that could be applied across organizations, compared over time, and used to drive improvement. Before DORA, engineering productivity conversations were dominated by unreliable proxies -- lines of code, story points, velocity charts -- with no empirical basis for what actually predicted performance. DORA changed that.
The four metrics are:

- Deployment Frequency (DF)
- Lead Time for Changes (LT)
- Mean Time to Recovery (MTTR)
- Change Failure Rate (CFR)
The DORA research program was founded by Dr. Nicole Forsgren, Jez Humble, and Gene Kim. Beginning in 2014, the team conducted annual surveys of engineering organizations worldwide, collecting data on practices, capabilities, and outcomes. The research was rigorous -- applying statistical methods to identify predictive relationships, not just correlations, between engineering practices and organizational performance.
The annual State of DevOps Reports -- published in partnership with various sponsors over the years -- became essential reading for engineering leaders. Each report refined the findings, added capabilities, and expanded the evidence base.
The definitive synthesis of DORA's research came in 2018 with the publication of Accelerate: The Science of Lean Software and DevOps by Forsgren, Humble, and Kim. The book laid out the four metrics, the research methodology, and the 24 capabilities that drive high performance. It became the standard reference for engineering measurement and remains widely cited today.
Accelerate established the classification system that most engineering organizations still use:
| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Deployment Frequency | On demand (multiple per day) | Between once per day and once per week | Between once per week and once per month | Between once per month and once every 6 months |
| Lead Time for Changes | Less than one hour | Between one day and one week | Between one week and one month | Between one month and six months |
| Mean Time to Recovery | Less than one hour | Less than one day | Between one day and one week | Between one week and one month |
| Change Failure Rate | 0-15% | 16-30% | 31-45% | 46-60% |
These benchmarks have been updated over the years through subsequent State of DevOps Reports, but the basic framework remains consistent.
In 2018, DORA was acquired by Google Cloud. The research program continued as part of Google, with the DORA team publishing annual State of DevOps Reports, maintaining the DORA Quick Check tool, and expanding the research into areas like platform engineering, reliability, and -- more recently -- AI's impact on software delivery.
The Google-era reports introduced additional metrics and capabilities. The 2021 report added Reliability as a fifth outcome metric. Subsequent reports explored software supply chain security, platform engineering, and developer experience. But the four core metrics -- DF, LT, MTTR, CFR -- remained the foundation.
What it measures: How often your organization deploys code to production.
Why it matters: Deployment frequency is a proxy for batch size and organizational agility. Teams that deploy frequently are, by implication, working in small batches -- each deployment represents a small, contained change rather than a massive release. Small batches reduce risk, enable faster feedback, and make it easier to isolate the cause of failures.
How to measure it: Count the number of successful deployments to production over a given time period. Report as deployments per day, per week, or per month depending on your organization's cadence.
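A minimal sketch of that computation, assuming you can export one timestamp per successful production deploy from your CD system (the `deploys` list here is a hypothetical stand-in for that export):

```python
from datetime import datetime

# Hypothetical export: one ISO-8601 timestamp per successful production deploy.
deploys = [
    "2026-01-05T09:14:00", "2026-01-05T15:40:00",
    "2026-01-06T11:02:00", "2026-01-08T10:30:00",
]

timestamps = [datetime.fromisoformat(t) for t in deploys]
days_covered = (max(timestamps) - min(timestamps)).days + 1  # inclusive window

per_day = len(timestamps) / days_covered
print(f"{per_day:.2f} deploys/day ({per_day * 7:.1f} per week)")
```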
What good looks like (pre-AI):

- Elite: Multiple deployments per day
- High: Between daily and weekly
- Medium: Between weekly and monthly
- Low: Between monthly and every six months
Nuances: Deployment frequency is influenced by organizational size, regulatory environment, and architecture. A startup deploying a monolith multiple times per day is not the same as an enterprise deploying hundreds of microservices. Context matters, and comparing raw DF numbers across organizations of different scales requires care.
What it measures: The elapsed time from when a developer commits code to when that code is running in production.
Why it matters: Lead time reflects the efficiency of the entire delivery pipeline -- from code commit through code review, CI/CD, testing, staging, and production deployment. Short lead times indicate a healthy, automated pipeline with minimal manual bottlenecks. Long lead times indicate friction -- slow reviews, manual testing, change approval boards, or infrastructure constraints.
How to measure it: Track the timestamp of the first commit for a change and the timestamp of successful production deployment. The difference is lead time. Report as a median or percentile (p50, p90) rather than an average, since outliers can skew averages significantly.
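A sketch of that percentile calculation, assuming you can pair each change's first-commit and production-deploy timestamps (the `changes` list is hypothetical):

```python
from datetime import datetime
from statistics import median, quantiles

# Hypothetical records: (first commit time, production deploy time) per change.
changes = [
    ("2026-01-05T09:00:00", "2026-01-05T11:30:00"),
    ("2026-01-05T10:00:00", "2026-01-06T09:15:00"),
    ("2026-01-06T14:00:00", "2026-01-06T14:45:00"),
    ("2026-01-07T08:30:00", "2026-01-09T16:00:00"),
]

lead_hours = [
    (datetime.fromisoformat(deployed) - datetime.fromisoformat(committed)).total_seconds() / 3600
    for committed, deployed in changes
]

print(f"p50 lead time: {median(lead_hours):.1f}h")
# quantiles(n=10) returns the nine decile cut points; index 8 is the p90.
print(f"p90 lead time: {quantiles(lead_hours, n=10)[8]:.1f}h")
```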
What good looks like (pre-AI):

- Elite: Less than one hour
- High: Between one day and one week
- Medium: Between one week and one month
- Low: Between one month and six months
Nuances: Lead time can be decomposed into sub-components -- coding time, review time, CI time, deployment time -- to pinpoint bottlenecks. This decomposition is increasingly important in AI-native teams where coding time has collapsed but review time has expanded.
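A sketch of that decomposition for a single change, assuming you can pull stage-boundary timestamps from your VCS, review tool, CI system, and CD system (all event names here are hypothetical):

```python
from datetime import datetime

# Hypothetical stage-boundary timestamps for one change.
events = {
    "first_commit": "2026-01-05T09:00:00",
    "pr_opened":    "2026-01-05T09:05:00",
    "pr_approved":  "2026-01-05T16:40:00",
    "ci_passed":    "2026-01-05T17:10:00",
    "deployed":     "2026-01-05T17:30:00",
}
t = {name: datetime.fromisoformat(ts) for name, ts in events.items()}

stages = [
    ("coding", "first_commit", "pr_opened"),
    ("review", "pr_opened",    "pr_approved"),
    ("ci",     "pr_approved",  "ci_passed"),
    ("deploy", "ci_passed",    "deployed"),
]
for name, start, end in stages:
    hours = (t[end] - t[start]).total_seconds() / 3600
    print(f"{name:>6}: {hours:4.1f}h")
```

In this illustrative breakdown, review time dwarfs coding time -- exactly the shift that the single lead-time number hides.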
What it measures: How quickly your team restores service after a production incident.
Why it matters: Failures are inevitable. What distinguishes high-performing teams is not the absence of failure but the speed and effectiveness of recovery. Fast MTTR indicates robust incident response processes, good observability, well-designed rollback mechanisms, and a culture that prioritizes rapid resolution over blame.
How to measure it: For each production incident, record the time from detection to resolution. Report as a median or p90 across incidents over a given period. "Resolution" means the service is restored -- root cause analysis may continue after the clock stops.
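A minimal sketch, assuming an incident log with detection-to-restoration durations already reduced to minutes (the values are hypothetical):

```python
from statistics import median

# Hypothetical incident log: minutes from detection to service restoration.
recovery_minutes = sorted([12, 45, 38, 210, 25, 95])

p90_index = round(0.9 * (len(recovery_minutes) - 1))  # nearest-rank p90
print(f"median recovery: {median(recovery_minutes):.0f} min")
print(f"p90 recovery:    {recovery_minutes[p90_index]} min")
```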
What good looks like:

- Elite: Less than one hour
- High: Less than one day
- Medium: Between one day and one week
- Low: Between one week and one month
Nuances: MTTR is the DORA metric most resistant to AI distortion (see below). Recovery from incidents depends on human judgment, system architecture, observability tooling, and operational processes -- none of which are meaningfully inflated by AI code generation.
What it measures: The percentage of deployments that result in a failure requiring remediation -- a rollback, hotfix, patch, or incident.
Why it matters: Change failure rate reflects code quality, testing effectiveness, and the overall health of the delivery process. A low CFR means that changes are well-tested, well-reviewed, and consistently safe. A high CFR means that the team is shipping defects to production.
How to measure it: Count the number of deployments that caused a failure over a given period, divided by the total number of deployments. "Failure" includes anything that degrades service and requires intervention -- rollbacks, hotfixes, incident declarations.
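A minimal sketch, assuming each deploy is labeled with whether it later required remediation (the `deploy_outcomes` list is hypothetical):

```python
# Hypothetical deploy log: True where the deploy needed remediation
# (rollback, hotfix, or incident), False otherwise.
deploy_outcomes = [False, False, True, False, False, False, True, False]

failures = sum(deploy_outcomes)
cfr = failures / len(deploy_outcomes)
print(f"change failure rate: {cfr:.0%} ({failures}/{len(deploy_outcomes)} deploys)")
```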
What good looks like (pre-AI):

- Elite: 0-15%
- High: 16-30%
- Medium: 31-45%
- Low: 46-60%
Nuances: CFR captures failures that manifest immediately. It does not capture failures that emerge gradually -- code that passes tests and appears stable at deployment but causes subtle performance degradation, maintainability problems, or technical debt accumulation over the following weeks. This limitation becomes significant in the AI era.
Before examining DORA's limitations, it is worth being explicit about what the framework got right -- because it got a great deal right, and these contributions remain valuable.
DORA metrics are grounded in multi-year, multi-thousand-organization research. This is not a consultant's opinion framework or a vendor's marketing construct. The correlations between DORA performance and business outcomes were established through rigorous statistical methods, controlling for confounding variables. This empirical foundation is why DORA became the standard -- and why it deserves respect even as its limitations become apparent.
Before DORA, engineering leaders had no standardized way to discuss delivery performance. Every organization had its own metrics, its own definitions, its own benchmarks. DORA provided a common vocabulary -- and that common vocabulary enabled cross-organizational comparison, industry benchmarking, and a shared understanding of what "good" looks like. This contribution alone justifies DORA's place in the history of software engineering.
DORA metrics measure delivery outcomes, not developer activity. They do not count lines of code or commits or hours worked. They measure how frequently code ships, how quickly it reaches production, how reliably it performs, and how fast the team recovers when it does not. This outcome orientation was ahead of its time and remains a design principle worth preserving.
DORA's four metrics work as a system. You cannot optimize deployment frequency at the expense of change failure rate without the metrics catching the trade-off. You cannot reduce lead time by skipping testing without CFR revealing the consequences. This systems-level design prevents the single-metric gaming that plagued earlier frameworks.
DORA metrics were designed for a world where humans wrote all the code. That world began changing in late 2022 with the mainstream adoption of AI coding tools and has changed fundamentally by 2026. The issue is not that DORA produces wrong numbers. The metrics still measure exactly what they always measured. The problem is that what they measure no longer means what it used to mean.
For the detailed analysis of each metric's distortion, see Why DORA Metrics Break in the AI Era. Below is the summary view.
Mean Time to Recovery remains the most reliable DORA metric in the AI era. Recovery from production incidents depends on human judgment, system architecture, observability, and operational processes. AI code generation does not meaningfully inflate or distort these factors. A team's MTTR is still a genuine signal of operational maturity.
MTTR may eventually be affected as AI tools become more integrated into incident response -- automated rollbacks, AI-assisted diagnosis -- but as of 2026, the metric's integrity is largely intact.
Change Failure Rate retains value but has a growing blind spot. CFR captures failures that manifest immediately after deployment -- outages, crashes, error rate spikes. What it does not capture is the more insidious pattern of AI-generated code that passes tests, deploys successfully, and then degrades quietly over the following days and weeks.
GitClear's research documents this pattern: code churn -- the percentage of code rewritten or deleted within weeks of being committed -- has risen from a pre-AI baseline of 3.3% to 5.7-7.1%. This churn represents code that "worked" at deployment (it would not have increased CFR) but was silently replaced shortly afterward. CFR sees the crash. It does not see the quiet rewrite.
CFR is still worth tracking -- catastrophic failures still matter, and CFR captures them. But it must be paired with code turnover rate to capture the full quality picture.
Deployment frequency is the DORA metric most distorted by AI adoption. When AI tools can generate boilerplate, test scaffolding, configuration changes, and simple features in minutes, deployment counts inflate without a corresponding increase in meaningful output.
A concrete example: a team that deployed five times per week before AI adoption now deploys twenty times per week. By DORA standards, they have jumped from "high" to "elite" performance. But if those fifteen additional deploys are AI-generated boilerplate and configuration changes, the team has not become four times more capable. They have become better at shipping low-complexity work.
Deployment frequency still indicates pipeline health -- the CI/CD infrastructure supports frequent releases. But as a proxy for team capability or business value, it is increasingly unreliable without complexity weighting.
Lead time has a similar problem. When AI generates code in seconds, the coding phase of lead time collapses to near-zero. Lead time drops dramatically -- but the improvement reflects the speed of AI code generation, not the efficiency of the delivery pipeline.
Paradoxically, the component of lead time that matters most in AI-native teams -- review time -- is often increasing even as total lead time decreases. Code is generated faster, but reviewers face higher volumes of code they did not write. The bottleneck has shifted from creation to review, and lead time as a single number obscures this shift rather than revealing it.
The right response to DORA's limitations is not to abandon the framework. It is to extend it. DORA's principles -- outcome-oriented measurement, interconnected metrics, research-backed benchmarks -- remain sound. What needs to change are the specific metrics and the assumptions behind them.
MTTR and CFR both retain value. MTTR is largely unaffected by AI. CFR, while incomplete, still captures catastrophic failures. Keep them as part of the measurement system.
Code Turnover Rate fills the gap that CFR misses: code that deploys successfully but does not survive. Track the percentage of committed code that is rewritten or deleted within 14 and 30 days. Segment by AI-generated vs. human-written. A code turnover rate under 3% indicates durable code. Above 7% indicates significant engineering waste -- regardless of what CFR shows.
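A sketch of the calculation, assuming line-survival data of the kind git-blame-based tools can produce; the record format and field names are hypothetical:

```python
from datetime import date

# Hypothetical per-line survival records: when each line was committed,
# when (if ever) it was rewritten or deleted, and its AI attribution.
lines = [
    {"committed": date(2026, 1, 5), "removed": date(2026, 1, 12), "ai": True},
    {"committed": date(2026, 1, 5), "removed": None,              "ai": True},
    {"committed": date(2026, 1, 6), "removed": date(2026, 2, 20), "ai": False},
    {"committed": date(2026, 1, 7), "removed": None,              "ai": False},
]

def turnover(rows, window_days):
    """Share of committed lines rewritten or deleted within window_days."""
    churned = sum(
        1 for r in rows
        if r["removed"] and (r["removed"] - r["committed"]).days <= window_days
    )
    return churned / len(rows)

for window in (14, 30):
    for label, flag in (("AI", True), ("human", False)):
        subset = [r for r in lines if r["ai"] == flag]
        print(f"{window}-day turnover, {label}: {turnover(subset, window):.0%}")
```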
Rather than counting deployments or measuring time-to-deploy, measure the complexity-weighted value of what was deployed. Complexity-Adjusted Throughput (CAT) assigns difficulty weights to each PR -- Easy (1 point), Medium (3 points), Hard (8 points) -- and tracks weighted output per engineer per week. This metric is resistant to AI inflation by design: AI excels at Easy work, so inflating Easy PR counts does not meaningfully increase CAT scores.
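A minimal sketch of the scoring, assuming each merged PR carries a reviewer-assigned difficulty tier; the weights follow the Easy/Medium/Hard points above, while the PR records are hypothetical:

```python
from collections import defaultdict

# Difficulty weights from the CAT scheme described above.
WEIGHTS = {"easy": 1, "medium": 3, "hard": 8}

# Hypothetical merged-PR records for one week.
prs = [
    {"author": "ada",   "tier": "easy"},
    {"author": "ada",   "tier": "hard"},
    {"author": "grace", "tier": "medium"},
    {"author": "grace", "tier": "easy"},
    {"author": "grace", "tier": "easy"},
]

cat_scores = defaultdict(int)
for pr in prs:
    cat_scores[pr["author"]] += WEIGHTS[pr["tier"]]

for author, score in cat_scores.items():
    print(f"{author}: {score} CAT points this week")
```

Note how the weighting works in this toy data: three mostly-Easy PRs score lower than one Hard PR plus one Easy PR, which is what makes the metric resistant to inflating Easy PR counts.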
Every metric in the extended framework should be segmentable by AI attribution. Track AI code share -- the percentage of committed code that was AI-generated -- and use it as a lens on all other metrics. Code turnover rate for AI code vs. human code. CAT scores for AI-assisted vs. human-only PRs. Without this segmentation, every metric remains ambiguous.
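A sketch of that segmentation, assuming each change record carries an AI-attribution flag; the field names and the illustrative survival metric are hypothetical, and the same split applies to turnover, CAT, or CFR:

```python
# Hypothetical change records carrying an AI-attribution flag.
changes = [
    {"ai": True,  "survived_30d": False},
    {"ai": True,  "survived_30d": True},
    {"ai": False, "survived_30d": True},
    {"ai": False, "survived_30d": True},
]

# AI code share: the percentage of changes that were AI-generated.
ai_share = sum(c["ai"] for c in changes) / len(changes)
print(f"AI code share: {ai_share:.0%}")

# The same metric, viewed through the AI-attribution lens.
for label, flag in (("AI", True), ("human", False)):
    subset = [c for c in changes if c["ai"] == flag]
    survival = sum(c["survived_30d"] for c in subset) / len(subset)
    print(f"30-day survival, {label}: {survival:.0%}")
```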
Innovation Rate -- the ratio of time spent on new features versus bug fixes and maintenance -- captures whether AI adoption is freeing engineers for high-value feature work or generating additional maintenance burden. A team where AI increases velocity but innovation rate declines is running faster on a treadmill.
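A minimal sketch, assuming time-tracking or ticket-category data aggregated into hours per category; the categories and numbers are hypothetical:

```python
# Hypothetical weekly engineering hours by work category.
hours = {"new_features": 120, "bug_fixes": 45, "maintenance": 35}

innovation_rate = hours["new_features"] / sum(hours.values())
print(f"innovation rate: {innovation_rate:.0%}")  # share of time on new work
```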
If your organization does not currently track DORA metrics, start with the basics:

- Count successful production deployments per week (deployment frequency).
- Record commit-to-production time for each change and report the median (lead time).
- Log detection-to-restoration time for every production incident (MTTR).
- Track the share of deployments that required a rollback, hotfix, or incident response (change failure rate).
This baseline gives you a starting point and establishes the measurement discipline needed for more sophisticated metrics.
Treating DORA as a leaderboard. DORA metrics are diagnostic tools, not competition rankings. Teams that optimize for the metrics rather than for the outcomes the metrics are supposed to reflect will game the numbers -- splitting PRs to inflate DF, skipping tests to reduce LT, reclassifying incidents to improve CFR.
Ignoring context. A platform team and a product team have structurally different DORA profiles. A regulated financial services team and a consumer SaaS startup operate under different constraints. Raw DORA comparisons across teams without context are misleading.
Using DORA for individual performance evaluation. DORA metrics are team-level and organizational-level measures. Applying them to individual developers creates perverse incentives and distorted behavior.
Stopping at DORA. In 2026, DORA alone is insufficient for teams with significant AI adoption. If AI generates 30-70% of your committed code, you need the additional metrics described above to get an accurate picture of delivery performance.