DORA and SPACE are foundational frameworks. They shaped how an entire generation of engineering leaders think about productivity, and for good reason. When those frameworks were developed, every line of code in your codebase was written by a human. Every commit reflected human effort. Every PR represented a human decision.
That assumption no longer holds.
In 2026, AI coding tools generate 30-70% of committed code in high-adoption organizations. GitHub Copilot, Cursor, and Claude Code don't just autocomplete variable names -- they generate entire functions, test suites, and boilerplate modules. This is a structural shift in how code gets produced, and it requires a structural shift in how productivity gets measured.
What makes this urgent is not just that traditional frameworks produce inaccurate numbers. It is that the outcome variance in AI-assisted engineering is enormous. Two teams can adopt the same tools, spend similar budgets, and get radically different results: one converts AI output into durable, delivered value, while the other converts it into churn and rework.
Both teams look identical on traditional metrics. PRs are up. Deployment frequency is up. Activity is through the roof. The difference between them is invisible to DORA, SPACE, or any volume-based measurement system.
"Feels faster, isn't" is the most expensive failure mode in AI-assisted engineering. Code churn has nearly doubled since AI coding adoption went mainstream (3.3% to 5.7-7.1%, per GitClear). AI tool costs now range from trivial ($20-60/month for inline completion) to substantial ($200-$2,000+/month per engineer for agentic tools). When the investment is this large and the outcome variance is this wide, measurement is not a reporting exercise -- it is the mechanism that determines which outcome you get.
DORA's four metrics -- Deployment Frequency, Lead Time for Changes, Mean Time to Recovery, and Change Failure Rate -- were designed to measure software delivery performance. They served us well. But three of the four are now distorted by AI-generated code: Deployment Frequency and Lead Time for Changes are inflated by AI-generated volume, and Change Failure Rate misses the code churn problem entirely.
Only MTTR remains relatively unaffected, because recovery from incidents still depends primarily on human judgment and system architecture.
The SPACE framework introduced a valuable multi-dimensional approach: Satisfaction, Performance, Activity, Communication, and Efficiency. Its insight -- that productivity is multidimensional and that developer experience matters -- remains correct.
But SPACE's Activity dimension (commits, PRs, lines of code) directly rewards volume. When AI can produce ten PRs in the time a human writes one, Activity metrics become noise. SPACE's limitations in the AI era are not a failure of the framework's principles. They are a consequence of building on assumptions that no longer hold.
GitClear's analysis of over one billion lines of code found that code churn -- the percentage of code that is rewritten or deleted within weeks of being committed -- has risen from a pre-AI baseline of 3.3% in 2021 to between 5.7% and 7.1% by 2024[^1]. This is not a marginal increase. It represents a near-doubling of wasted engineering effort, and it aligns precisely with the timeline of widespread AI coding tool adoption.
The implication is straightforward: teams are producing more code, but a growing share of that code does not survive. Any productivity framework that counts output without measuring durability is producing misleading results.
Before measuring productivity, engineering leaders need a model for understanding what they are actually trying to achieve with AI tools. The AI Impact Hierarchy describes five ascending levels of AI impact, each building on the one below:
| Level | Name | Core Question | What Most Orgs Do |
|---|---|---|---|
| 1 | Adoption | Are developers using AI tools? | Measure this and stop |
| 2 | Engagement | How deeply are they using AI? | Rarely measured |
| 3 | Productivity | Is AI accelerating real output? | Measured with inflated metrics |
| 4 | Quality | Is AI code maintainable? | Almost never measured |
| 5 | Business Value | Is the AI investment paying off? | Guessed, not measured |
Most organizations measure Level 1 -- adoption -- and declare success. They know that 60% of their developers have used Copilot this month. What they do not know is whether those developers are using it for trivial autocompletions or complex feature work (Level 2), whether it is genuinely accelerating delivery (Level 3), whether the code it produces survives beyond the first sprint (Level 4), or whether the investment is generating a positive return (Level 5).
The Developer AI Impact Framework is designed to measure across all five levels. Each pillar corresponds to one or more levels of the hierarchy, ensuring that organizations do not confuse adoption theater with actual productivity gains. For a deeper exploration of this maturity model, see What Is the AI Impact Hierarchy?.
What it measures: Whether developers are actually using AI coding tools, how frequently, and how usage distributes across the team.
Adoption is the prerequisite. If developers are not using AI tools, nothing else in this framework matters. But adoption alone tells you almost nothing about productivity. A team with 70% weekly active users might be seeing transformative gains, or it might be seeing 70% of developers accepting low-quality autocomplete suggestions.
Key Metrics:
Red Flags:
Benchmarks:
| Metric | Industry Average | Top Quartile | Target (90 Days) |
|---|---|---|---|
| WAU (weekly active users) | 30-40% | 60-70% | >50% |
| DAU (daily active users) | 15-20% | 35-45% | >25% |
| Power User % (daily, multi-mode) | 10-15% | 25-35% | >20% |
| Non-User % | 40-50% | 15-25% | <30% |
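Concretely, WAU and DAU can be derived from a raw usage-event export. A minimal sketch, assuming a hypothetical `(user_id, day)` event schema -- real admin dashboards (Cursor, Copilot, Claude Code) expose this data in vendor-specific formats:

```python
from datetime import date, timedelta

def active_user_rates(usage_events, team_size, as_of):
    """Compute WAU and DAU from raw tool-usage events.

    `usage_events` is a hypothetical list of (user_id, day) tuples
    exported from an AI tool's admin dashboard; adapt to your schema.
    """
    week_start = as_of - timedelta(days=7)
    weekly = {user for user, day in usage_events if week_start < day <= as_of}
    daily = {user for user, day in usage_events if day == as_of}
    return {
        "wau": len(weekly) / team_size,         # fraction active in trailing 7 days
        "dau": len(daily) / team_size,          # fraction active on `as_of`
        "non_users": 1 - len(weekly) / team_size,
    }

# Toy data: 3 of 5 engineers active this week, 1 of them today
events = [
    ("alice", date(2026, 1, 5)),
    ("alice", date(2026, 1, 4)),
    ("bob", date(2026, 1, 3)),
    ("carol", date(2026, 1, 2)),
]
rates = active_user_rates(events, team_size=5, as_of=date(2026, 1, 5))
# rates["wau"] == 0.6, rates["dau"] == 0.2
```

Deduplicating by user (the set comprehensions) matters: a power user generating hundreds of events per day still counts once toward WAU.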
Adoption measurement is the entry point, but it is not the destination. Once you know developers are using AI tools, the next question is what percentage of your codebase they are actually producing with those tools. That is Pillar 2.
What it measures: The percentage of committed code that was generated or substantially assisted by AI tools, measured at the line, commit, and pull request level.
AI Code Share answers a question that adoption metrics cannot: how much of your actual codebase is AI-produced? A team can have 70% WAU but only 10% AI-assisted code (indicating shallow usage), or 40% WAU but 50% AI-assisted code (indicating a smaller group of deeply effective power users).
Key Metrics:
Red Flags:
Benchmarks:
| Metric | Industry Average | High-Adoption Orgs | Power Users |
|---|---|---|---|
| AI-Assisted PRs % | 20-35% | 50-70% | 80-95% |
| AI-Assisted Lines % | 15-25% | 40-60% | Up to 90%[^2] |
| AI-Assisted Commits % | 15-25% | 35-55% | 70-85% |
The distinction between AI-assisted lines and AI-assisted PRs matters. A PR might contain 200 AI-generated boilerplate lines and 20 carefully crafted human lines. The PR is AI-assisted, but the high-value work may still be human. This is why AI Code Share must be interpreted alongside Pillar 3 (velocity) and Pillar 4 (quality), never in isolation.
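One way to compute these shares, sketched under the assumption that commits already carry an AI-attribution flag. The attribution itself must come from tool telemetry or commit trailers; Git alone does not record it, and the dict schema here is hypothetical:

```python
def ai_code_share(commits):
    """Compute AI-assisted share at both commit and line level.

    Each commit is a hypothetical dict:
      {"ai_assisted": bool, "lines_added": int}
    """
    total_commits = len(commits)
    total_lines = sum(c["lines_added"] for c in commits)
    ai_commits = [c for c in commits if c["ai_assisted"]]
    return {
        "ai_commit_pct": 100 * len(ai_commits) / total_commits,
        "ai_line_pct": 100 * sum(c["lines_added"] for c in ai_commits) / total_lines,
    }

commits = [
    {"ai_assisted": True, "lines_added": 200},   # AI boilerplate
    {"ai_assisted": False, "lines_added": 20},   # hand-written core logic
    {"ai_assisted": True, "lines_added": 80},
    {"ai_assisted": False, "lines_added": 100},
]
share = ai_code_share(commits)  # 50% of commits, 70% of lines
```

Note how the two levels diverge even in this toy example: half the commits are AI-assisted, but they carry 70% of the lines -- exactly the line-versus-PR distinction discussed above.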
What it measures: Engineering output weighted by the complexity of the work delivered, segmented by AI-assisted versus human-only contributions.
Raw velocity metrics -- PRs merged, commits shipped, lines of code written -- are now unreliable proxies for engineering output. When AI can generate a 500-line PR in minutes, counting PRs tells you how prolific your AI tools are, not how productive your engineers are.
Complexity-Adjusted Throughput (CAT) addresses this by weighting each unit of work according to its complexity:
| Complexity Level | Points | Examples |
|---|---|---|
| Easy | 1 pt | Config changes, copy updates, simple boilerplate, dependency bumps |
| Medium | 3 pts | Feature additions with moderate logic, API integrations, test suites |
| Hard | 8 pts | Architectural changes, performance optimizations, complex algorithms, system migrations |
CAT per engineer per week becomes the primary velocity metric. It reflects the value of work delivered, not just the volume.
Key Metrics:
Red Flags:
Benchmarks:
| Metric | Industry Average | Top Quartile | Notes |
|---|---|---|---|
| CAT per Engineer (weekly) | 8 pts | 14+ pts | Measured across all merged PRs |
| AI-Assisted CAT Share | 25-35% | 50-65% | % of total CAT from AI-assisted work |
| Cycle Time (overall) | 4-6 days | 1-2 days | Commit to deploy |
| Cycle Time (to first review) | 18-24 hours | 4-8 hours | Commit to first human review |
A team averaging 8 CAT points per engineer per week with a rising trend is performing at or above industry average. A team at 14+ points is in the top quartile. But CAT only tells you about output speed. It says nothing about whether that output survives. That is where Pillar 4 comes in.
What it measures: The durability of AI-generated code -- whether it persists in the codebase or gets rewritten, reverted, or deleted shortly after being merged.
Quality is the pillar that most organizations miss entirely. They measure how much code AI produces and how fast it ships, but they do not ask the critical follow-up question: does it stick?
Code Turnover Rate is the primary quality metric. It measures the percentage of committed code that is substantially rewritten or deleted within 30 or 90 days of being merged, tracked separately for AI-generated and human-written code.
Key Metrics:
Red Flags:
Benchmarks:
| Metric | Pre-AI Baseline | Industry Average (2026) | Healthy Target |
|---|---|---|---|
| Overall Code Turnover (30D) | 3.3%[^1] | 5.7-7.1% | <12% |
| AI Code Turnover (30D) | N/A | 12-18% | <15% |
| Human Code Turnover (30D) | 3.3% | 4-6% | <8% |
| AI-to-Human Turnover Ratio | N/A | 1.8-2.5x | <1.5x |
| Innovation Rate | 50-60% | 45-55% | >50% |
The code churn data from GitClear confirms a pattern that many engineering leaders intuitively suspect: AI tools are producing more code, but a significant portion of that code is not production-durable. This does not mean AI tools are failing. It means that quality measurement must be built into any productivity framework from day one, not added as an afterthought.
What it measures: The financial return on AI tool investment, accounting for both the value generated and the hidden costs of code rework.
AI tool costs are no longer trivial. Seat-based licenses for inline completion tools (Copilot, Cursor) run $20-60 per engineer per month. But agentic AI tools -- Claude Code, Cursor with high-autonomy agents, custom LLM pipelines -- introduce usage-based token costs that range from $200 to $2,000+ per engineer per month depending on usage intensity. A 50-person team can easily spend $10,000-$50,000 per month on AI tooling, making rigorous ROI justification essential rather than optional.
Engineering leaders are under increasing pressure to justify this spend. Pillar 5 provides the framework for doing so honestly -- accounting not just for time saved but also for the real cost structure of modern AI tooling and the rework costs introduced by low-quality AI-generated code.
Key Metrics:
ROI Formula:
Net ROI = (Productive Value of Time Saved - Rework Cost from Code Turnover) / Total AI Tool Cost
For a deeper guide on structuring this calculation, see How to Measure AI Coding Tool ROI.
Red Flags:
Benchmarks:
| Metric | Industry Average | Top Quartile | Red Flag |
|---|---|---|---|
| Net ROI Multiplier | 2.5-3.5x | 4-6x | <2x after 90 days |
| Time Saved per Engineer | 4-6 hrs/week | 8-12 hrs/week | <2 hrs/week |
| Total AI Cost per Engineer (monthly) | $200-600 | $200-600 | >$1,000 without proportional gain |
| Rework Cost as % of Value | 15-25% | 5-10% | >30% |
Cost Structure Reality:
AI tool costs now fall into three tiers, and most teams use tools from more than one:
| Tier | Examples | Typical Cost / Engineer / Month |
|---|---|---|
| Inline completion | GitHub Copilot, Cursor Pro | $20-60 (seat-based) |
| Chat + agentic assist | Cursor Business, Windsurf | $40-100 (seat-based) |
| High-autonomy agentic | Claude Code, custom LLM pipelines | $200-2,000+ (usage-based) |
Engineers using agentic tools heavily can generate $500-$2,000/month in token costs alone. This is not a problem -- agentic tools deliver disproportionately more value on Medium and Hard work -- but it means the cost denominator in your ROI calculation must reflect actual spend, not list price.
Example Calculation:
Consider a 50-person engineering team using a mix of inline completion and agentic AI tools:
| Component | Value |
|---|---|
| Inline completion licenses | 50 engineers x $40/month = $2,000/month |
| Agentic tool usage (token costs) | 50 engineers x $400/month avg = $20,000/month |
| Implementation overhead (training, admin, integration) | $5,000/month |
| Total cost | $27,000/month |
| Time saved (conservative) | 50 engineers x 5 hrs/week x $85/hr loaded cost = $21,250/week = $85,000/month |
| Productivity conversion (not all saved time is productive) | 60% utilization factor = $51,000/month |
| Less: Rework cost from AI code turnover | 15% of value = -$7,650/month |
| Net value | $43,350/month |
| ROI | $43,350 / $27,000 = ~1.6x |
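The table's arithmetic can be reproduced directly with the Pillar 5 formula. A sketch; every input is one of the assumptions stated above (4 weeks/month, $85/hr loaded cost, 60% utilization) and should be replaced with your own telemetry and survey data:

```python
def net_roi(engineers, hours_saved_per_week, loaded_rate, utilization,
            rework_pct, monthly_cost):
    """Net ROI = (productive value of time saved - rework cost) / total cost.

    Mirrors the worked example above. Uses 4 weeks/month to match the table.
    """
    gross_value = engineers * hours_saved_per_week * loaded_rate * 4  # $/month
    productive_value = gross_value * utilization   # not all saved time is productive
    rework_cost = productive_value * rework_pct    # deduction for AI code turnover
    return (productive_value - rework_cost) / monthly_cost

# The 50-person example: $27,000/month total cost (licenses + tokens + overhead)
baseline = net_roi(engineers=50, hours_saved_per_week=5, loaded_rate=85,
                   utilization=0.60, rework_pct=0.15, monthly_cost=27_000)
# baseline ≈ 1.6x
```

Because the function takes every assumption as a parameter, sensitivity analysis is one call away: vary `hours_saved_per_week` or `rework_pct` and watch the multiplier move.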
This is the honest math. At 1.6x, AI tools are generating positive returns -- but not the 10x that vendor marketing claims. And notice how sensitive the calculation is to its inputs: if your team saves 7 hours per week instead of 5, ROI rises to roughly 2.2x. If your rework rate drops from 15% to 8% through better prompt engineering and review standards, ROI climbs further. If token costs run higher than $400/month average because engineers are using agentic tools for low-value tasks, ROI drops below breakeven.
This sensitivity is exactly why Pillar 5 matters. The difference between a 1.6x and a 4x ROI is not the tool -- it is how well your team uses it, and whether you have the quality gates (Pillar 4) to prevent rework from eating the gains.
The key insight: organizations that skip quality measurement systematically overstate ROI by 20-40% because they undercount costs (ignoring token spend) and do not account for rework. Including both the real cost structure and the rework deduction is what separates credible ROI analysis from advocacy math. See AI Value Realization Score for how to distill this into a single executive metric.
Telemetry shows what is happening. Surveys show why.
A team might show declining CAT scores, and telemetry alone can confirm the trend. But only a developer survey can reveal whether the root cause is tooling friction, poor prompt engineering skills, organizational resistance, or a change in the type of work being assigned. For a comprehensive guide to structuring these surveys, see Developer Experience Surveys for AI-Native Teams.
Five Survey Dimensions:
| Dimension | What It Captures | Example Question |
|---|---|---|
| Perceived Time Savings | Developer's estimate of hours saved | "How many hours did AI tools save you this week?" |
| Post-Acceptance Edit Rate | How often AI output needs reworking | "How often do you substantially edit AI suggestions before committing?" |
| Task Fit | Which tasks AI helps with most | "For which types of work do AI tools help you the most?" |
| Adoption Barriers | What prevents deeper usage | "What prevents you from using AI tools more effectively?" |
| AI Tool NPS | Overall sentiment and recommendation intent | "How likely are you to recommend this AI tool to a colleague?" (0-10) |
Survey Cadence:
| Cadence | Questions | Time to Complete | Purpose |
|---|---|---|---|
| Biweekly pulse | 2-3 | <30 seconds | Track time savings and sentiment trends |
| Monthly check-in | 5-7 | 2-3 minutes | Deeper read on adoption barriers and task fit |
| Quarterly diagnostic | 10-15 | 5-7 minutes | Comprehensive assessment with freeform feedback |
Survey data pairs directly with telemetry. A developer who reports saving 8 hours per week but whose code turnover rate is 25% is generating volume, not value. A developer who reports saving only 2 hours but whose turnover rate is 3% is using AI precisely and effectively. Neither data source tells the full story alone.
Engineering leaders need a single number they can report to executives that answers the question: "Is our AI investment actually working?"
The AI Value Realization Score combines telemetry and survey data into one composite metric:
AI Value Realization Score = (WAU Rate x 0.2) + (AI-Assisted Code Rate x 0.2)
+ (Perceived Time Savings Index x 0.3) + (Quality Score x 0.3)
Where:
Quality Score = 100 - (Turnover Rate x 2) - (Post-Acceptance Edit Rate x 1.5)
Component Breakdown:
| Component | Weight | Data Source | Rationale |
|---|---|---|---|
| WAU Rate | 20% | Tool telemetry | Adoption is necessary but not sufficient |
| AI-Assisted Code Rate | 20% | Git analysis | Depth of integration into actual work |
| Perceived Time Savings Index | 30% | Developer surveys | Developer-reported value (highest correlation with satisfaction) |
| Quality Score | 30% | Git analysis + surveys | Durability of output (guards against volume inflation) |
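A minimal sketch of the composite calculation, assuming every input is already normalized to a 0-100 scale (the Perceived Time Savings Index in particular is a survey-derived normalization, not raw hours):

```python
def ai_value_realization_score(wau_rate, ai_code_rate,
                               time_savings_index,
                               turnover_rate, edit_rate):
    """Composite score using the weights from the table above.

    wau_rate, ai_code_rate, time_savings_index: 0-100 scales.
    turnover_rate, edit_rate: percentages feeding the Quality Score penalty.
    """
    quality = 100 - (turnover_rate * 2) - (edit_rate * 1.5)
    return (wau_rate * 0.2
            + ai_code_rate * 0.2
            + time_savings_index * 0.3
            + quality * 0.3)

# Hypothetical org: 60% WAU, 40% AI-assisted code, strong perceived savings,
# 12% turnover, 20% post-acceptance edit rate
score = ai_value_realization_score(60, 40, 75, 12, 20)
# quality = 100 - 24 - 30 = 46; score = 12 + 8 + 22.5 + 13.8 ≈ 56.3 ("Fair")
```

Note the leverage of the quality penalty: every point of turnover costs two points of Quality Score, which is what keeps a high-volume, high-churn team from scoring well.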
Interpreting the Score:
| Score Range | Interpretation | Typical Action |
|---|---|---|
| 80-100 | Excellent -- AI investment is delivering strong, durable value | Optimize and scale |
| 60-79 | Good -- value being delivered but quality or adoption gaps exist | Investigate quality or adoption barriers |
| 40-59 | Fair -- AI tools adopted but value realization incomplete | Focus on training, workflow integration, and quality monitoring |
| Below 40 | Underperforming -- significant gaps in adoption, value, or quality | Reassess tool selection, rollout strategy, or organizational readiness |
For a complete guide to calculating and tracking this metric, see AI Value Realization Score: One Number for AI Engineering ROI.
The Developer AI Impact Framework is designed to be adopted incrementally. Do not attempt to measure all five pillars on day one.
Phase 1 (Weeks 1-2): Adoption Baseline Start with Pillar 1. Instrument your AI coding tools to capture WAU, DAU, and adoption distribution. This requires only API access to your tool admin dashboards (Cursor, Copilot, Claude Code) and a simple aggregation layer.
Phase 2 (Weeks 3-4): Code Share and Velocity Add Pillars 2 and 3. Connect your Git infrastructure to measure AI-assisted code percentage and begin classifying PRs by complexity for CAT calculation. This phase requires Git metadata analysis and a complexity classification model.
Phase 3 (Weeks 5-8): Quality Add Pillar 4. Begin tracking code turnover rate at 30-day windows, segmented by AI-generated versus human-written code. This requires historical Git data and a code attribution pipeline.
Phase 4 (Weeks 9-12): ROI and Surveys Add Pillar 5 and the qualitative survey layer. With all four preceding pillars instrumented, ROI calculation becomes a derivation. Launch the biweekly pulse survey to begin collecting perceived time savings and task fit data.
Phase 5 (Week 12+): Composite Scoring Once all five pillars and survey data are flowing, compute the AI Value Realization Score as your executive-level tracking metric.
Larridin operationalizes all five pillars using your existing tool stack -- Cursor, Claude Code, GitHub Copilot, and standard Git infrastructure -- without requiring developers to change their workflow or adopt new tools.
You measure the value of what gets delivered, not the volume of what gets produced. The Developer AI Impact Framework replaces raw output metrics (PRs, LOC, commits) with complexity-adjusted throughput (CAT), which weights each unit of work by its difficulty. A developer who ships one architecturally complex feature (8 CAT points) registers the same throughput as one who ships eight trivial config changes (8 x 1 = 8 points), even though the latter generated eight times as many PRs. Combined with code turnover measurement, this separates genuine productivity from AI-inflated volume.
DORA metrics remain partially useful but are no longer sufficient. Deployment Frequency and Lead Time for Changes are structurally inflated by AI-generated code -- more code ships faster, but that does not necessarily mean more value is delivered. Change Failure Rate misses the code churn problem entirely. MTTR remains relevant. The Developer AI Impact Framework does not replace DORA so much as it extends the measurement surface to account for AI's impact on code volume, code quality, and the relationship between the two. See Why DORA Metrics Break in the AI Era for a detailed analysis.
Complexity-Adjusted Throughput (CAT) is a velocity metric that weights engineering output by the difficulty of the work delivered. Each PR or unit of work is classified as Easy (1 point), Medium (3 points), or Hard (8 points) based on the nature of the change. CAT per engineer per week replaces raw PR counts and LOC as the primary velocity measure. The nonlinear scaling (1-3-8 rather than 1-2-3) reflects the reality that hard engineering problems require disproportionately more skill, context, and judgment than easy ones. See What Is Complexity-Adjusted Throughput? for the complete methodology.
The primary quality metric for AI-generated code is code turnover rate -- the percentage of AI-generated lines that are rewritten or deleted within 30 or 90 days. Pre-AI code churn baselines were around 3.3%. In organizations with heavy AI coding tool adoption, overall churn has risen to 5.7-7.1%[^1]. Healthy AI code quality means keeping AI code turnover within 1.5x of human code turnover. When AI code turnover exceeds 2x human turnover, it signals that AI-generated code is creating rework rather than accelerating delivery. See What Is Code Turnover Rate? for measurement methodology.
A healthy AI coding tool adoption rate is above 50% WAU within 90 days of deployment, with a target of 60-70% WAU at maturity. Industry average WAU currently sits at 30-40%. Leading engineering organizations achieve 60-70% WAU with 25-35% of developers qualifying as power users (daily usage across multiple AI tool modes). Below 30% WAU after the initial rollout period typically indicates adoption barriers -- licensing friction, training gaps, cultural resistance, or poor tool-workflow fit -- that require targeted intervention.
ROI on AI coding tools equals the productive value of time saved minus rework costs, divided by total AI tool cost -- including token and usage-based costs, not just seat licenses. The formula is: Net ROI = (Productive Value of Time Saved - Rework Cost from Code Turnover) / Total AI Tool Cost. Most organizations understate costs (using only seat license fees when agentic tools like Claude Code can cost $200-$2,000/month per engineer in token spend) and overstate returns (ignoring rework from low-quality AI code). Industry average ROI is 2.5-3.5x; top-quartile organizations achieve 4-6x through better prompt engineering, quality gates, and targeted use of agentic tools on high-complexity work. See How to Measure AI Coding Tool ROI for a step-by-step calculation guide.
Not inherently, but without quality measurement, it can. The GitClear data showing doubled code churn since AI adoption suggests that a significant portion of AI-generated code does not survive in the codebase long-term[^1]. This is effectively technical debt in the form of rework. However, the solution is not to reduce AI code generation -- it is to measure code turnover rate and use it as a quality feedback loop. Teams that track AI code turnover and feed quality signals back into their development practices achieve AI code durability comparable to human-written code. See Code Churn in the AI Era for the full analysis.
The Developer AI Impact Framework was developed by Larridin.
[^1]: GitClear, "Coding on Copilot: 2023 Data Suggests Downward Pressure on Code Quality" (2024) and "AI Copilot Code Quality: 2025 Data Suggests 4x Growth in Code Clones" (2025). Analysis of 211M+ changed lines showing code churn increased from 3.3% (2021 pre-AI baseline) to 5.7-7.1% by 2024, correlating with widespread AI coding tool adoption.

[^2]: Block engineering blog, "AI-Assisted Development at Block". Reports approximately 95% of engineers regularly using AI to assist development, with top engineers achieving high AI-assisted code rates in production workflows.

[^3]: GitHub, "Research: Quantifying GitHub Copilot's Impact in the Enterprise with Accenture" (2024). Enterprise acceptance rate data and productivity metrics across large-scale Copilot deployments.

[^4]: Benchmark data sourced from Larridin internal product research across enterprise engineering organizations using AI coding tools (2025-2026). Methodology: aggregated, anonymized engineering data across organizations of varying size and sector.