Complexity-adjusted velocity weights engineering output by difficulty, not volume — Larridin uses it to replace raw metrics that break when AI inflates code output 10x.
That sentence carries a lot of weight, so here's the concrete version. A developer uses Cursor to scaffold a 500-line CRUD API in 20 minutes. Another developer spends three days writing 50 lines that fix a distributed locking race condition affecting payment processing. Your commit metrics say the first developer is 10x more productive. Your production incident history says the opposite.
This disconnect isn't hypothetical. It's the central measurement crisis facing every engineering organization that adopted AI coding tools in the last two years. And the metrics most teams still rely on — DORA, SPACE, story points, lines of code — were never designed to handle it.
TL;DR
- Raw velocity metrics are broken — AI users produce 4–10x more commits but also 9x more code churn and 4–8x more duplication, inflating dashboards while potentially degrading codebases
- Weight output by difficulty — complexity-adjusted velocity scores code changes by cyclomatic complexity, dependency depth, blast radius, and 30/90-day durability instead of raw volume
- Use a five-pillar framework — measure AI adoption, AI code share, complexity-adjusted velocity, quality/slop, and cost/ROI together so the metrics create healthy tension with each other
- Senior engineers finally get credit — the hardest problems produce the fewest lines of code; complexity adjustment surfaces high-value contributors that raw metrics make look underproductive
- Replace your metric stack, don't add to it — five pillars measured continuously and reported in weekly auto-generated retros give better signal than DORA + SPACE + story points combined
Why Raw Velocity Metrics Fail in the AI Era
The numbers tell a story that should make any VP of Engineering uncomfortable.
GitClear's analysis of 211 million lines of code across thousands of repositories found that high AI users produce 4–10x more commits and durable code changes than non-users. That sounds like a win. But the same study found those users also generate 9x more code churn — code rewritten or deleted shortly after merging — alongside a 4–8x increase in duplicated code blocks and a 39.9% drop in refactoring activity.
Faros AI's data shows the organizational picture is even worse: developers on high-AI-adoption teams complete 21% more tasks individually, but PR review time increases 91%. The system bottleneck shifted from writing code to reviewing it. Apply Amdahl's Law — overall speedup is capped by the stages you can't accelerate — and the math is brutal: your pipeline moves at the speed of its slowest stage, and AI made the slow stage slower.
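The bottleneck arithmetic is easy to check for yourself. A minimal sketch — the 21% and 91% figures come from the Faros data above, but the 10-hour/5-hour baseline split between writing and review is an assumption for illustration:

```python
def pipeline_time(write_h, review_h, write_speedup=1.0, review_slowdown=1.0):
    """Total per-task time when writing speeds up but review slows down."""
    return write_h / write_speedup + review_h * review_slowdown

baseline = pipeline_time(10, 5)
# 10h writing + 5h review = 15.0 hours per task

with_ai = pipeline_time(10, 5, write_speedup=1.21, review_slowdown=1.91)
# write: 10/1.21 ≈ 8.26h, review: 5 * 1.91 = 9.55h → ≈ 17.81 hours per task
```

Faster writing, slower pipeline. The exact crossover depends on how your time splits between stages, but any review-heavy pipeline is vulnerable.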
METR's randomized controlled trial delivered the punchline: experienced developers using AI tools actually took 19% longer to complete tasks, despite believing AI made them 20% faster. A 39-point perception-reality gap.
Every traditional metric — deployment frequency, lead time, commit velocity, lines changed — treats these outputs as equivalent. A 500-line AI scaffold and a 50-line concurrency fix get the same weight. A PR that ships clean and a PR that generates three follow-up patches count identically. When your measurement system can't distinguish difficulty, every AI-generated line of code inflates your numbers while potentially degrading your codebase.
We covered why DORA metrics specifically break under these conditions — the short version is that DORA measures delivery outcomes without any mechanism for complexity normalization or AI attribution.
What Complexity-Adjusted Velocity Actually Measures
Strip away the buzzword and the concept is intuitive: output divided by difficulty, not output alone.
Traditional velocity says: "This team shipped 47 story points this sprint." Complexity-adjusted velocity says: "This team shipped 47 story points, of which 31 involved multi-service coordination across three APIs, required migration-safe schema changes, or touched code paths with >15 cyclomatic complexity."
The adjustment happens across multiple dimensions:
Code complexity signals. Cyclomatic complexity (control flow paths), cognitive complexity (mental effort to understand), dependency depth, and the number of systems touched per change. A PR modifying one endpoint in a standalone service gets a lower complexity weight than a PR modifying three services with shared state. NASA mandates cyclomatic complexity under 10 for critical systems — a developer routinely working in code above that threshold is doing more engineering work per line shipped, and the weighting should credit that.
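Cyclomatic complexity itself is mechanical to approximate: one baseline path plus one per decision point. A toy counter for Python source, using only the standard library — illustrative, not a substitute for production tooling like radon or SonarQube, which handle many more constructs:

```python
import ast

def cyclomatic_complexity(source: str) -> int:
    """Toy McCabe estimate: 1 baseline path + 1 per decision point."""
    decision_nodes = (ast.If, ast.For, ast.While, ast.And, ast.Or,
                      ast.ExceptHandler, ast.IfExp)
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, decision_nodes)
                   for node in ast.walk(tree))

snippet = """
def classify(x):
    if x < 0:
        return "neg"
    for i in range(x):
        if i % 2 == 0 and i > 2:
            return "found"
    return "none"
"""
print(cyclomatic_complexity(snippet))  # → 5: base 1 + if, for, if, and
```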
Context difficulty. Not all code is created in the same environment. Writing net-new code on a greenfield project is fundamentally different from modifying a 15-year-old monolith with undocumented side effects. Complexity-adjusted velocity factors in repo age, test coverage of the affected area, and the density of existing integrations.
Durability. Here's where AI output gets properly discounted. If a change generates churn within 30 or 90 days — reverts, rewrites, follow-up patches — the original velocity credit gets reduced proportionally. GitClear's data showing 9x churn among heavy AI users means their complexity-adjusted velocity would be significantly lower than their raw output suggests. Good. That's the signal you need.
Blast radius. A change touching a payments processing pipeline carries more risk and requires more judgment than a change touching an internal admin dashboard. Complexity-adjusted velocity weights by the criticality of the systems involved, because the engineering skill required to ship safely in high-blast-radius areas is meaningfully higher.
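Putting the four dimensions together, here is a deliberately simplified scoring sketch. Every weight is a placeholder, not Larridin's actual model; the one deliberate design choice shown is making the score sublinear in line count (via a square root) so raw volume alone can't dominate:

```python
import math
from dataclasses import dataclass

@dataclass
class Change:
    lines: int             # lines changed
    cyclomatic: int        # max cyclomatic complexity of touched paths
    services_touched: int  # coordination breadth
    context_factor: float  # 1.0 greenfield .. ~2.0 legacy monolith
    blast_radius: float    # 1.0 internal tool .. ~3.0 payments pipeline
    churn_rate_90d: float  # fraction reverted/rewritten within 90 days

def adjusted_velocity(c: Change) -> float:
    """Hypothetical score: difficulty multipliers applied to sublinear
    volume, then discounted by observed churn. All weights illustrative."""
    difficulty = 1.0 + c.cyclomatic / 10 + 0.5 * (c.services_touched - 1)
    durability = 1.0 - c.churn_rate_90d
    return (math.sqrt(c.lines) * difficulty
            * c.context_factor * c.blast_radius * durability)

scaffold = Change(1500, 4, 1, 1.0, 1.0, 0.10)  # large, simple, some churn
deep_fix = Change(60, 18, 3, 1.8, 3.0, 0.00)   # small, hard, durable
print(adjusted_velocity(scaffold) < adjusted_velocity(deep_fix))  # True
```

With these placeholder weights, the 60-line multi-service fix outscores the 1,500-line scaffold — which is the behavior the metric exists to produce.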
The result is a metric that can finally answer the question engineering leaders actually care about: "Is this team solving harder problems faster, or just generating more code?"
The Five-Pillar Framework: Where Complexity-Adjusted Velocity Fits
Complexity-adjusted velocity doesn't work in isolation. A team could score well on complexity-adjusted output while drowning in AI-generated technical debt, or while burning through $400K/month in Copilot seats with unclear ROI. You need the full picture.
Larridin's measurement framework uses five pillars that together capture what no single metric can:
| Pillar | What It Measures | Why It Matters |
|---|---|---|
| AI Adoption | Tool usage depth — not just "do they use Copilot?" but frequency, context, workflow stage | Identifies who's actually integrating AI vs. who opened it once |
| AI Code Share | Percentage of shipped code attributable to AI assistance | Quantifies how much of your output is AI-generated (the denominator for quality analysis) |
| Complexity-Adjusted Velocity | Output weighted by difficulty, durability, and blast radius | Separates real engineering progress from volume inflation |
| Quality / AI Slop | Duplication, churn, architectural drift, test-implementation coupling | Catches the hidden quality costs of AI-generated code before they compound |
| Cost / ROI | AI tool spend mapped to measurable output improvements | Validates whether your $2M annual AI investment actually returns value |
The pillars are designed to create tension with each other. High velocity with low quality means something different than high velocity with high quality. High AI adoption with low code share suggests tool friction. High code share with high slop scores means the AI is productive but sloppy.
This is the fundamental problem with DORA and SPACE — they measure delivery speed and developer sentiment without connecting to what the code actually looks like or what the AI tools actually contribute. The SPACE framework's limitations become especially visible when you try to evaluate individual contributor (IC) performance, which both frameworks explicitly avoid addressing.
Practical Complexity Normalization: What This Looks Like in Practice
Abstract frameworks don't ship. Here's how complexity normalization works in concrete terms.
Scenario 1: The CRUD Sprint. An engineer uses Claude Code to generate a new microservice — REST endpoints, database models, basic validation, Docker configuration, CI pipeline. Total output: 2,100 lines across 14 PRs in one sprint. Raw velocity: exceptional. Complexity-adjusted velocity: moderate. The cyclomatic complexity of each file is under 5. No existing system dependencies. Zero blast radius beyond the new service. The work was real but structurally simple — exactly the kind of work AI excels at.
Scenario 2: The Distributed Systems Fix. Another engineer spends the same sprint diagnosing and fixing a race condition in a distributed job scheduler. The fix touches three services, requires a two-phase migration to avoid breaking in-flight jobs, and involves code paths with cyclomatic complexity above 20. Total output: 83 lines across 2 PRs. Raw velocity: poor. Complexity-adjusted velocity: high. The difficulty multiplier on the affected systems, combined with the durability of the fix (zero churn at 90 days), produces a score that accurately reflects the engineering skill involved.
Scenario 3: The High-Volume, High-Churn Developer. A third engineer ships 3,400 lines across 22 PRs. Impressive on a dashboard. But 30-day analysis reveals 40% of that code was either reverted or substantially rewritten. The AI Slop Index flags elevated duplication and three instances of pattern mimicry — the AI generated a fourth database access pattern instead of using the existing shared module. Complexity-adjusted velocity after churn correction: below team median despite being the highest raw output producer.
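The churn correction in Scenario 3 is simple arithmetic. The 3,400-line and 40% figures come from the scenario above; the 15% duplication discount is a made-up placeholder for whatever penalty a slop index would assign:

```python
raw_lines = 3400
churn_rate_30d = 0.40                  # reverted or substantially rewritten
durable_lines = raw_lines * (1 - churn_rate_30d)

duplication_discount = 0.85            # placeholder penalty for flagged duplication
effective_output = durable_lines * duplication_discount

print(durable_lines, effective_output)  # 2040.0 1734.0
```

Roughly half the dashboard number survives the correction — before any complexity weighting is applied at all.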
These aren't cherry-picked examples. They're composites from patterns we see in production data across engineering organizations of 50 to 5,000 developers.
Weekly Retros and Per-IC Reports: Operationalizing the Data
Metrics that live in a quarterly dashboard don't change behavior. By the time someone reviews them, the damage is done or the wins are forgotten.
Larridin generates weekly auto-generated retrospectives that surface complexity-adjusted velocity alongside the other four pillars — per team and per individual contributor. The reports aren't vanity dashboards. They're designed to trigger specific conversations.
At the team level, weekly retros answer questions like: Did our complexity-adjusted velocity hold while our AI code share increased? If velocity rose but quality dropped, we're shipping more slop. If velocity held and quality held while AI adoption increased, the tools are actually working. These are the leading indicators that quarterly DORA reviews miss entirely.
At the IC level, reports provide data that engineering managers have never had before — not surveillance metrics, but context-rich performance signals. When an IC's raw output is high but their complexity-adjusted velocity is low, that's not a performance problem. It might mean they're doing necessary but structurally simple work. Or it might mean AI is doing most of the thinking and they're approving without judgment. The difference between those two scenarios requires a manager who looks at the data and has a conversation — which is exactly what the reports are designed to enable.
The per-IC view also catches the inverse problem: senior engineers whose raw output looks modest because they're working on the hardest problems. Without complexity adjustment, these engineers look underproductive on every dashboard. With it, they show up as the highest-value contributors — which is what their incident response record already told you, if anyone was cross-referencing.
Retros auto-generate every Monday. No one has to pull data, build charts, or schedule a review meeting just to understand whether the team's AI investment is working.
Replacing Broken Metrics, Not Adding More
The instinct when metrics fail is to add new ones. Stack DORA with SPACE with McKinsey's contribution analysis with developer experience surveys and hope the pile of numbers produces clarity.
It doesn't. McKinsey's 2023 attempt to measure individual developer productivity — using story points and contribution volume — drew immediate criticism from Kent Beck, Gergely Orosz, and most of the engineering leadership community. The core objection: output proxies like commits and story points are trivially gamed, punish senior engineers who do mentoring and architecture work, and collapse the difference between easy and hard problems. Goodhart's Law ensures that any metric used as a target stops being a useful metric.
Complexity-adjusted velocity isn't immune to gaming. No metric is. But it's structurally harder to game because the complexity weights are derived from code analysis, not self-reported estimates. You can't inflate your cyclomatic complexity score by splitting a PR into smaller pieces. You can't claim systems-level blast radius on a change that touched one file. The signals come from the code itself, not from a Jira ticket someone estimated on a Monday standup.
The goal isn't to add complexity-adjusted velocity to your existing metric stack. The goal is to replace the stack. Five pillars — adoption, code share, complexity-adjusted velocity, quality, and cost — measured continuously and reported weekly. That's fewer metrics than most teams track today, with dramatically better signal.
Frequently Asked Questions
How is complexity-adjusted velocity different from weighted story points?
Weighted story points still depend on human estimation at planning time — a process that's subjective, inconsistent across teams, and easily inflated. Complexity-adjusted velocity derives its weights from automated code analysis after the work ships: cyclomatic complexity, dependency graphs, system criticality, and 30/90-day durability. The measurement happens at the code level, not the ticket level, which makes it resistant to estimation gaming.
Can complexity-adjusted velocity be used to rank individual developers?
It can surface individual data, but ranking misses the point. The value is in identifying patterns — like a senior engineer whose raw output looks low because they're solving the team's hardest problems, or a junior developer whose high output masks quality issues. Larridin's per-IC reports provide this context without reducing people to leaderboard positions. Managers should use it for coaching conversations, not stack ranking.
What tools do I need to measure complexity-adjusted velocity?
At minimum, you need static analysis (cyclomatic and cognitive complexity), git-level change tracking (churn, revert rates), and some mechanism for mapping changes to system criticality. Doing this manually across repos is possible but doesn't scale beyond a few teams. Larridin automates the full pipeline — from code analysis to weekly reports — across all five measurement pillars.
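The churn calculation itself is small — the hard part is the data collection. A simplified sketch, assuming you've already extracted per-change records of lines added and lines of that code rewritten within the window (real pipelines derive these pairs from `git log --numstat` and blame data; the record format here is an assumption):

```python
def churn_rate(changes):
    """changes: list of (lines_added, lines_rewritten_within_window) pairs.
    Returns the fraction of new code rewritten or reverted in the window."""
    added = sum(a for a, _ in changes)
    rewritten = sum(r for _, r in changes)
    return rewritten / added if added else 0.0

# Three changes: 700 lines added in total, 150 rewritten within 30 days.
window = [(500, 50), (120, 100), (80, 0)]
print(round(churn_rate(window), 3))  # 0.214
```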
How does complexity-adjusted velocity handle different programming languages?
Complexity weights are language-aware. A 50-line Rust change involving unsafe blocks and lifetime management carries different complexity than a 50-line Python script. The analysis accounts for language-specific complexity indicators, type system constraints, and the historical churn rates of similar changes in each language context.
Does AI-generated code always score lower on complexity-adjusted velocity?
Not always, but typically yes. AI excels at generating high-volume, structurally simple code — scaffolding, boilerplate, CRUD operations. These are genuinely useful but carry lower complexity weights. When AI-generated code also exhibits higher churn rates (GitClear found 9x more churn among heavy AI users), the durability adjustment further reduces its complexity-adjusted score. The metric isn't anti-AI — it's pro-accuracy.
How quickly can teams adopt complexity-adjusted velocity measurement?
Larridin's five-pillar framework can be deployed in under a week for most engineering organizations. The browser extension and optional desktop agent begin collecting adoption and code share data immediately, while the containerized repo analysis starts scoring complexity and quality within the first analysis cycle. Weekly retros begin generating from the second week onward — teams see their first actionable data within 14 days.
Explore More from Larridin
- Workflow Mapping — Workflow discovery, AI measurement across functions, and ROI frameworks
- AI Adoption Intelligence Center — AI adoption KPIs, measurement benchmarks, and platform comparisons