TL;DR

  • Complexity-Adjusted Throughput (CAT) measures developer output by weighting each pull request by its difficulty — Easy (1 point), Medium (3 points), Hard (8 points) — instead of counting raw volume like lines of code, PRs merged, or commits.
  • CAT exists because AI coding tools inflate traditional throughput metrics. In high-adoption organizations, AI generates 30-70% of committed code. A 3x increase in lines of code does not mean 3x more value delivered.
  • CAT segments AI-assisted work from human-only work, revealing whether AI is accelerating Easy tasks only or genuinely helping with Hard engineering problems.

The Problem with Traditional Throughput Metrics

For decades, engineering leaders have reached for the same set of volume-based proxies to measure developer output. None of them were perfect before AI. With AI, they are actively misleading.

PRs merged per week rewards splitting work into small pull requests. AI makes this trivially easy. A developer can instruct an AI assistant to break a single feature into five atomic PRs in seconds. PR count doubles; value delivered stays the same.

Lines of code per week rewards verbosity. AI can generate thousands of lines in a single session — boilerplate, scaffolding, test files that mirror existing patterns. A 4x increase in LOC output tells you nothing about whether any of that code solves a meaningful problem.

Commits per week rewards commit frequency, not value. AI-assisted workflows naturally produce more commits: exploratory code, incremental changes suggested by the tool, automated refactors. Commit counts rise regardless of whether the underlying work is substantial.

The common thread: when AI generates a significant share of committed code (30-70% in high-adoption teams), all of these metrics inflate. A team's raw throughput can triple after adopting AI tools while their actual engineering output — the hard problems solved, the features shipped, the architectural decisions made — barely moves. Volume is no longer a proxy for value.

What Is Complexity-Adjusted Throughput?

Complexity-Adjusted Throughput (CAT) is a developer productivity metric that measures engineering output by weighting each unit of work by its difficulty rather than counting raw volume. It replaces lines of code, PR counts, and commit frequency with a single score that reflects the cognitive and technical difficulty of what was actually built.

How it works:

Every pull request (or commit, depending on your workflow) is scored by complexity on a three-tier scale:

  • Easy = 1 point. Boilerplate code, configuration changes, simple bug fixes, documentation updates, dependency bumps, auto-generated scaffolding.
  • Medium = 3 points. Standard feature work, moderate refactoring, meaningful test additions, API endpoint implementation, database migration with straightforward schema changes.
  • Hard = 8 points. Architectural changes, complex algorithm implementation, cross-system integrations, performance optimization requiring profiling, security-critical changes, large-scale data model redesigns.

CAT = Sum of complexity points per engineer per week.

The scoring is non-linear by design. A Hard PR requires roughly 8x the cognitive effort of an Easy PR — whether human-written or AI-assisted. An engineer who ships one Hard PR and two Easy PRs in a week scores 10 points. An engineer who ships ten Easy PRs also scores 10 points. Same score, reflecting roughly equivalent output — even though the second engineer shipped more than three times as many PRs.
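The formula above reduces to a weighted sum. A minimal Python sketch (the PR records and the `complexity` field name are illustrative, not from any specific tool):

```python
# Complexity weights from the Easy/Medium/Hard scale above.
WEIGHTS = {"easy": 1, "medium": 3, "hard": 8}

def cat_score(prs):
    """Sum complexity points for the PRs an engineer merged in one week."""
    return sum(WEIGHTS[pr["complexity"]] for pr in prs)

# One Hard PR plus two Easy PRs...
week_a = [{"complexity": "hard"}, {"complexity": "easy"}, {"complexity": "easy"}]
# ...scores the same as ten Easy PRs.
week_b = [{"complexity": "easy"}] * 10

print(cat_score(week_a), cat_score(week_b))  # 10 10
```

Both weeks score 10 points despite very different PR counts, which is exactly the inflation resistance described above.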

This non-linearity is what makes CAT resistant to AI inflation. AI excels at Easy work. It can generate boilerplate, configuration changes, and simple fixes at extraordinary speed. But Hard work — the kind that requires deep system understanding, architectural judgment, and cross-cutting design decisions — remains substantially human-driven. CAT weights accordingly.

How to Compute CAT

Implementing CAT requires three steps: classify PR complexity, segment by AI attribution, and aggregate.

Step 1: Classify PR Complexity

There are three approaches to classification, each with trade-offs:

Manual tagging by developers. Developers apply a complexity label (Easy, Medium, Hard) to each PR. This is the most accurate method because the author best understands the cognitive effort involved. The downside is compliance: developers skip labels when they are busy, and self-reporting introduces bias toward overestimating difficulty.

Automated classification. Use signals from the PR itself to infer complexity: number of files touched, diff size, file types modified (configuration vs. core logic), cross-repository changes, number of services affected, presence of database migrations, and changes to critical paths. Rule-based systems or lightweight ML classifiers can assign complexity tiers with reasonable accuracy. The downside is that automated systems miss context — a 20-line change to a consensus algorithm is harder than a 2,000-line change to test fixtures.
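A rule-based classifier built on signals like these might look like the sketch below. The signal names and thresholds are illustrative assumptions, not a published standard; calibrate them against developer-labeled PRs before relying on the output:

```python
def classify_pr(files_changed, lines_changed, services_affected,
                has_migration, config_only):
    """Infer a complexity tier from coarse PR metadata.

    Thresholds are illustrative; tune them against human-labeled PRs.
    """
    # Configuration-only or tiny, isolated changes are almost always Easy.
    if config_only or (files_changed <= 2 and lines_changed <= 30):
        return "easy"
    # Cross-service changes and large migrations usually indicate Hard work.
    if services_affected > 1 or (has_migration and lines_changed > 500):
        return "hard"
    # Everything in between defaults to Medium, pending developer override.
    return "medium"

print(classify_pr(1, 10, 1, False, True))    # easy
print(classify_pr(12, 800, 3, True, False))  # hard
print(classify_pr(5, 200, 1, False, False))  # medium
```

Note how this sketch also exhibits the weakness described above: a 20-line consensus-algorithm change would land in the Easy tier, which is why a developer override path matters.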

Hybrid: automated suggestion with developer override. The system proposes a complexity tier based on automated signals. The developer confirms or overrides. This captures the benefits of both approaches: automation handles the common cases, and human judgment handles the edge cases. This is the recommended approach for most teams.

Step 2: Segment by AI Attribution

Split CAT into two components:

  • AI-Assisted CAT: Complexity points from PRs where AI coding tools were used (detected via editor telemetry, commit metadata, or developer self-report).
  • Human-Only CAT: Complexity points from PRs with no AI assistance.

This segmentation is where CAT becomes diagnostic rather than merely descriptive. It reveals whether AI is helping with Easy work only — generating boilerplate and configuration changes — or whether it is genuinely accelerating Medium and Hard work. If a team's AI-Assisted CAT is concentrated entirely in the Easy tier, they are capturing only a fraction of AI's potential value.
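One way to sketch the segmentation, assuming each PR record carries an `ai_assisted` flag from telemetry or self-report (field names are illustrative):

```python
# Complexity weights from the Easy/Medium/Hard scale above.
WEIGHTS = {"easy": 1, "medium": 3, "hard": 8}

def segment_cat(prs):
    """Split weekly complexity points into AI-assisted vs. human-only CAT."""
    totals = {"ai_assisted": 0, "human_only": 0}
    for pr in prs:
        bucket = "ai_assisted" if pr["ai_assisted"] else "human_only"
        totals[bucket] += WEIGHTS[pr["complexity"]]
    return totals

prs = [
    {"complexity": "easy",   "ai_assisted": True},
    {"complexity": "easy",   "ai_assisted": True},
    {"complexity": "medium", "ai_assisted": True},
    {"complexity": "hard",   "ai_assisted": False},
]
print(segment_cat(prs))  # {'ai_assisted': 5, 'human_only': 8}
```

In this example the AI-assisted points all come from Easy and Medium work while the Hard points are human-only — the "typing accelerator" pattern the segmentation is designed to surface.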

Step 3: Aggregate and Normalize

Roll CAT up across your organization's hierarchy:

  • Per engineer: Individual output, useful for growth conversations (not performance ranking).
  • Per team: Team-level throughput, useful for capacity planning and sprint retrospectives.
  • Per repo or service: Where is complexity concentrated? Which systems demand the most Hard work?
  • Per org: Executive-level view of engineering output, comparable across quarters.

Weekly cadence is recommended, aligned to your sprint cycle. Monthly aggregation smooths variance but delays signal. Daily is too noisy.
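The roll-up is the same operation at every level of the hierarchy, just grouped by a different key. A minimal sketch, assuming each PR record carries the relevant hierarchy fields (names are illustrative):

```python
from collections import defaultdict

# Complexity weights from the Easy/Medium/Hard scale above.
WEIGHTS = {"easy": 1, "medium": 3, "hard": 8}

def rollup(prs, key):
    """Aggregate weekly CAT points by any hierarchy level (engineer, team, repo)."""
    totals = defaultdict(int)
    for pr in prs:
        totals[pr[key]] += WEIGHTS[pr["complexity"]]
    return dict(totals)

prs = [
    {"engineer": "ana", "team": "payments", "complexity": "hard"},
    {"engineer": "ana", "team": "payments", "complexity": "easy"},
    {"engineer": "raj", "team": "search",   "complexity": "medium"},
]
print(rollup(prs, "engineer"))  # {'ana': 9, 'raj': 3}
print(rollup(prs, "team"))      # {'payments': 9, 'search': 3}
```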

Benchmarks

These benchmarks reflect data from AI-native engineering organizations.

| Metric | Bottom Quartile | Industry Average | Top Quartile | Elite |
| --- | --- | --- | --- | --- |
| CAT per engineer per week (all work) | <5 pts | 8 pts | 14+ pts | >20 pts |
| CAT among engineers using AI tools | <8 pts | 12 pts | 18+ pts | >25 pts |
| CAT among engineers not using AI tools | <4 pts | 7 pts | 10+ pts | >14 pts |

Note: These rows are separate cohorts, not components that sum. "CAT among engineers using AI tools" measures the average output of engineers who use AI assistance. "CAT among engineers not using AI tools" measures the average output of engineers who do not. The "all work" row is the org-wide average across both groups, weighted by the proportion of engineers in each cohort.

The gap between AI-assisted and human-only cohorts is the AI velocity multiplier. Industry average is approximately 1.7x. Elite teams reach 1.8-2.0x, though the multiplier compresses at higher absolute CAT levels because Hard PRs benefit less from current AI tools than Easy and Medium ones.
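As a worked example, the multiplier is simply the ratio of the two cohort averages; plugging in the industry-average figures from the table above:

```python
def ai_velocity_multiplier(ai_cohort_cat, non_ai_cohort_cat):
    """Ratio of the AI-using cohort's average CAT to the non-AI cohort's."""
    return ai_cohort_cat / non_ai_cohort_cat

# Industry-average cohort figures from the benchmark table above: 12 vs. 7 pts.
print(round(ai_velocity_multiplier(12, 7), 2))  # 1.71
```

12 points over 7 points gives roughly 1.7x, matching the stated industry average.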

Why CAT Matters in the AI Era

Signal vs. noise. When AI inflates raw PR counts and lines of code, CAT cuts through by weighting complexity. A developer who ships two Hard PRs in a week (16 points) has demonstrably more impact than one who ships ten Easy PRs (10 points). Traditional metrics would show the opposite: ten PRs looks better than two. CAT corrects this inversion.

Diagnostic power. When velocity drops, CAT decomposition tells you why. Are engineers shipping fewer PRs overall? Shipping the same number of PRs but at lower complexity? Shipping less AI-assisted work? Each pattern points to a different intervention. Fewer PRs might mean blocked reviews. Lower complexity might mean the team is stuck on operational work. Less AI assistance might mean tooling friction. Raw PR counts give you one signal: "less." CAT gives you a direction.

AI attribution. By segmenting AI-Assisted CAT from Human-Only CAT, you see what AI is actually contributing to your team's output. If AI-Assisted CAT is concentrated in Easy work, your team is using AI as a typing accelerator. If AI-Assisted CAT shows up in Medium and Hard work, AI is functioning as an engineering multiplier. The difference matters for tooling investment, enablement strategy, and realistic ROI calculations.

Red Flags

CAT flat while raw PRs increase. AI is inflating volume without producing real output. More PRs at the same total complexity means the team is shipping more trivial work. This is the most common failure mode of early AI adoption: the dashboard looks great (PR count up 3x!) while actual engineering output has not changed.

AI-Assisted CAT concentrated in the Easy range. AI is not being applied to meaningful work. This typically indicates a prompting skill gap, lack of enablement for complex use cases, or developers who do not trust AI output for non-trivial tasks. The intervention is targeted training and shared examples of AI-assisted Hard work.

Individual CAT variance exceeding 3x within a team. If one engineer scores 20 points per week and a teammate scores 6, the gap warrants investigation — but not immediate conclusions. The variance might reflect a measurement problem (inconsistent complexity labeling), a task distribution issue (one engineer is assigned all the Hard tickets), or a genuine skill gap. Diagnose before acting.

CAT vs. Other Throughput Metrics

| Metric | What It Rewards | AI Inflation Risk | Complexity Awareness | AI Attribution |
| --- | --- | --- | --- | --- |
| PRs/week | Splitting work | High | None | None |
| LOC/week | Verbosity | Very High | None | None |
| Commits/week | Commit frequency | High | None | None |
| Story Points/week | Estimation accuracy | Low (human-estimated) | Moderate | None |
| CAT/week | Meaningful output | Low (complexity-weighted) | Yes (Easy/Medium/Hard) | Yes (AI vs. Human) |

Story points deserve a specific note. They offer moderate complexity awareness because teams estimate difficulty before work begins. But story points are pre-work estimates, not post-work measurements. They reflect what a team expected, not what actually happened. CAT is measured after the work is done, based on the actual PR delivered. In AI-assisted workflows where the gap between estimated and actual effort can shift dramatically, post-hoc measurement is more reliable than pre-work estimation.

How CAT Fits the Developer AI Impact Framework

CAT is Pillar 3 (Velocity) of Larridin's Developer AI Impact Framework. It measures how much meaningful work gets done, adjusted for difficulty and segmented by AI involvement.

It works alongside four companion pillars:

  • Pillar 1: AI Adoption — Are teams actively using AI tools? (Measured by Weekly Active Users, power user density.)
  • Pillar 2: AI Code Share — What percentage of code is AI-generated? (Measured by AI-assisted lines, PRs, and commits.)
  • Pillar 4: Quality — Is AI-generated code durable, or does it churn? (Measured by Code Turnover Rate at 30 and 90 days.)
  • Pillar 5: Cost and ROI — Is the investment in AI tools paying off? (Measured by hours saved, rework costs, annualized ROI.)

No single metric tells the full story. CAT measures how fast a team moves; Code Turnover Rate checks whether that speed creates lasting value or technical debt. A team with elite CAT scores and poor code durability is shipping fast but building on sand.

Read the full Developer AI Impact Framework.

Frequently Asked Questions

What is complexity-adjusted throughput?

Complexity-adjusted throughput (CAT) is a developer productivity metric that weights each pull request by its difficulty rather than counting raw volume. PRs are scored as Easy (1 point), Medium (3 points), or Hard (8 points), and summed per engineer per week. This produces a throughput measure resistant to AI inflation, because generating ten Easy PRs (10 points) scores less than shipping two Hard PRs (16 points) — reflecting the actual engineering effort involved.

How do you classify PR complexity?

The most effective approach is hybrid: automated classification with developer override. Automated signals — files touched, diff size, file types, cross-service changes — can reliably distinguish Easy from Medium work. The Hard tier requires human judgment because a small diff in a critical system can represent enormous complexity that automated tools miss. Start with automated classification and add developer overrides for accuracy.

Does CAT replace story points?

CAT does not replace story points for sprint planning, but it does replace them for productivity measurement. Story points are pre-work estimates useful for capacity planning. CAT is a post-work measurement based on what was actually delivered. Both have value, but for measuring team output — especially in AI-assisted workflows where the gap between estimated and actual effort shifts — CAT is more reliable because it measures reality rather than prediction.

How does AI affect throughput measurement?

AI inflates every traditional throughput metric — lines of code, PRs merged, commits per week — by making it trivial to generate high volumes of simple code. A team adopting AI tools can see LOC output triple without any increase in meaningful engineering work. CAT addresses this by weighting complexity: AI-generated boilerplate scores 1 point per PR regardless of how many lines it contains, while a complex architectural change scores 8 points whether it took an hour or a week. The metric tracks value, not volume.

What is a good CAT score?

Industry average CAT is 8 points per engineer per week for all work, and 12 points per engineer per week for AI-assisted work. Top-quartile teams score 14+ points (all work) or 18+ points (AI-assisted). These benchmarks reflect data from AI-native engineering organizations as of early 2026 and will shift as AI tooling matures and teams develop more sophisticated workflows for AI-assisted Hard work.

Footnotes

Methodology references:

  • Larridin. "The Developer AI Impact Framework." The framework that defines CAT as the Pillar 3 velocity metric, alongside AI Adoption, AI Code Share, Code Turnover Rate, and Cost/ROI.
  • The Easy/Medium/Hard scoring system (1/3/8) follows a modified Fibonacci weighting commonly used in engineering estimation. The non-linear scale reflects research on cognitive load: complex tasks require disproportionately more effort than simple ones, regardless of tooling.
  • GitClear, "Coding on Copilot" (2024). Evidence that AI inflates traditional volume metrics while increasing code churn.
  • Benchmark data derived from aggregated, anonymized engineering data across organizations of varying size and sector, current as of early 2026 (Larridin internal benchmark).