Larridin Blog

The CTO's Playbook for Managing AI Coding Spend

Written by Ameya Kanitkar | May 13, 2026

A practical guide to controlling and balancing token costs across Claude Code, Cursor, Codex, and whatever ships next week — without killing velocity.

Spend is real. It's going up. And it comes with nasty surprises.

AI coding spend is now reaching 2–4% of engineering headcount budgets at Silicon Valley tech orgs, and 8–10% at AI-first teams. That is a massive line item — a line that didn't exist eighteen months ago — and it's only going up. For a 100-engineer org, 4% of a $25M headcount budget is a million dollars a year in tokens. For an AI-first shop, double that.

To make matters worse, unsupervised agents can burn thousands of dollars in accidental spend in a few hours. One engineering leader we work with shared that one of their engineers accidentally racked up $8,000 in Claude Opus charges over a weekend running ralph-loops. Not malicious. Not even careless. The loop was doing real work — it just had no ceiling.

Spend is real. Spend is going up. And spend comes with nasty surprises if you don't actively manage it.

The tooling doesn't make this easier. The market is moving too fast to standardize. First it was GitHub Copilot. Then Cursor. Then Claude Code. Now Codex is catching up and Kimi just shipped a coding model that looks excellent. Committing to one tool for the year is a losing strategy. Committing to none means your team falls behind. And the pricing ground keeps shifting underneath you — Cursor recently moved from flat subscription billing to true token-based pricing, and the rest of the category is likely to follow as usage outpaces subscription economics.

This is the playbook I wish someone had handed me six months ago.

Decide the budget before you decide the tools

Before picking a proxy, a gateway, or a dashboard, answer one question: how much can you afford to spend on intelligence per quarter?

Not per year. Annual budgets for this category are fiction — the models, the pricing, and the workflows will all have changed by Q3. Quarterly is the right clock speed.

Set the number. Write it down. That number is the ceiling you're managing against. Everything else in this playbook is about not blowing through it — and knowing what you got for it.
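
To make the arithmetic concrete, here is a back-of-envelope sketch (Python, with illustrative numbers only; plug in your own ratio and headcount budget) that turns an annual headcount budget into a quarterly ceiling and an expected daily burn per engineer. That last number is what the per-key caps in the next section hang off.

```python
# Back-of-envelope budget math. All numbers below are illustrative.
annual_headcount_budget = 25_000_000  # $25M, roughly a 100-engineer org
ai_spend_ratio = 0.04                 # 2-4% typical; 8-10% for AI-first teams

quarterly_budget = annual_headcount_budget * ai_spend_ratio / 4
engineers = 100
workdays_per_quarter = 65

expected_daily_burn = quarterly_budget / engineers / workdays_per_quarter
daily_cap_per_key = 4 * expected_daily_burn  # the 3-5x tripwire, not a target

print(f"Quarterly budget:        ${quarterly_budget:,.0f}")     # $250,000
print(f"Expected daily burn/eng: ${expected_daily_burn:,.2f}")  # ~$38
print(f"Per-key daily cap:       ${daily_cap_per_key:,.2f}")    # ~$154
```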

Eliminate tail-end surprises before you optimize anything else

Cost optimization is a second-order problem. Surprise prevention is the first-order problem.

Before you think about per-team chargeback, per-engineer allocation, or model routing, ask: if one engineer on my team kicked off a long-horizon agent loop tonight, would I know about it before Monday morning?

For most teams the honest answer is no.

The minimum viable controls, in priority order:

  1. A hard daily spend cap per API key, set at something like 3–5x your expected daily burn. Not a ceiling you hit often — a tripwire for runaway loops.
  2. Alerts on anomalies, not just thresholds. A 10x spike versus a 7-day rolling average matters more than "did we cross $X today." (A minimal sketch of this check follows the list.)
  3. Visibility at the session level, so when something does spike you can see which workflow caused it, not just that something did.
  4. A named owner. Someone — usually a platform lead — who gets paged when these fire. Without that, alerts become noise.
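
Here is a minimal sketch of the anomaly check in item 2, assuming you can already pull daily per-key spend from your provider's usage export. The function name, threshold, and data shape are all illustrative, not any particular provider's API.

```python
from statistics import mean

def spend_anomalies(daily_spend_by_key: dict[str, list[float]],
                    spike_factor: float = 10.0,
                    min_baseline: float = 1.0) -> list[str]:
    """Flag keys whose latest daily spend is spike_factor x their 7-day average.

    Each value is daily spend in dollars, oldest first, with today's
    (possibly partial) spend as the last element.
    """
    flagged = []
    for key, history in daily_spend_by_key.items():
        if len(history) < 8:
            continue  # not enough history for a 7-day baseline
        baseline = mean(history[-8:-1])  # the trailing 7 full days
        today = history[-1]
        # min_baseline avoids flagging a key that went from $0.01 to $0.50
        if baseline >= min_baseline and today > spike_factor * baseline:
            flagged.append(key)
    return flagged

# A key averaging ~$30/day that suddenly burns $400 today gets flagged.
print(spend_anomalies({"key-alex": [28, 31, 30, 29, 33, 27, 30, 400]}))
```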

Everything else can wait. This cannot. (The step-by-step recap lives in "The short version" at the bottom of this piece.)

Match your controls to your actual posture

Your cost-management needs should match how you're using AI coding tools today. Most of the pain I see in the field comes from teams whose controls are two levels behind how their engineers actually work. Be honest about which stage below describes you.

Stage 1: Subscriptions only

You're using Claude Code subscriptions, Cursor seats, or Copilot licenses. Flat monthly fees. Predictable.

This is a fine place to start. Don't overbuild. If you're a 20-person team just getting serious about AI coding, buy seats and pay attention.

The catch: the subscription model is eroding. Cursor moved to usage-based credits in June 2025, and teams that were paying $20/engineer/month now have individuals running $20–$30/day during heavy use. Claude Code's subscription is still predictable today, but the direction of travel is clear. Treat subscription-only as a temporary equilibrium, not a long-term strategy.

What to do now: set usage alerts inside each tool's admin console. Know which of your engineers are pushing the limits. When more than 20–30% of your team is regularly exceeding their plan's included credits, move on to Stage 2.

Stage 2: Subscriptions plus direct API

You've outgrown the bundled plans, or you want capabilities (Codex, a newer Kimi model, Claude via API for an internal agent) that aren't in your subscription tools. You're handing out API keys.

This is where the weekend-blowup stories happen.

What you need:

  • One API key per engineer or per small team, never a shared "team key" that can't be attributed.
  • Per-key spend limits configured directly with the provider (Anthropic, OpenAI, etc.) — all major providers support this now.
  • Alerts at 50%, 80%, and 100% of the cap.
  • A weekly review ritual: someone — VP Eng, platform lead, or a delegate — looks at the top five spenders and asks "does this look right?" Not to police, but to catch loops and pattern-match on who's finding leverage. (A minimal version of this review is sketched after this list.)
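
As a sketch of what that weekly ritual can look like in practice: the script below reads a per-key spend export and prints the top five spenders plus any cap-threshold crossings. The CSV schema, key names, and cap amounts are assumptions; adjust them to whatever your provider's console actually exports.

```python
import csv

CAP_BY_KEY = {"key-alex": 500.0, "key-jordan": 500.0}  # weekly caps, illustrative
ALERT_THRESHOLDS = (0.5, 0.8, 1.0)

def weekly_review(usage_csv_path: str) -> None:
    """Print the week's top five spenders and any cap-threshold crossings.

    Assumes a CSV with 'key' and 'spend_usd' columns, one row per day or
    per request batch; totals are summed per key.
    """
    spend: dict[str, float] = {}
    with open(usage_csv_path) as f:
        for row in csv.DictReader(f):
            spend[row["key"]] = spend.get(row["key"], 0.0) + float(row["spend_usd"])

    print("Top five spenders this week:")
    for key, total in sorted(spend.items(), key=lambda kv: -kv[1])[:5]:
        print(f"  {key}: ${total:,.2f}")

    for key, total in spend.items():
        cap = CAP_BY_KEY.get(key)
        if not cap:
            continue  # no cap configured for this key
        crossed = max((t for t in ALERT_THRESHOLDS if total >= t * cap), default=None)
        if crossed is not None:
            print(f"  ALERT: {key} is at {crossed:.0%} of its ${cap:,.0f} cap")
```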

At this point you'll start feeling the observability gap: subscription tools report in their own dashboards, API spend shows up in provider consoles, and nothing talks to anything else. This is a signal, not a crisis. You're ready to graduate when you can't answer "what did engineering spend on AI last week, by team, across all tools" in under ten minutes.

Stage 3: Proxy / AI gateway

This is the inflection point. You introduce a proxy layer that sits between your engineers and the model providers. Every request goes through it. Every request gets logged, attributed, and controlled.

Concretely, a proxy gives you:

  • Virtual keys per engineer or per team, with independent budgets, without juggling real provider credentials (see the client sketch after this list).
  • Hard budget enforcement — not just alerts, actual cutoffs.
  • Model routing — default cheap models for routine completions, expensive frontier models only when the task calls for it.
  • Fallbacks when a provider has an outage.
  • One log of everything — prompts, token counts, latencies, costs — in a single place.
  • Optionality. When the next model ships, you add it to the proxy and your engineers can use it without procurement cycles or credential distribution.
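
To make the abstraction concrete, here is what calling the proxy looks like from an engineer's machine, assuming a LiteLLM-style proxy on localhost:4000 and a proxy-issued virtual key (both placeholders; a hosted gateway gives you its own base URL). The client is the standard OpenAI SDK; only the base URL and key change.

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000",   # the proxy, not the provider
    api_key="sk-virtual-key-for-alex",  # proxy-issued virtual key
)

# The proxy attributes this request to the virtual key, enforces that key's
# budget, logs the tokens and cost, and maps "default-coding-model" to
# whichever real model you've configured behind it.
response = client.chat.completions.create(
    model="default-coding-model",
    messages=[{"role": "user", "content": "Refactor this function to be iterative."}],
)
print(response.choices[0].message.content)
```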

The proxy layer is also the honest answer to the "don't commit to one tool" problem. The tool changes every quarter. The proxy abstraction doesn't.

Picking a proxy. The landscape has sorted itself into clear archetypes:

  • LiteLLM — open-source, self-hosted, MIT-licensed. Maximum control, broadest provider coverage, built-in budget controls per team/user/key. Right choice if you have the engineering bandwidth to run it and want no external dependencies.
  • OpenRouter — hosted, simplest setup, largest model catalog (300+ models). Charges a markup on top of provider rates (roughly 5%). Right choice for fast starts and experimentation.
  • Portkey — hosted with enterprise features (guardrails, prompt management, SOC2/HIPAA/GDPR). Right choice for regulated industries or larger orgs that need compliance controls bundled in.
  • Helicone — hosted, observability-first, low-overhead Rust implementation, self-host option. Right choice when cost visibility and analytics matter more than routing sophistication.

For most 50–500 person engineering orgs: start with OpenRouter (hosted) or LiteLLM (self-hosted). You can migrate later — the API surface is OpenAI-compatible across all of them.
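
If you go the LiteLLM route, the routing and fallback bullets above look roughly like this. A sketch, assuming current LiteLLM Router semantics; the model names, tiers, and environment variables are illustrative.

```python
import os
from litellm import Router

router = Router(
    model_list=[
        {   # cheap default for routine completions
            "model_name": "cheap",
            "litellm_params": {"model": "gpt-4o-mini",
                               "api_key": os.environ["OPENAI_API_KEY"]},
        },
        {   # frontier model, requested explicitly when the task warrants it
            "model_name": "frontier",
            "litellm_params": {"model": "anthropic/claude-sonnet-4-20250514",
                               "api_key": os.environ["ANTHROPIC_API_KEY"]},
        },
    ],
    # If "cheap" errors out (rate limit, provider outage), retry on "frontier".
    fallbacks=[{"cheap": ["frontier"]}],
)

response = router.completion(
    model="cheap",  # engineers ask for a tier, not a vendor
    messages=[{"role": "user", "content": "Write a unit test for this parser."}],
)
print(response.choices[0].message.content)
```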

Unified cost intelligence tied to output

Here's the uncomfortable truth about the proxy layer: it gives you costs, not answers.

You'll see that Alex spent $4,200 last month and Jordan spent $1,100. So what? Maybe Alex shipped three refactors that eliminated two weeks of manual work and Jordan was stuck in a loop debugging a flaky test. Or maybe it's the other way around. The proxy can't tell you.

This is the gap that matters next: spend is an input, not an outcome. The right question is never "who spent the most?" It's "what did we get for what we spent?" Token spend needs to be joined to PR velocity, code turnover, quality signals, and the actual work engineers are doing. Without that, you're managing a cost center. With it, you're managing an investment.

You'll also discover that the proxy only covers the API-based portion of your spend. Cursor subscriptions, Claude Code subscriptions, Copilot seats, whatever your AI code review tool charges — none of that is in the proxy. The full picture requires pulling from a dozen sources.

This is where the default framing ("cap everyone") is actively wrong. The move is the opposite: generous defaults, sharp tripwires, investigated outliers. The engineer who spent $8K might have shipped something worth $800K. Or might have been stuck. You can't know from the spend number alone — and you shouldn't be punishing the former to prevent the latter.

What to actually alert on

If you take nothing else from this piece, take this list. These are the alerts that catch real problems:

  • Daily spend > 3x the 7-day rolling average for any individual key. The classic runaway-loop signal.
  • Single-session burn rate > $X in under an hour. Tune X to your org's baseline.
  • Model mix drift. If an engineer suddenly routes 80% of requests through Claude Opus or GPT-5, that's usually either a real need (fine) or a misconfiguration (catch it).
  • Token-per-PR trending up without output trending up. A leading indicator that someone's stuck.
  • Quarter-to-date spend crossing 60% of the quarterly budget by week 6. You have time to course-correct; don't wait until you're at 100%.

Most of these require joined data across sources. Which is the point. The toy sketch below shows the join in miniature.
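
A toy version of that join, assuming you can pull weekly token counts per engineer from the proxy and merged-PR counts from your VCS host. All names and numbers are made up; the shape of the question is the point.

```python
def tokens_per_pr_trend(tokens: list[float], prs: list[int]) -> float:
    """Ratio of the latest week's tokens-per-merged-PR to the first week's."""
    per_pr = [t / p for t, p in zip(tokens, prs)]
    return per_pr[-1] / per_pr[0]

weekly_tokens = {"alex": [2.1e6, 2.3e6, 5.9e6], "jordan": [1.0e6, 1.1e6, 1.2e6]}
weekly_prs = {"alex": [6, 7, 6], "jordan": [4, 4, 5]}

for eng in weekly_tokens:
    trend = tokens_per_pr_trend(weekly_tokens[eng], weekly_prs[eng])
    note = "tokens/PR climbing, output flat: check in" if trend > 2 else "ok"
    print(f"{eng}: {trend:.1f}x baseline ({note})")
```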

The "who owns this" question

One more thing that trips up nearly every org I talk to: nobody owns this line item.

Finance looks at it and sees an infrastructure cost they don't understand. VPs of Engineering look at it and see a finance problem. Platform/DevEx teams look at it and see a policy problem. The result: it drifts until it's a crisis.

Assign it. Name one person — usually a platform lead or a senior engineering manager with finance partnership — as the owner of AI coding spend. Give them the budget, the alerts, the proxy, and the quarterly review. Without that, none of this works.

The short version

  1. Set a quarterly budget. Typical orgs land at 2–4% of the engineering headcount budget; AI-first teams hit 8–10%.
  2. Eliminate surprise blowups before you optimize anything else.
  3. Start with subscriptions. Graduate to direct API with per-engineer limits. Graduate to a proxy when you can't answer basic questions across tools.
  4. Pick the proxy that matches your constraints — LiteLLM for control, OpenRouter for speed, Portkey for compliance, Helicone for observability.
  5. Join spend data to output data. Manage investment, not cost.
  6. Name an owner.

The companies that get this right will spend more on AI coding than their competitors — and get disproportionately more out of it. The ones that don't will either under-invest out of fear, or find themselves writing a surprise check to Anthropic next quarter that nobody budgeted for.

At Larridin, we help engineering leaders unify AI coding spend across subscriptions, APIs, and proxies — and map it back to developer output. If cost visibility or AI impact measurement is on your plate, we'd love to talk.