There is a CTO doing the rounds in the press at the moment who said, more or less, that one of his engineers spent forty thousand dollars on tokens in a single month, and that he genuinely did not know whether to stop them or tell everyone else to copy them. I read that and laughed, and then I stopped laughing, because we have been somewhere adjacent to that feeling ourselves.
The era of “use the best model, costs be damned” ended somewhere around the spring of this year. The Wall Street Journal reported that executives at the likes of Uber, Meta, Microsoft, Salesforce, and DoorDash launched cost-cutting campaigns after watching their AI bills double, triple, or quietly devour an annual budget in a single quarter. The corporate playbook flipped. A year ago the instruction was “flood the organisation with AI.” Now it is “ration it”. TechCrunch’s reporting filled in the specifics that made engineers wince: a major company that exhausted its entire annual AI coding budget by April, developers who had their coding-assistant licences revoked, a contract that came back at four to five times the previous price on renewal.
And here is the confession, because you may as well have it up front: we hit this wall too. Credo AI put spend caps in place. This is not a post about other people’s invoices. This is actually a post about what we did when the invoice was ours.
Tokens got cheaper & the bill got bigger
Here is the part that breaks people’s intuition, so let’s deal with it first.
The price of a token has been falling. Capability that cost a fortune in early 2024 is now something like ninety-eight per cent cheaper to buy. By every spreadsheet drawn up two years ago, the bills should be shrinking.
But they aren’t.
They’re exploding.
This is Jevons’ paradox arriving in your engineering org, so let me geek out a bit. When a resource becomes cheaper and more useful, you don’t consume less of it; you get a reason to consume drastically more, and the total bill rises even as the unit price collapses. Coal got more efficient and Britain burnt more of it. Tokens got cheaper, and we’re burning them by the billion.
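The arithmetic is worth seeing once. What follows is a minimal sketch, not a model of anyone's real invoice: it takes only the ninety-eight per cent figure from above, and the function name and the volume multipliers are invented for illustration.

```python
def bill_ratio(price_drop: float, volume_multiplier: float) -> float:
    """New bill divided by old bill, given a fractional drop in unit price
    and a multiplier on the number of tokens consumed."""
    return (1.0 - price_drop) * volume_multiplier


# With tokens 98% cheaper, the bill only shrinks while consumption stays
# below 1 / (1 - 0.98), i.e. about 50 times what it used to be.
break_even = 1.0 / (1.0 - 0.98)

for multiplier in (10, 50, 200):  # illustrative volumes, not measurements
    ratio = bill_ratio(0.98, multiplier)
    print(f"{multiplier}x the tokens -> {ratio:.1f}x the bill")
```

Past the break-even multiplier, every further increase in consumption is a straight increase in the bill, however cheap the individual token has become.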
Three multipliers, each compounding the others:
1. The first is always-on agents. Consumption used to be bounded by a human typing a prompt and reading a reply. An agent is not bound by your typing speed. It loops. It reads, plans, executes, verifies, and reads again, and the loop goes on.
2. The second is fat context windows. Every interaction now processes far more text than an early model ever did: whole files, whole directories, whole histories. You pay for all of it on every turn.
3. The third is reasoning models, which think out loud before they answer. That thinking is real work and it is genuinely useful, and it is also tokens you are billed for whether you ever see them.
So the unit of cost changed. We budgeted for AI the way we budget for SaaS: per seat, per head, a predictable line item. But the cost of agentic work is not the seat. It is the token, and nobody’s spreadsheet was watching the token. The per-engineer numbers bear this out: independent data has per-engineer token consumption rising by something close to eighteen-fold inside nine months. You don’t plan for anything eighteen-fold.
“Tokenmaxxing” is lines-of-code thinking
Somewhere along the way, token consumption became a measure of productivity.
The word that emerged for it is “tokenmaxxing”, treating the number of tokens you burn as though it were a numeric representation of how much work you are doing. Big spend, big output, big engineering. It went mainstream, got its think-pieces, and then got its inevitable takedown a few weeks later.
I’ll sound unkind about this for one paragraph, because it deserves it. Tokenmaxxing is the lines-of-code metric wearing a new hat. We spent decades learning that measuring an engineer by the lines they wrote was not just useless but actively perverse; it rewarded the verbose, punished the engineer who deleted a hundred lines or fixed a hard-to-catch bug with a one-liner, and dressed up activity as value. Often, the fewer lines of code it takes to achieve something, the better. Token count is the same mistake with a more expensive consequence. It measures the input and calls it the output.
The data is not subtle about this, either. Yes, the heaviest token users tend to show higher raw throughput, but they spend roughly ten times the tokens to get roughly twice the output, and the back half of that trade is ugly. In the highest-adoption environments, the studies are reporting bugs per developer up by half, code review times stretching several times longer, and code churn, the stuff you write and then immediately rewrite, rising by an order of magnitude. Throughput went up. So did the mess. And so did the bugs.
In Part 1, we described the agent as a junior engineer with perfect memory (most of the time). That framing still holds, with one update for this post: the junior engineer now has a metered salary. They bill you for thinking. And “worked really hard” was never the line on the performance review that mattered. “Shipped the right thing” was. What people in our industry call “Impact”.
Context discipline is cost discipline
Now here’s why we’re writing this, and this is the whole reason this is Part 3 of this series and not a generic finance post.
Everything we built across Parts 1 and 2, for correctness, for consistency, for the sanity of the engineer reviewing the agent’s work, turns out to have been cost engineering we did by accident. The discipline that made the agent good is, almost line for line, the discipline that makes the agent cheap. (Almost. Hold that “almost”; it comes back in the honesty section, and it matters).
Walk back through the machinery with a meter running.
Scoped rules - In Part 1, we argued against the monolithic AGENTS.md in favour of small, focused rules files the agent loads only when they are relevant. We made that case on the grounds of precision. But a monolithic rules document is also a context tax, paid on every message of every session, whether or not a single line of it applies to the task at hand. Scoped rules mean the agent carries only what it needs. Cheaper and sharper. The same decision, billed twice in your favour.
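To put a meter on the scoped-rules point, here is a rough sketch. The file names, globs, and token sizes are all hypothetical, and real agent tooling has its own way of deciding which rules apply; this only shows the shape of the saving.

```python
from fnmatch import fnmatchcase

# Hypothetical rules files: (file, glob it is scoped to, rough token size).
RULES = [
    ("rules/general.md", "*", 300),
    ("rules/api.md", "services/api/*", 900),
    ("rules/ui.md", "apps/web/*", 1200),
    ("rules/migrations.md", "db/migrations/*", 700),
]


def rules_for(path: str) -> list[str]:
    """Return only the rules files whose glob matches the file being edited."""
    return [name for name, glob, _ in RULES if fnmatchcase(path, glob)]


def scoped_tokens(path: str) -> int:
    """Tokens of rules carried on each message when rules are scoped."""
    return sum(size for _, glob, size in RULES if fnmatchcase(path, glob))


def monolithic_tokens() -> int:
    """Tokens carried on each message when every rule lives in one file."""
    return sum(size for _, _, size in RULES)


path = "apps/web/src/Button.tsx"
print(rules_for(path), scoped_tokens(path), monolithic_tokens())
```

The difference between the two totals is paid, or not paid, on every message of the session.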
Commands - A command is a recipe. The single most expensive failure mode in agentic coding is the agent wandering: re-reading half the monorepo, grepping for things it was told to avoid, stuffing its window with the very context you didn’t want. (Do not think of Donald Duck. You did. So does the agent). A command constrains the path. Constraining the path constrains the spend.
Rich JSDoc - Poor documentation means the agent guesses, and a guess means a guess-and-retry loop, and every lap of that loop is billed. Good documentation cuts the loop to one pass. In Part 1, the equation was better docs = better AI. There is now a third line under it: better AI environment = a smaller bill.
Plans - Catching a wrong architectural direction in a plan review costs you a few hundred tokens and five minutes of a human’s attention. Catching the same mistake after the agent has executed it across a dozen files costs you hundreds of thousands of tokens and a bad afternoon. Plans were always cost control. We just called it “review” because that sounded more respectable.
In Part 1, I kept returning to a phrase: “one artefact, two consumers”. The human who reads the rule and understands it, the agent who reads it and is bound by it. This post adds a third consumer to the same artefact, and they sit in finance.
Why use many token when few token do trick
Everything above is the input side of the ledger. Now the half almost nobody optimises.
In an agent loop, today’s output is tomorrow’s input. The reply the model generates this turn gets fed back into the context of the next message, and the one after that. This means a verbose response is not a one-time cost. It is a cost that compounds. You pay for the padding on the turn it is written and again on every turn that carries it forward. Output compression pays twice, and then it keeps paying.
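A toy model makes the compounding visible. This sketch assumes the simplest possible loop: a fixed base context, one reply per turn, and every earlier reply carried forward in full. It counts tokens rather than money, and the turn counts and sizes are illustrative, not measurements of ours.

```python
def tokens_billed(turns: int, base_context: int, output_per_turn: int) -> int:
    """Total tokens processed across a session where each reply is fed
    back into the context of every later turn."""
    total = 0
    history = 0  # tokens of earlier replies now sitting in the context
    for _ in range(turns):
        total += base_context + history  # input: base context plus all earlier replies
        total += output_per_turn         # output: this turn's reply
        history += output_per_turn       # this reply becomes input for later turns
    return total


verbose = tokens_billed(turns=20, base_context=2000, output_per_turn=1000)
terse = tokens_billed(turns=20, base_context=2000, output_per_turn=350)
print(verbose, terse)
```

The carried-forward term grows with the square of the number of turns, which is why trimming each reply matters more the longer the session runs.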
This is where Caveman comes in, and it is the kind of tool I like: small, sharp, and slightly silly. It is an open-source skill by Julius Brussee whose entire premise is in the name, “why use many token when few token do trick”. It instructs the model to drop the articles (thank you, Latin), the pleasantries, the connective fluff that makes prose read nicely to a human, and to keep the technical substance untouched. The maintainer’s own measurements against the Claude API put the average output reduction around sixty-five per cent, with a wide spread across prompts. And it is not just the author’s number: Elastic ran an independent test across eight live tool-calling scenarios and landed at roughly sixty-four per cent fewer tokens with, and this is the part that matters, no loss of technical accuracy. The field names survived. The query syntax survived. Only the glue words got dropped out.
The mechanism is the interesting bit. The reason brevity is safe here is the same reason it is valuable: the agent loop doesn’t need the prose to be pleasant, it needs it to be correct, and a stripped-down reply carries the same signal at roughly a third of the cost. Two compilers don’t exchange pleasantries. Neither should two turns of an agent.
We run it with a mode toggle rather than wholesale. And, this is the honest caveat, we turn it off for some work. Exploratory sessions, where you want the agent’s reasoning visible because the reasoning is the point. Onboarding, where a new engineer learning what the agent does will find terse output genuinely harder to follow. Anything where the trail matters more than the cost. That rubs a little against the point we made in Part 1 about consistency being a kindness to your future team: terseness is not always kind.
The resolution we landed on is a boundary: compress the chatter, never the artefacts. The code stays rich. The docs stay rich. The stories stay rich. It is the conversation around them that goes caveman.
The levers we actually pull
Caveman is the quick win. Around it sit the systematic techniques, none of them glamorous, all of them load-bearing.
Model routing - Frontier models for frontier tasks, and something cheaper for the rest. Most of what an agent does in a day doesn’t need the most expensive model on the menu, and the gap between everything-on-frontier and a sensible tiered approach is not small; one analysis of billions of enterprise API calls put a tiered architecture at a blended rate something like an order of magnitude below running the top model for every call. It is the highest-leverage decision most teams make once and then never revisit.
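A tiered router can be very small. The sketch below is an assumption-heavy illustration: the tier names, the prices, and the task labels are all hypothetical, and a real router would classify tasks with something better than a hard-coded set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    price_per_million: float  # hypothetical price in dollars per million tokens


CHEAP = Tier("small-model", 0.5)
FRONTIER = Tier("frontier-model", 15.0)

# Hypothetical labels for the work that earns the expensive model.
FRONTIER_TASKS = {"architecture", "hard-debugging"}


def route(task_kind: str) -> Tier:
    """Send frontier tasks to the frontier tier and everything else to the cheap one."""
    return FRONTIER if task_kind in FRONTIER_TASKS else CHEAP


def blended_rate(calls: list[tuple[str, int]]) -> float:
    """Dollars per million tokens across a list of (task_kind, tokens) calls."""
    total_tokens = sum(tokens for _, tokens in calls)
    cost = sum(route(kind).price_per_million * tokens / 1_000_000
               for kind, tokens in calls)
    return cost / total_tokens * 1_000_000


day = [("rename", 900_000), ("architecture", 100_000)]  # illustrative mix
print(f"{blended_rate(day):.2f} vs {FRONTIER.price_per_million:.2f} all-frontier")
```

The blended rate depends entirely on the mix of work, which is the argument for revisiting the routing table rather than setting it once.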
Prompt caching and stable prefixes - Your system prompt and your rules get resent on every single call. Keep the stable part of the context actually stable. Same bytes. Same order. And the cache amortises it instead of charging you full freight each turn. Reordering your prompt for no reason is setting money on fire quietly.
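One way to keep yourself honest about "same bytes, same order" is to build the prefix deterministically and fingerprint it. This is a sketch under the assumption that the provider's cache matches on an identical leading span of the prompt; the function names are ours for illustration and nothing here calls a real API.

```python
import hashlib


def stable_prefix(system_prompt: str, rules: dict[str, str]) -> str:
    """Serialise the unchanging part of the context the same way every time."""
    # Sort by file name so dict insertion order can never reorder the prompt.
    sections = [f"## {name}\n{rules[name]}" for name in sorted(rules)]
    return "\n\n".join([system_prompt, *sections])


def prefix_fingerprint(prefix: str) -> str:
    """Hash of the prefix bytes; if this changes between calls, the cache was missed."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


def build_prompt(prefix: str, task: str) -> str:
    """Volatile content goes last so it never disturbs the stable prefix."""
    return f"{prefix}\n\n{task}"


first = stable_prefix("You are a coding agent.", {"b.md": "Rule B", "a.md": "Rule A"})
second = stable_prefix("You are a coding agent.", {"a.md": "Rule A", "b.md": "Rule B"})
print(prefix_fingerprint(first) == prefix_fingerprint(second))
```

Logging the fingerprint alongside each call is a cheap way to notice when a harmless-looking refactor has started reshuffling the prompt.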
Lean tools and lean tool responses - This one we learnt with our own MCP work, and it is going in a future post properly, but the short version belongs here: a tool that returns a fifty-field JSON blob when the agent needed three fields is a ‘tax’ billed on every message for as long as that blob lives in the context. Auto-generating one tool per API endpoint felt efficient and produced an agent drowning in options and verbose returns. Moving to fewer, purpose-built functions that return only what’s needed was a cost decision as much as an ergonomics one.
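The fifty-fields-versus-three point, as a sketch. The record and the field names are made up, and serialised length stands in very roughly for tokens; this is not how our MCP tools are written, just the smallest version of the idea.

```python
import json


def lean_response(record: dict, fields: tuple[str, ...]) -> dict:
    """Keep only the fields the agent needs; fields missing from the record are skipped."""
    return {name: record[name] for name in fields if name in record}


# A hypothetical fifty-field API record: three useful fields and 47 of padding.
full = {"id": 42, "status": "open", "owner": "dana"}
full.update({f"field_{i}": i for i in range(47)})

lean = lean_response(full, ("id", "status", "owner"))

# Character counts of the serialised payloads, as a rough proxy for tokens.
print(len(json.dumps(full)), len(json.dumps(lean)))
```

Whatever the first number is, it is carried on every later message of the session unless the tool trims it at the source.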
Plans, again - Yes, these are the same plans from two sections ago. They belong in the toolkit as much as in the philosophy. The cheapest mistake is the one caught before it executes.
What the bill & cap taught us
Engineering blog posts that only describe successes are marketing. So here is the uncomfortable middle of this one.
We wrote two posts about disciplined agentic workflows. We meant every word. And the bill still got us. That is the thing I most want you to take from this section, because it took us a while to say it plainly to ourselves: discipline made the agent correct. It did not make it cheap. Those turned out to be different problems. A perfectly-scoped, well-documented, plan-driven agent can still loop expensively on a hard task. Correctness and cost are correlated, not identical, and we had quietly assumed they were the same thing.
So we rationed. We put spend caps in place, and they did what caps do. Blunt, effective, a bit demoralising at first. The point is that we don’t need to use AI for every change in the codebase. The caps were meant to be the pressure. They forced engineering to find alternatives for more complex work.
Caveman on the output side, tighter context discipline on the input side, routing and caching around the edges. Consumption came back under the ceiling without us giving up the workflow that made the agent worth having in the first place.
And now the part that will sound familiar if you read Part 1. In that post, I admitted we had not measured agentic coding’s effect on velocity rigorously. We believed in it but would be sceptical of anyone selling a precise multiplier. The same honesty applies here, and it should. I cannot hand you a savings percentage. I can tell you we felt it, and I can tell you the caps stopped being hit because, as I said, you don’t need an AI running to change one line. That is a signal. If anyone offers you a tidy “we cut costs by X per cent with one weird trick”, treat it precisely the way you’d treat a tidy velocity multiplier, which is to say, with your hand on your wallet.
Rationing versus engineering
Here is the distinction the whole post has been walking towards.
Rationing is turning off the office lights to save money. It works. It is immediate. And it caps your upside along with your waste; the lights were, after all, helping people see. Spend caps, tiered access, revoked licences: all of it treats the symptom. The bill is too high, so you consume less. Fine. Necessary, even, in the moment. Engineering is needing less light to do the same work.
The industry is busy building the accounting muscle right now. The Linux Foundation stood up a body to bring proper cost discipline to tokens, and the share of finance functions that now own AI spend has gone from a minority to nearly everyone in the space of a year. This is good and overdue. But accounting tells you where the money went. It doesn’t tell you how to need less of it. That second question is an engineering question, and it has an engineering answer, and we have been writing that answer down for three posts now: scope the context, constrain the agent, compress the output, review the plan. Architecturally constrained agents are cheaper agents, not as a happy accident but as a direct consequence.
For us, the order went: rationing bought the time, engineering bought the budget back.
So, in the style of this series, the equation is:
Cheaper tokens ≠ cheaper AI.
Disciplined context = cheaper AI.