The End of Tokenmaxxing: Why Companies Are Suddenly Counting Every AI Token

After years of spend-at-all-costs AI adoption, companies are demanding ROI, capping budgets, and switching to cheaper models.

Sam Carter · Jun 28, 2026 · 9 min read

Cover image. Photo: jurvetson / flickr (BY 2.0)

For two years, the unofficial rule inside many engineering organizations was simple: use as much AI as you possibly can. As of late June 2026, that rule is being rewritten. CNBC reports that companies are pivoting hard from what insiders call "tokenmaxxing" toward efficiency, ROI, and tighter spending controls, and the shift is putting real pressure on the model labs that benefited most from the old mindset.

The numbers behind the turn are striking. The average enterprise AI budget has ballooned from roughly $1.2M a year in 2024 to about $7M in 2026, even as the price of an individual token has fallen more than 90 percent since 2023. The bills came due, and finance teams started asking what, exactly, they were paying for.

Quick answer

Tokenmaxxing, the spend-at-all-costs culture that treated token usage as a proxy for productivity, is ending because total AI bills doubled even as per-token prices fell 90 percent (the Jevons paradox: agentic workflows burn 100x to 1,000x more tokens per task). Companies like Uber are now imposing per-employee caps, cheaper models like DeepSeek V4 and Gemini 3.5 Flash are winning real production traffic, and the metric that matters has shifted from tokens consumed to cost per successful task. Teams that add routing, caching, and batching report cutting bills 60 to 85 percent with no quality loss.

Key takeaways

"Tokenmaxxing", the spend-at-all-costs culture where token usage was treated as a proxy for productivity, is ending as companies demand measurable ROI.
Per-token prices have collapsed, but total spend is up because agentic workflows burn 100x to 1,000x more tokens per task. This is the Jevons paradox applied to AI.
Concrete crackdowns are here: Uber blew its annual AI budget in four months and imposed per-employee caps; Microsoft revoked some Claude Code licenses; MIT found ~95% of generative AI pilots produced no measurable profit.
Cheaper models are winning real production traffic. DeepSeek made a 75 percent price cut permanent in May 2026, and Google is pushing Gemini 3.5 Flash "for everything."
Teams that apply routing, caching, and batching report cutting AI bills 60 to 85 percent with no visible quality loss. The metric that matters now is cost per successful task.

What "tokenmaxxing" was

Tokenmaxxing describes the spend-at-all-costs era of AI adoption, where employers actively incentivized developers to consume as many tokens as possible without scrutinizing the results. The logic was that frontier models were improving so fast that the bottleneck was human reluctance, not cost, so the smart move was to remove every barrier to usage.

That mentality fueled extraordinary growth. OpenAI and Anthropic were the principal beneficiaries, with consumption-driven revenue pushing both toward valuations approaching a trillion dollars. Both companies filed confidentially in early June for potentially historic IPOs. Even OpenAI's Sam Altman has conceded that token costs are becoming "a huge issue," with overspending turning into an industry meme.

Why the bill went up while prices went down

Here is the paradox that caught finance teams off guard. Per-token prices fell roughly 10x over two years, yet aggregate AI spending has doubled since late 2025. The reason is structural: a single-turn chat query and an autonomous agent are not the same workload. Agentic pipelines loop, retry, call tools, and re-read context, so they consume 100x to 1,000x more tokens per completed task than a one-shot prompt.

Economists have a name for this. When the unit cost of a resource drops, consumption often rises faster than the price falls, so total expenditure climbs. It is the Jevons paradox, first described for coal in 1865 and now playing out in inference. Cheaper intelligence did not make AI cheaper to operate. It made AI cheap enough to deploy everywhere, and "everywhere" has a budget.
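
The arithmetic behind the paradox is simple enough to sketch. The function name `bill_multiplier` is ours, and the volume multiples are illustrative inputs rather than measurements; the only point is that the new bill is the product of the price ratio and the volume ratio.

```python
def bill_multiplier(price_drop: float, volume_growth: float) -> float:
    """Return new_bill / old_bill.

    price_drop    -- fractional fall in per-token price (0.90 means 90% cheaper)
    volume_growth -- multiple by which token volume grew (20 means 20x)
    """
    if not 0 <= price_drop < 1:
        raise ValueError("price_drop must be in [0, 1)")
    return (1 - price_drop) * volume_growth


# A 90% price cut combined with 20x more tokens doubles the bill.
print(round(bill_multiplier(0.90, 20), 2))   # 2.0
# The same cut with only 5x more tokens halves it.
print(round(bill_multiplier(0.90, 5), 2))    # 0.5
```

Anything above a 10x increase in volume wipes out a 90 percent price cut, and agentic workloads clear that bar easily.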

Rows of server racks in a data center representing AI inference infrastructure — Photo: skreuzer / flickr (BY-NC-SA 2.0)

The crackdown arrives

The mood has changed, and the examples are concrete. According to CNBC, Uber implemented spending tiers on some AI tools, starting at a base of roughly $1,500 per month, with employees needing to request higher allowances. That move followed an admission from Uber's CTO that the company burned through its entire annual AI budget in just four months. Microsoft, meanwhile, revoked some developers' Claude Code licenses months after enabling them.

The discipline reflex is backed by sobering ROI data. An MIT analysis found that roughly 95 percent of generative AI pilots produced no measurable profit. S&P Global reported that 42 percent of companies abandoned most of their AI projects in 2025, and IBM put the share of deployments hitting expected ROI near 25 percent. The free-for-all is becoming a managed line item.

Note

"Tokenmaxxing" is industry slang, not an official term. It captures a culture where token usage itself was treated as a proxy for productivity, regardless of whether the output justified the spend.

Cheaper models are winning real traffic

The clearest signal is where the workloads are going. CNBC highlights the CEO of AI startup Lindy, who moved his company entirely off Anthropic's Claude models, shifting 100 percent of traffic to DeepSeek, a Chinese lab offering cheaper, open-weight alternatives. When a company swaps its entire inference stack to cut cost, that is not an experiment; it is a verdict.

The price gap is now hard to ignore. On May 22, 2026, DeepSeek made its 75 percent promotional discount on V4-Pro permanent: input dropped from $1.74 to $0.435 per million tokens and output from $3.48 to $0.87 per million. Its V4 Flash tier sits even lower, around $0.14 input and $0.28 output per million. For volume-heavy production tasks such as classification, extraction, and routine code edits, a model that costs an order of magnitude less and is not meaningfully worse changes the math across millions of calls.

The spread across tiers is what makes routing pay off. Approximate June 2026 list prices per million tokens:

Model tier	Example	Input	Output	Best for
Frontier	Premium reasoning model	$3 to $15	$15 to $75	Hard reasoning, high-stakes calls
Mid	Gemini 3.5 Flash	~$0.30	~$1.20	Fast default, agentic loops
Budget	DeepSeek V4-Pro	$0.435	$0.87	High-volume code and analysis
Ultra-budget	DeepSeek V4 Flash	~$0.14	~$0.28	Extraction, classification
Cache hit	DeepSeek cached prefix	~$0.0028	n/a	Stable system prompts, few-shot

This is also where small language models enter the picture. For narrowly scoped tasks, a fine-tuned small model running on cheap hardware can beat a frontier API on both latency and cost, and teams running their own local inference engines can drive marginal cost close to the cost of electricity.

The labs are responding to the same signal

The model providers see the shift and are repositioning around it. When Google launched Gemini 3.5 Flash at I/O in May 2026, developer Simon Willison summarized Google's strategy bluntly: the company plans to use Flash "for everything." Flash became the default model in the Gemini app and in Search's AI Mode, precisely because a fast, cheaper model that performs well on agentic and coding benchmarks is what cost-conscious customers now want.

In other words, the labs are no longer competing only on who has the biggest frontier model. They are competing on cost per useful result, because that is the metric their customers have started measuring. The same recalibration is visible in coding tools, where the choice between premium and budget assistants, a theme we covered in Claude Code vs Cursor, increasingly comes down to tokens burned per merged pull request, not raw capability.

Warning

A cheaper model is only cheaper if it gets the job done in one pass. A low-cost model that needs three retries or heavy human correction can quietly cost more than a pricier one that succeeds the first time. Measure end-to-end, not per-token.
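
One way to make the comparison concrete is to divide the cost of an attempt by its success rate. This is a simplified sketch: it assumes attempts are independent and retried until one succeeds, it ignores human correction time, and the prices and success rates are hypothetical numbers picked to show the crossover.

```python
def cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend to get one successful result.

    Assumes attempts are independent and retried until one succeeds, so the
    expected number of attempts is 1 / success_rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


# Hypothetical numbers, chosen only to illustrate the crossover.
cheap = cost_per_success(cost_per_attempt=0.004, success_rate=0.25)
premium = cost_per_success(cost_per_attempt=0.012, success_rate=0.95)

print(f"cheap model:   ${cheap:.4f} per success")
print(f"premium model: ${premium:.4f} per success")
```

With these inputs the model that is three times cheaper per attempt ends up more expensive per success, because it needs four attempts on average.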

What this means for teams

If your organization rode the tokenmaxxing wave, now is the moment to audit. Production teams that apply the moves below report cutting AI bills 60 to 85 percent with no visible quality loss.

Route every request to the cheapest model that can handle it

Reserve frontier models for genuinely hard reasoning and send routine work to smaller, faster, cheaper models. A minimal router looks like this:

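def route(task):
    """Return the cheapest model tier that can plausibly handle the task."""
    if task.needs_deep_reasoning or task.high_stakes:
        return "frontier-model"      # expensive, rare
    if task.is_structured_extraction:
        return "deepseek-v4-flash"   # cheap, high volume
    return "gemini-3.5-flash"        # fast default

# task and call() are placeholders for your own task object and provider client.
model = route(task)
response = call(model, task.prompt)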

The volume-heavy steps that drive the bill are usually not the steps that need frontier reasoning.
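
To see what a routing policy is worth before building it, you can estimate the blended cost of a traffic mix. The sketch below uses the approximate list prices from the tier table earlier in this article; the frontier price (low end of the quoted range), the request size, and the traffic shares are all assumptions for illustration.

```python
# Approximate list prices in USD per million tokens (input, output).
# The frontier entry assumes the low end of the quoted $3-$15 / $15-$75 range.
PRICES = {
    "frontier-model": (3.00, 15.00),
    "gemini-3.5-flash": (0.30, 1.20),
    "deepseek-v4-flash": (0.14, 0.28),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at list price."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def blended_cost(mix: dict[str, float], input_tokens: int, output_tokens: int) -> float:
    """Average cost per request for a traffic mix whose shares sum to 1."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * request_cost(model, input_tokens, output_tokens)
               for model, share in mix.items())


# Hypothetical workload: 2,000 input and 500 output tokens per request.
all_frontier = blended_cost({"frontier-model": 1.0}, 2_000, 500)
routed = blended_cost(
    {"frontier-model": 0.10, "deepseek-v4-flash": 0.60, "gemini-3.5-flash": 0.30},
    2_000, 500,
)
print(f"all frontier: ${all_frontier:.5f} per request")
print(f"routed:       ${routed:.5f} per request")
print(f"savings:      {1 - routed / all_frontier:.0%}")
```

Under these made-up inputs the routed mix comes out roughly 85 percent cheaper. The real figure depends on your own mix, request sizes, and retry rates, so treat the output as a planning estimate only.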

Cache and batch aggressively

Prompt caching alone can save 30 to 50 percent on repeated prefixes, and providers price cache hits at a steep discount: DeepSeek charges roughly $0.0028 per million input tokens for a cache hit versus $0.14 on a miss, about a 98 percent reduction. Stable system prompts and few-shot examples are prime candidates. Batch non-urgent jobs to take advantage of off-peak and batch-tier pricing.
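
How much caching saves depends on the share of input tokens that actually hit the cache. A minimal sketch of that relationship, using the DeepSeek hit and miss prices quoted above; `hit_rate` here is assumed to mean the fraction of input tokens served from cache, and the sample rates are arbitrary.

```python
CACHE_HIT_PRICE = 0.0028   # USD per million input tokens on a cache hit
CACHE_MISS_PRICE = 0.14    # USD per million input tokens on a miss


def effective_input_price(hit_rate: float) -> float:
    """Blended input price per million tokens for a given cache hit rate."""
    if not 0 <= hit_rate <= 1:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * CACHE_HIT_PRICE + (1 - hit_rate) * CACHE_MISS_PRICE


for hit_rate in (0.0, 0.5, 0.9):
    price = effective_input_price(hit_rate)
    saving = 1 - price / CACHE_MISS_PRICE
    print(f"hit rate {hit_rate:.0%}: ${price:.4f}/M input tokens ({saving:.0%} saved)")
```

The saving scales almost linearly with hit rate, which is why keeping the system prompt and few-shot block byte-for-byte stable matters more than any other caching tweak.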

Instrument spend and measure cost per successful task

Uber's four-month blowout happened in part because usage outran visibility. Track tokens by team and by feature, then attach them to outcomes. The metric that aligns finance and engineering is cost per successful task: the all-in spend to resolve a ticket, merge a pull request, or qualify a lead, including retries and failed attempts. Pair this with LLM-as-a-judge evals so quality is measured continuously, not assumed, when you downgrade a model.
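
In practice this metric is an aggregation over a usage log. The sketch below assumes you already record one row per attempt with a team, a feature, a cost, and a success flag; the `Attempt` record and the sample rows are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Attempt:
    team: str
    feature: str
    cost_usd: float      # all-in cost of this attempt, including tool calls
    succeeded: bool      # did the attempt resolve the task?


def cost_per_successful_task(attempts: list[Attempt]) -> dict[tuple[str, str], float]:
    """Total spend (failed attempts included) divided by successes, per (team, feature)."""
    spend = defaultdict(float)
    wins = defaultdict(int)
    for a in attempts:
        key = (a.team, a.feature)
        spend[key] += a.cost_usd
        wins[key] += a.succeeded
    # A path with spend but no successes has no finite cost per success.
    return {k: (spend[k] / wins[k] if wins[k] else float("inf")) for k in spend}


log = [
    Attempt("support", "ticket-triage", 0.02, True),
    Attempt("support", "ticket-triage", 0.02, False),
    Attempt("support", "ticket-triage", 0.02, True),
    Attempt("platform", "pr-review", 0.40, False),
]
for key, cost in cost_per_successful_task(log).items():
    print(key, cost)
```

Failed attempts stay in the numerator on purpose: a path that burns money without ever succeeding should surface as infinitely expensive, not disappear from the report.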

Trim the context you send

Bloated prompts are pure waste at agentic scale. Smarter agent memory, meaning retrieval of only the context a step actually needs rather than re-stuffing the entire history on every turn, cuts input tokens directly, and input tokens are often where the bill quietly lives.
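
The waste grows quadratically with the number of turns, which a quick count makes visible. The sketch below compares re-sending the full history against a fixed window of recent turns; the window is a crude stand-in for real retrieval, and the turn count and token sizes are hypothetical.

```python
def full_history_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Input tokens when every turn re-sends all previous turns plus the new one."""
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


def windowed_input_tokens(turns: int, tokens_per_turn: int, window: int) -> int:
    """Input tokens when each turn sends at most the `window` most recent turns."""
    return sum(min(k, window) * tokens_per_turn for k in range(1, turns + 1))


# Hypothetical agent run: 50 turns of roughly 1,000 tokens each.
full = full_history_input_tokens(50, 1_000)
trimmed = windowed_input_tokens(50, 1_000, window=5)
print(full, trimmed, f"{1 - trimmed / full:.0%} fewer input tokens")
```

A recency window is the simplest possible policy. Retrieval-based memory selects by relevance instead, but the token accounting works the same way.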

What to do right now

If your AI bill is climbing and nobody can say why, run this audit this week:

Pull last month's token usage broken down by team and by feature, not just one company total.
Identify the three highest-volume call paths and check whether any actually need a frontier model.
Turn on prompt caching for every stable system prompt and few-shot block before touching anything else.
Add a simple router that sends extraction and classification to a budget model like DeepSeek V4 Flash.
Define "cost per successful task" for one workflow and start tracking it, including retries.
Stand up LLM-as-a-judge evals so any model downgrade is measured, not assumed.
Set a per-team monthly cap with a request path for more, the way Uber did, so usage cannot silently run away.
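
The last item, a cap with a request path, can start as a few lines of bookkeeping. This is a minimal in-memory sketch, not a billing system: the `TeamBudget` class is hypothetical, and the 1,500 figure simply mirrors the base tier cited earlier as a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class TeamBudget:
    monthly_cap_usd: float
    spent_usd: float = 0.0
    pending_requests: list[str] = field(default_factory=list)

    def charge(self, amount_usd: float) -> bool:
        """Record spend if it fits under the cap; refuse it otherwise."""
        if self.spent_usd + amount_usd > self.monthly_cap_usd:
            return False
        self.spent_usd += amount_usd
        return True

    def request_increase(self, reason: str) -> None:
        """Queue a request for a higher allowance for human review."""
        self.pending_requests.append(reason)


budget = TeamBudget(monthly_cap_usd=1_500)   # placeholder cap
print(budget.charge(1_200))   # fits under the cap
if not budget.charge(400):    # would exceed the cap, so it is refused
    budget.request_increase("load test for the support agent")
print(budget.spent_usd, budget.pending_requests)
```

A production version would persist spend, reset it monthly, and hook `charge` into the same call path as the router, but the control flow is the same: refuse, then route the overage to a human.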

Frequently asked questions

Is tokenmaxxing actually over, or just slowing down?

Adoption is still climbing; what changed is the spending posture. Companies are not using less AI; they are scrutinizing every dollar of it. The era where token consumption itself counted as a win is ending, replaced by ROI gates, per-employee caps, and model-routing policies.

If tokens are cheaper, why is my AI bill going up?

Because of the Jevons paradox. As the unit price of inference falls, teams deploy far more of it (more agents, more automated workflows, more generated code), and aggregate usage rises faster than the price drops. A 90 percent price cut does not help if your token volume grows 20x: you pay a tenth as much per token for twenty times the tokens, so the bill still doubles.

Should we switch everything to a cheap model like DeepSeek or Gemini Flash?

Not blindly. Route by task. Cheap models excel at high-volume, well-defined work like extraction and classification, while genuinely hard reasoning still benefits from frontier models. The win is matching each task to the cheapest model that succeeds in one pass, then measuring quality with evals so a downgrade does not silently raise your retry rate.

What is the single most effective cost lever?

Prompt and response caching, followed closely by model routing. Caching repeated prefixes can cut input costs by 30 to 50 percent or more with almost no engineering risk, and it stacks with routing. Together they account for most of the 60-to-85 percent savings teams report.

The defining question is no longer "can the model do this," but "is the output worth what it costs." That reframing is the entire story of mid-2026: AI is shifting from a land grab into a discipline, and the teams that adapt early will spend less while shipping just as much.

#ai #tools #cost-optimization #llm