Skip to content
WhySoGeek.
AI

Prompt Caching Compared: OpenAI, Claude, Gemini

All three big providers now discount cached input 90%, but the write fees, minimums, and lifetimes differ. Here is which one wins your workload.

Sam Carter 8 min read
Cover image for Prompt Caching Compared: OpenAI, Claude, Gemini
Photo: Harvard Law Record / flickr (BY 2.0)

Prompt caching is the cheapest 90% you will ever save on an LLM bill, and by mid-2026 all three major providers offer it. The headline discount is identical, which means the real decision hides in the fine print: write fees, minimum sizes, cache lifetimes, and whether you have to manage it yourself.

Quick answer

As of mid-2026, OpenAI, Anthropic, and Google all discount cached input tokens by 90%. The differences are in the mechanics. OpenAI caches automatically with no separate write fee. Anthropic uses explicit cache_control breakpoints and charges a write surcharge (1.25x input for a 5-minute cache, 2x for one hour). Gemini offers both implicit automatic caching and explicit caching with a per-hour storage fee. For high-reuse workloads the write fee washes out and they tie; for low-reuse, OpenAI's no-write-fee model wins slightly.

Key takeaways

  • All three discount cached reads by 90%. The read economics are essentially tied.
  • The difference is write cost and control. OpenAI is automatic and free to write; Anthropic charges to write; Gemini charges storage per hour.
  • High reuse favors everyone equally because the write fee amortizes away.
  • Low reuse slightly favors OpenAI thanks to no write surcharge.
  • Structure your prompt for caching with the static content first, or you get nothing.

Why prompt caching pays off

Most production prompts repeat a large static block: a system prompt, tool definitions, few-shot examples, a retrieved document. Without caching, you pay full input price to reprocess that block on every call. Caching stores the processed prefix so subsequent calls read it back at a fraction of the cost.

The savings scale with how much of your prompt is reused. An agent that sends the same 8,000-token system prompt on every turn can cut its input bill dramatically, which matters because agentic workflows make many calls per task and input tokens dominate.

The provider mechanics

OpenAI: automatic, no write fee

OpenAI caches automatically. There is no cache_control to set and no separate write line item. The first call is billed at the standard input rate, and subsequent calls that match the same prefix are billed at the cached rate, a 90% discount. Simplicity is the selling point: you get caching without changing your code, as long as your static content sits at the front of the prompt.

Anthropic: explicit, with a write surcharge

Anthropic uses explicit caching. You mark cache breakpoints with cache_control. Writing the cache costs more than a normal input token: 1.25x input for the 5-minute time-to-live, or 2x for the 1-hour TTL. Reads then cost 0.10x input, the same 90% discount. The explicit model gives you control over exactly what is cached and for how long, at the price of a write surcharge and a bit of code.

Google Gemini: implicit and explicit

Gemini supports both. Implicit caching works automatically, similar to OpenAI. Explicit caching lets you cache content deliberately but adds a per-hour storage fee for keeping it alive. Cached tokens are billed at 10% of the input rate, again a 90% discount.

A bar chart comparing cached versus uncached input token costs across providers
Photo: Farcaster (talk) 01:51, 15 October 2008 (UTC) Original uploader was Farcaster at en.wikipedia / wikimedia (BY-SA 3.0)

Head to head

FactorOpenAIAnthropicGoogle Gemini
Cached read discount90%90%90%
Write feeNone1.25x (5 min) / 2x (1 hr)Storage fee per hour
ControlAutomaticExplicit breakpointsImplicit or explicit
Cache lifetimeProvider-managed5 min or 1 hr TTLConfigurable with storage cost
Best whenLow reuse, want zero setupHigh reuse, want controlMixed, want a choice

The breakeven that actually matters

The write fee only stings if you write more than you read. For high cache-hit workloads (roughly five or more reads per write cycle), Anthropic's write surcharge amortizes away and the providers tie on read economics. For low cache-hit workloads (under about three reads per write), OpenAI's no-write-fee structure wins by a few percent because you are not paying a surcharge you rarely recover.

WorkloadWinnerWhy
Chat with long shared system promptTieHigh reuse, write fee amortizes
One-off long-document Q&AOpenAIFew reads, no write fee to recover
Agent with stable tool defsTieReused every turn
Need control over what cachesAnthropicExplicit breakpoints

What to do right now

  • Put static content first. System prompt, tool definitions, and examples belong at the front so the reusable prefix is as long as possible.
  • Measure your reuse ratio. Count reads per write cycle; that number decides whether write fees matter to you.
  • Do not fragment the prefix. A single changing token near the top invalidates the whole cache. See our cache-miss fixes in the coding cost guide.
  • Stack caching with other levers. Combine it with semantic caching at the app layer and KV cache optimization if you self-host.
  • Verify against real usage. Log cache hit rates in production, since real traffic rarely matches the estimate.
  • Confirm current pricing before committing; only trust the provider's own pricing page for exact numbers.

Frequently asked questions

Do all three really give the same 90% discount?

On cached reads, yes, as of mid-2026. The read discount is 90% across OpenAI, Anthropic, and Gemini. The differences are in write cost, control, and cache lifetime, not the headline read rate.

Why does Anthropic charge to write the cache?

Because it caches explicitly and holds the prefix for a defined TTL. The write surcharge (1.25x or 2x input) covers that. If you read the cache several times before it expires, the surcharge is easily recovered.

What breaks a cache hit?

Any change to the cached prefix. If a token near the top of the prompt varies per request, the reusable prefix ends there and you cache far less. Keep everything dynamic at the bottom of the prompt.

Is caching worth it for one-off requests?

No. Caching pays off through reuse. A single unique request with no repeat gets no benefit, and on Anthropic you would even pay a write fee for nothing. It shines on repeated prefixes.

#llm#cost-optimization#api

Sources & further reading

Keep reading