KV Cache Optimization: Faster LLM Serving in 2026

What the KV cache is, why it eats GPU memory, and how PagedAttention, GQA, and quantization cut waste for cheaper LLM inference in 2026.

Sam Carter 8 min read

When people complain that self-hosting an LLM is expensive, the hidden culprit is usually the KV cache. It is the memory buffer that makes generation fast, but it grows with every token in every concurrent request, and on a busy server it can consume more GPU memory than the model weights themselves. Understanding and optimizing the KV cache is the single biggest lever for serving more users on the same hardware in 2026.

Quick answer

The KV cache stores the attention keys and values the model already computed so it never recomputes them, which is what makes generation fast, but its size grows with sequence length and batch size and often exceeds the model weights on a busy server. The biggest win is PagedAttention (from vLLM), which manages the cache in small blocks like OS virtual memory, cutting waste from 60 to 80 percent down to under 4 percent and lifting throughput 2 to 4x. Architecture choices like grouped-query attention shrink the cache further, and quantization, eviction, and offloading trim memory at runtime.

Key takeaways

  • The KV cache stores previously computed attention keys and values so the model never recomputes them, turning per-step attention cost from quadratic to linear.
  • Its size grows with sequence length and batch size, so it often dominates GPU memory on a busy server.
  • PagedAttention stores the cache in non-contiguous blocks like OS virtual memory, cutting waste from 60-80% to under 4% and lifting throughput 2-4x.
  • Grouped-query attention (GQA) and multi-query attention (MQA) shrink the cache by sharing key/value heads across query heads.
  • Quantization, eviction, and offloading trade a little quality or speed for much smaller memory.

What the KV cache is

During generation, the attention mechanism computes a key and a value for every token. Without caching, producing each new token would require recomputing keys and values for the entire sequence so far, which is wasteful because those tokens have not changed. The KV cache stores those tensors once and reuses them, so generating token N only computes the new token's key, value, and query, then attends against the cached history.

This is what makes autoregressive generation tractable. It reduces the per-step attention work from quadratic in sequence length to linear. The price is memory: every cached key and value has to live in GPU memory for the life of the request.
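To make the saving concrete, here is a minimal sketch of a single attention head decoding with and without a cache. It is a toy in plain Python, not any engine's implementation: the 4-dimensional vectors, the random weight matrices `W_q`, `W_k`, `W_v`, and the helper names are all assumptions made for illustration.

```python
import math
import random

def project(matrix, vector):
    """Multiply a square weight matrix by a token vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def attend(query, keys, values):
    """Softmax attention of one query over the given keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]

rng = random.Random(0)
DIM = 4  # toy embedding size (assumption)
W_q, W_k, W_v = ([[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]
                 for _ in range(3))
tokens = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(6)]

def decode_without_cache(tokens):
    outputs, kv_projections = [], 0
    for step in range(1, len(tokens) + 1):
        history = tokens[:step]
        keys = [project(W_k, t) for t in history]    # recomputed every step
        values = [project(W_v, t) for t in history]  # recomputed every step
        kv_projections += 2 * step
        outputs.append(attend(project(W_q, history[-1]), keys, values))
    return outputs, kv_projections

def decode_with_cache(tokens):
    key_cache, value_cache, outputs, kv_projections = [], [], [], 0
    for token in tokens:
        key_cache.append(project(W_k, token))    # only the new token's key
        value_cache.append(project(W_v, token))  # only the new token's value
        kv_projections += 2
        outputs.append(attend(project(W_q, token), key_cache, value_cache))
    return outputs, kv_projections

uncached_out, uncached_projections = decode_without_cache(tokens)
cached_out, cached_projections = decode_with_cache(tokens)
print(uncached_projections, cached_projections)  # 42 12
```

Both loops produce the same outputs for these six tokens, but the uncached loop computes 42 key/value projections where the cached loop computes 12, and the gap widens as the sequence grows.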

Why it dominates memory

The cache's footprint scales with several factors at once: the number of layers, the number of key/value heads and their dimension, the numeric precision, the sequence length, and the number of concurrent requests in the batch. Long contexts and high concurrency, exactly what production serving wants, multiply these together. On a busy server, the KV cache often consumes more memory than the model weights, and when it runs out, you cannot admit more requests. KV cache memory, not raw compute, is often the binding constraint on how many users you can serve.
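A back-of-the-envelope sketch shows how the cache caps concurrency. Every number here is an assumption chosen for round arithmetic: an 80 GiB GPU, 16 GiB of weights, and 128 KiB of cache per token. It also ignores activations and framework overhead, so treat the result as an upper bound.

```python
KIB = 1024
GIB = 1024 ** 3

def max_concurrent_sequences(gpu_bytes, weight_bytes, kv_bytes_per_token, tokens_per_sequence):
    """Upper bound on sequences whose full KV cache fits next to the weights."""
    free_bytes = gpu_bytes - weight_bytes
    if free_bytes <= 0:
        return 0
    return free_bytes // (kv_bytes_per_token * tokens_per_sequence)

# Hypothetical setup: 80 GiB GPU, 16 GiB of weights, 128 KiB of cache per token.
at_8k = max_concurrent_sequences(80 * GIB, 16 * GIB, 128 * KIB, 8192)
at_32k = max_concurrent_sequences(80 * GIB, 16 * GIB, 128 * KIB, 32768)
print(at_8k, at_32k)  # 64 16
```

Under these assumptions the 64 sequences at 8K context hold 64 GiB of cache, four times the size of the weights, and quadrupling the context cuts admissible concurrency to a quarter.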

PagedAttention: the breakthrough

The classic problem was fragmentation. Naive serving allocates a contiguous block of memory for each request's maximum possible length, wasting 60-80% of it because most requests are shorter. PagedAttention, introduced with vLLM, borrows the operating-system idea of virtual memory: it splits the cache into fixed-size blocks (typically 16 tokens) and allocates them on demand as a sequence grows, with a block table mapping logical positions to physical memory.

The effect is dramatic. Memory waste drops from 60-80% to under 4%, which means far more requests fit in the same GPU and throughput rises 2-4x. PagedAttention is now standard in serious serving stacks, and most of the inference engines compared in vLLM vs Ollama vs llama.cpp implement it or an equivalent.
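The block-table idea can be sketched in a few lines. This is a toy allocator written for illustration, not vLLM's code or API: the class name `PagedKVAllocator` and its methods are hypothetical, and it tracks only block bookkeeping, not real tensors.

```python
BLOCK_SIZE = 16  # tokens per block, the typical size mentioned above

class PagedKVAllocator:
    """Toy allocator: hands out fixed-size cache blocks on demand per sequence."""

    def __init__(self, num_blocks, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}                      # seq_id -> physical block ids
        self.lengths = {}                           # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token, allocating a block only when needed."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:  # first token, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def locate(self, seq_id, position):
        """Map a logical token position to (physical block, offset in block)."""
        table = self.block_tables[seq_id]
        return table[position // self.block_size], position % self.block_size

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self):
        reserved = sum(len(t) for t in self.block_tables.values()) * self.block_size
        return reserved - sum(self.lengths.values())

allocator = PagedKVAllocator(num_blocks=64)
for i in range(100):  # two sequences growing at the same time
    allocator.append_token("b")
    if i < 40:
        allocator.append_token("a")

print(allocator.block_tables["a"])  # [62, 60, 58]
print(allocator.wasted_slots())     # 20
```

Sequence "a" ends up in three physical blocks that are not adjacent, and the only unused memory is the tail of each sequence's last block, so waste per sequence is always less than one block.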

Tip

If you are serving an LLM and hitting out-of-memory errors or low concurrency, check whether your stack uses paged KV-cache management before buying a bigger GPU. Switching to a serving engine with PagedAttention often multiplies your effective capacity on the hardware you already have.

Shrinking the cache at the architecture level

Beyond memory management, model architecture decides how big the cache is in the first place:

  • Multi-query attention (MQA) has all query heads share a single key and value head, shrinking the cache sharply at some quality cost.
  • Grouped-query attention (GQA) is the popular middle ground: query heads are split into groups, each sharing one key/value head. Most modern open-weight models use GQA precisely to keep the KV cache manageable.

These choices are baked into the model, so they matter when you select one: a GQA model is far cheaper to serve at long context than an older multi-head design.
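The head sharing described above comes down to an integer division. The sketch below assumes a hypothetical model with 32 query heads; the function names are illustrative and not taken from any library.

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Return the index of the key/value head that a query head reads from."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

def cache_reduction_factor(num_query_heads, num_kv_heads):
    """How much smaller the cache is than with one KV head per query head."""
    return num_query_heads // num_kv_heads

NUM_QUERY_HEADS = 32  # hypothetical model

mha = [kv_head_for_query_head(q, NUM_QUERY_HEADS, 32) for q in range(NUM_QUERY_HEADS)]
gqa = [kv_head_for_query_head(q, NUM_QUERY_HEADS, 8) for q in range(NUM_QUERY_HEADS)]
mqa = [kv_head_for_query_head(q, NUM_QUERY_HEADS, 1) for q in range(NUM_QUERY_HEADS)]

print(gqa[:8])                                      # [0, 0, 0, 0, 1, 1, 1, 1]
print(cache_reduction_factor(NUM_QUERY_HEADS, 8))   # 4
```

With 8 KV heads, every group of four query heads reads the same cached keys and values, so only a quarter as many need to be stored. Multi-head attention is the special case with one KV head per query head, and MQA is the case where every query head maps to head 0.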

Here is how the main levers compare, so you can pick the ones that fit your situation:

| Technique | What it does | Memory win | Cost / trade-off | When to use |
| --- | --- | --- | --- | --- |
| PagedAttention | Block-based cache allocation | Waste 60-80% to under 4% | Needs a modern serving engine | Almost always; first thing to check |
| GQA / MQA | Share key/value heads | 2x to 8x smaller cache | Baked into the model, slight quality cost | Choosing which model to serve |
| KV quantization | Lower-precision keys/values | ~2x to 4x smaller | Minor quality loss | Long context, high concurrency |
| Eviction | Drop low-value cached tokens | Bounded cache size | Quality loss on long context | Very long sessions |
| Offloading | Move cache to CPU RAM | Frees GPU memory | Slower (PCIe transfers) | GPU memory is the hard limit |
| Prefix caching | Reuse shared prompt prefix | Skips recompute of prefix | Only helps shared prefixes | Many requests share a system prompt |

Runtime tricks for tight memory

When architecture is fixed, serving systems apply further optimizations:

  • KV-cache quantization stores the cached keys and values at lower precision, cutting their memory with minor quality impact. It is the same idea as weight quantization, applied to the cache.
  • Eviction drops less-important cached tokens to make room, accepting some quality loss on very long contexts.
  • Offloading moves parts of the cache to CPU memory when GPU memory is tight, trading speed for capacity.
  • Prefix caching reuses the cached representation of a shared prompt prefix across requests, which compounds with the prompt-caching cost savings discussed in AI coding agent costs.
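As an illustration of the first trick in that list, here is a minimal sketch of symmetric int8 quantization applied to one cached key vector. The per-vector scale and the int8 format are assumptions chosen for simplicity; real serving engines use their own formats and granularities.

```python
def quantize_int8(vector):
    """Symmetric per-vector quantization: returns (integer codes, scale)."""
    peak = max(abs(x) for x in vector)
    scale = peak / 127 if peak > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vector]
    return codes, scale

def dequantize_int8(codes, scale):
    """Recover approximate floats from the stored codes."""
    return [c * scale for c in codes]

key_vector = [0.12, -0.98, 0.45, 0.03]  # made-up values for illustration
codes, scale = quantize_int8(key_vector)
restored = dequantize_int8(codes, scale)
worst_error = max(abs(a - b) for a, b in zip(key_vector, restored))
print(codes, worst_error)
```

Each element now takes one byte instead of two (for a 16-bit cache), plus one scale per vector, and the rounding error per element is at most half the scale. Whether that error matters for your outputs is exactly what the eval run is for.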

What to do right now

If you are serving an LLM and want more concurrency or lower cost without new hardware, work this list:

  • Confirm your serving engine uses PagedAttention (or an equivalent block-based cache). If it does not, switching engines often multiplies your effective capacity for free. See vLLM vs Ollama vs llama.cpp.
  • Pick a GQA model when you have a choice, since it cuts the cache size at the architecture level before any runtime trick.
  • Enable prefix caching if your requests share a long system prompt, a very common and cheap win for agents and chat apps.
  • Turn on KV-cache quantization and validate quality on your own eval set; for long-context, high-concurrency workloads the memory saving usually outweighs the small precision loss.
  • Reach for offloading or eviction only when GPU memory is the hard ceiling, since both trade speed or quality for capacity.
  • Measure before buying a bigger GPU. KV-cache memory, not compute, is usually the binding constraint, so optimizing it first is the cheaper fix.

Frequently asked questions

Does the KV cache improve quality or just speed?

Just speed and cost. It does not change output quality. It is a pure optimization that avoids recomputing attention keys and values the model already calculated. The output with and without caching is identical; the cache only makes generation faster and cheaper.

Why does the KV cache use so much memory?

Because it stores keys and values for every token across every layer and attention head, for every concurrent request. Those dimensions multiply, so long contexts and many simultaneous users make it grow fast, often past the size of the model weights themselves on a busy server.

What is PagedAttention in simple terms?

It manages KV-cache memory the way an operating system manages RAM: in small fixed-size blocks allocated on demand rather than one big contiguous chunk per request. This nearly eliminates wasted memory, letting many more requests share a GPU and raising throughput several-fold.

Will quantizing the KV cache hurt my model?

It introduces a small precision loss, similar to weight quantization, but for most workloads the quality impact is minor while the memory saving is large. Test it on your own evaluation set; for long-context, high-concurrency serving the trade is usually well worth it.

How do I estimate KV-cache size for my model?

Roughly, the cache size is 2 (keys and values) times the number of layers, times the number of key/value heads, times the head dimension, times the bytes per element, times the sequence length, times the batch size. Everything before the sequence length is a fixed per-token cost, so doubling context or batch size doubles the cache. A GQA model with few KV heads has a much smaller per-token cost than a multi-head model, which is the whole point of GQA.
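That estimate fits in a few lines of Python. The model shape used below (32 layers, head dimension 128, 16-bit elements, 8 versus 32 KV heads) is an assumed example, not a measurement of any particular model.

```python
GIB = 1024 ** 3

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_element=2):
    """Estimate total KV-cache size; the leading 2 covers keys and values."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * seq_len * batch_size

# Hypothetical 32-layer model, 8K context, 16 concurrent requests, 16-bit cache.
gqa_bytes = kv_cache_bytes(32, 8, 128, seq_len=8192, batch_size=16)
mha_bytes = kv_cache_bytes(32, 32, 128, seq_len=8192, batch_size=16)
print(gqa_bytes / GIB, mha_bytes / GIB)  # 16.0 64.0
```

With these assumed numbers the GQA variant needs 16 GiB of cache where the multi-head variant needs 64 GiB, and doubling either `seq_len` or `batch_size` doubles both figures.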

Does prefix caching change my outputs?

No. Like the base KV cache, prefix caching is a pure optimization: it reuses the computed representation of a shared prompt prefix so the model skips recomputing it, but the generated tokens are identical. It only helps when many requests actually share that prefix, such as a common system prompt across users.
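One common way to detect a shared prefix is to hash the prompt in fixed-size blocks, chaining each block's hash with everything before it. The sketch below is loosely modeled on that idea and is not any engine's actual implementation: the `PrefixCache` class, the 16-token block size, and the integer token ids are assumptions, and a string stands in for the real cached tensors.

```python
import hashlib

class PrefixCache:
    """Toy prefix cache keyed by a chained hash over full blocks of token ids."""

    def __init__(self, block_size=16):
        self.block_size = block_size
        self.blocks = {}  # chained hash -> placeholder for that block's cached KV

    def _block_keys(self, token_ids):
        keys, digest = [], hashlib.sha256()
        full = len(token_ids) - len(token_ids) % self.block_size
        for start in range(0, full, self.block_size):
            block = token_ids[start:start + self.block_size]
            digest.update(",".join(map(str, block)).encode() + b";")
            keys.append(digest.copy().hexdigest())  # depends on all earlier blocks
        return keys

    def admit(self, token_ids):
        """Register a prompt and return how many of its tokens were already cached."""
        reused, still_matching = 0, True
        for key in self._block_keys(token_ids):
            if still_matching and key in self.blocks:
                reused += self.block_size
            else:
                still_matching = False
                self.blocks[key] = "kv-block"  # stand-in for real tensors
        return reused

cache = PrefixCache()
system_prompt = list(range(64))  # 64 made-up token ids shared by every request
first = cache.admit(system_prompt + [1000 + i for i in range(10)])
second = cache.admit(system_prompt + [2000 + i for i in range(20)])
print(first, second)  # 0 64
```

The first request pays full price, while the second reuses all 64 tokens of the shared system prompt and only computes its own suffix. Because the cached blocks hold the same keys and values the model would have computed anyway, the generated tokens do not change.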
