Prefill-Decode Disaggregation for LLM Serving (2026)

Splitting prefill and decode onto separate GPU pools is the new default for high-throughput LLM serving. Here is why it cuts latency and cost.

Sam CarterJun 23, 2026, 5:20 PM 8 min read

Cover image for Prefill-Decode Disaggregation for LLM Serving (2026) — Photo: Gknor / wikimedia (BY-SA 4.0)

For two years the standard way to serve a large language model was to run prompt processing and token generation on the same GPU. In 2026 the frontier stacks stopped doing that, and the reason is simple: those two phases want opposite hardware.

Quick answer

Prefill-decode disaggregation runs the two phases of LLM inference on separate GPU pools. Prefill (processing the prompt) is compute-bound and wants raw FLOPs; decode (generating tokens one at a time) is memory-bandwidth-bound and wants fast memory. Splitting them lets each pool be sized and tuned independently, which removes head-of-line blocking, stabilizes tail latency, and raises tokens-per-dollar. The KV cache is transferred between pools over a fast interconnect.

Key takeaways

Prefill and decode have opposite bottlenecks. One is compute-bound, the other memory-bandwidth-bound.
Colocating them causes interference. A long prompt stalls token generation for every other request on the GPU.
Disaggregation gives each phase its own pool, so you scale prefill and decode independently.
The KV cache is the medium of exchange, shipped from prefill workers to decode workers over a fast link.
It is now the default deployment shape in production stacks and NVIDIA Dynamo on Blackwell.

Why one GPU cannot serve both phases well

An LLM request has two distinct stages. Prefill reads the entire prompt in a single forward pass and builds the key-value (KV) cache. This is highly parallel and saturates the GPU's compute units. Decode then generates output tokens one at a time, each step reading the whole KV cache back from memory. Decode barely touches the compute units but hammers memory bandwidth.

When both run on the same GPU, they fight. A request with a 100,000-token prompt monopolizes the compute units during prefill, and every decode step for every other in-flight request waits behind it. This is head-of-line blocking, and it is the single biggest cause of unstable tail latency in colocated serving.

Phase	Bottleneck	What it wants	Batching behavior
Prefill	Compute (FLOPs)	Many tensor cores	Batches poorly across varied prompt lengths
Decode	Memory bandwidth	Fast HBM, large KV room	Batches extremely well, many requests at once

Because the two phases want different things, tuning a colocated server is a compromise that satisfies neither. You either underuse compute during long decode phases or starve decode latency during heavy prefill.

Rows of GPU servers in a data center running LLM inference — Photo: 紅色死神 / flickr (BY-NC-SA 2.0)

How disaggregation works

The fix is architectural. You run two fleets:

Prefill workers optimized for compute throughput. They accept prompts, run the single forward pass, and produce the KV cache.
Decode workers optimized for memory bandwidth and KV capacity. They receive the cache and stream tokens back to the user.

The KV cache is transferred from a prefill worker to a decode worker over a fast interconnect such as NVLink or InfiniBand. Production systems at several labs, and NVIDIA's Dynamo runtime on Blackwell, ship this as the default. The transfer cost is real but small relative to the interference it eliminates.

The KV cache is the interface

Disaggregation only pays off because the KV cache is a clean handoff point. Once prefill finishes, everything the decode phase needs is in that cache. Some stacks go further and disaggregate the KV cache itself into a separate memory tier, so a cache produced once can be reused by many decode workers or across requests that share a prefix.

Where the wins come from

No head-of-line blocking. Long prompts no longer stall unrelated decode steps.
Independent scaling. A workload heavy on long prompts adds prefill GPUs; a chatty workload adds decode GPUs.
Better batching. Decode workers batch dozens of requests together because they all do the same tiny per-step work.
Higher GPU utilization. Each pool runs closer to its own hardware ceiling.

When it is worth the complexity

Disaggregation adds moving parts: two fleets, a scheduler, and a cache transfer path. It is not free, and small deployments should not reach for it first.

Situation	Recommendation
Single GPU, low traffic	Stay colocated; disaggregation overhead is not worth it
Mixed long and short prompts at scale	Disaggregate; interference is your main pain
Very long shared prefixes (RAG, system prompts)	Disaggregate and add cross-request KV reuse
Latency-sensitive chat at high volume	Disaggregate to protect decode tail latency

If your prompts are short and uniform, colocated serving with continuous batching may already be near optimal. The gains from disaggregation grow with prompt-length variance and traffic.

What to do right now

Measure your prefill-to-decode ratio. Log prompt tokens versus generated tokens per request; the skew tells you which pool to size larger.
Check your tail latency, not the average. Head-of-line blocking shows up at p95 and p99, not the mean.
Turn on continuous batching first if you have not; it is the cheaper win and a prerequisite for good decode batching.
Try prompt caching before adding hardware, since reusing a KV cache avoids prefill entirely for repeated prefixes. Our guide to semantic caching for LLM apps covers the application-layer version.
Evaluate a runtime that supports disaggregation natively rather than building the cache-transfer plumbing yourself.
Stack the other layers too. See KV cache optimization for LLM inference and which inference engine to pick.

Frequently asked questions

Does disaggregation lower cost or just latency?

Both, but by different mechanisms. Latency improves because decode stops waiting behind prefill. Cost improves because each pool runs at higher utilization and you stop overprovisioning one GPU type to cover both phases.

Is the KV cache transfer expensive?

It is a real cost, but on a fast interconnect it is small compared to the interference and idle time it removes. Stacks that reuse the cache across requests recover the transfer cost many times over.

Do I need Blackwell GPUs for this?

No. Disaggregation is an architectural pattern and works across GPU generations. Newer hardware and runtimes like NVIDIA Dynamo make it easier to adopt, but the idea predates them.

How does this relate to speculative decoding?

They are complementary. Disaggregation is about where the two phases run; speculative decoding speeds up the decode phase itself by drafting several tokens at once. You can and often should use both.

#llm-inference#gpu#serving