Diffusion LLMs Explained: The Fast New Text Models
How diffusion language models like LLaDA and Mercury generate text in parallel for huge speedups, and how they differ from GPT-style models.

Every mainstream chatbot you have used generates text the same way: one token after another, left to right, each token conditioned on the ones before it. That autoregressive approach is why long answers feel slow: the model cannot produce token 200 until it has produced token 199. Diffusion language models drop that constraint. They generate all positions at once and refine them together, and in 2026 models like LLaDA and Mercury have shown that the payoff can be several times faster generation at nearly matching quality.
Quick answer
A diffusion language model (dLLM) starts from a fully masked sequence and unmasks all positions in parallel over a handful of denoising steps, instead of predicting one token at a time. That breaks the left-to-right bottleneck and can deliver roughly 2x to 10x lower latency at small batch sizes. Mercury's 7B model reportedly hit about 1,100 tokens per second versus about 240 for a comparable autoregressive 8B, with only a ~1-point MMLU gap. dLLMs use bidirectional attention and usually no KV cache, so the speed comes from parallel decoding, not cheaper attention. They are promising for latency-critical work, but the ecosystem is still young.
Key takeaways
- Diffusion language models (dLLMs) start from a fully masked sequence and unmask all positions in parallel over a series of denoising steps.
- This breaks the one-token-at-a-time bottleneck of autoregressive models, enabling 2-10x lower latency at small batch sizes.
- Mercury's 7B reportedly hit ~1,100 tokens/sec versus ~240 for a comparable autoregressive 8B, with only a small quality gap on MMLU.
- They use bidirectional attention and typically no KV cache, a fundamentally different architecture from GPT-style transformers.
- They are early (fewer open checkpoints, less tooling) but a genuinely promising direction for latency-critical applications.
The autoregressive bottleneck
A standard LLM is autoregressive: it predicts the next token given everything so far, appends it, and repeats. This is inherently sequential. No matter how powerful your hardware, the chain of dependencies forces the model to finish each token before starting the next. For short replies that is fine; for long generations it dominates the time you spend waiting.
Diffusion models, borrowed conceptually from image generation, sidestep this by refining a whole sequence at once instead of extending it one token at a time.
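To make the sequential dependency concrete, here is a minimal sketch of an autoregressive decoding loop. The `toy_next_token` function is a hypothetical stand-in for a real model's forward pass, not any library's API; the point is only that each step needs the output of the previous one, so generating N tokens takes N sequential passes.

```python
from typing import Callable, List, Tuple


def toy_next_token(prefix: List[str]) -> str:
    """Hypothetical stand-in for one model forward pass.

    A real model would return the most likely next token; this toy just
    returns a label that depends on how long the prefix already is.
    """
    return f"tok{len(prefix)}"


def generate_autoregressive(
    prompt: List[str],
    n_new: int,
    next_token: Callable[[List[str]], str] = toy_next_token,
) -> Tuple[List[str], int]:
    """Extend the prompt one token at a time, counting forward passes."""
    seq = list(prompt)
    forward_passes = 0
    for _ in range(n_new):
        # Step k cannot start until step k-1 has appended its token.
        seq.append(next_token(seq))
        forward_passes += 1
    return seq, forward_passes


tokens, passes = generate_autoregressive(["<bos>"], n_new=200)
print(passes)  # 200 sequential passes for 200 new tokens
```

In this toy, the pass count always equals the number of new tokens, which is the bottleneck described above.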
How diffusion text generation works
The process inverts the usual mental model. Rather than building text from left to right, a dLLM:
- Starts with a sequence that is fully masked: every position is a blank to be filled.
- Runs a series of denoising steps. At each step, the model looks at the entire sequence and unmasks the tokens it is now confident about, refining all positions simultaneously.
- Repeats until the whole sequence is filled in and coherent.
Because every position is being refined in parallel at each step, the model is not waiting on a left-to-right chain. The number of denoising steps can be far smaller than the number of tokens, so the model needs far fewer sequential forward passes; that is where the speed comes from.
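The steps above can be sketched as a small toy loop. This is an illustration under stated assumptions, not the decoding schedule of LLaDA, Mercury, or any other specific model: `toy_denoiser` is a hypothetical stand-in for a bidirectional forward pass that proposes a token and a confidence for every masked position, and the "unmask the most confident positions, evenly spread over the steps" rule is one simple choice among several.

```python
import math
import random
from typing import Dict, List, Tuple

MASK = "<mask>"


def toy_denoiser(seq: List[str], rng: random.Random) -> Dict[int, Tuple[str, float]]:
    """Hypothetical stand-in for one forward pass over the whole sequence.

    Returns {position: (proposed_token, confidence)} for each masked position.
    A real model would compute these from the full bidirectional context.
    """
    return {i: (f"w{i}", rng.random()) for i, tok in enumerate(seq) if tok == MASK}


def generate_diffusion(length: int, steps: int, seed: int = 0) -> Tuple[List[str], int]:
    """Fill a fully masked sequence in at most `steps` denoising passes."""
    rng = random.Random(seed)
    seq = [MASK] * length
    forward_passes = 0
    for step in range(steps):
        n_masked = seq.count(MASK)
        if n_masked == 0:
            break
        proposals = toy_denoiser(seq, rng)
        forward_passes += 1
        # Spread the remaining masks evenly over the remaining steps.
        k = math.ceil(n_masked / (steps - step))
        # Commit only the k proposals the "model" is most confident about.
        most_confident = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in most_confident:
            seq[i] = proposals[i][0]
    return seq, forward_passes


seq, passes = generate_diffusion(length=64, steps=8)
print(passes)       # 8 passes for 64 tokens
print(MASK in seq)  # False: every position has been filled
```

With 64 positions and 8 steps, each pass commits 8 tokens, so the sequence is complete after 8 passes rather than 64. Real models trade step count against quality, so the ratio in practice is a tuning decision rather than a fixed property.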

The architecture is genuinely different
These are not autoregressive transformers with a trick bolted on. LLaDA, for example, is described as BERT-like and uses bidirectional attention (every token can attend to tokens on both sides) because the model is refining the whole sequence rather than predicting the future. That also means it typically does not use a KV cache, the standard optimization that lets autoregressive models reuse past computation.
That difference cuts both ways. Bidirectional attention can help coherence and editing, but losing the KV cache removes a major efficiency trick of the autoregressive world, so the speed advantage of dLLMs comes specifically from parallel decoding rather than from cheaper attention.
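A small sketch may help show why the two attention patterns treat caching so differently. The helper names here are hypothetical and the masks are plain boolean grids rather than anything from a real framework. The idea: under a causal mask, an early position never sees later ones, so whatever was computed for it stays valid as the sequence grows; under a bidirectional mask, every position sees every other, so a change anywhere can invalidate everything.

```python
from typing import List

Mask = List[List[bool]]


def causal_mask(n: int) -> Mask:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n: int) -> Mask:
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def visible_positions(mask: Mask, i: int) -> List[int]:
    """Positions that position i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


def cache_stays_valid(mask: Mask, cached: int, changed: int) -> bool:
    """Cached work for `cached` survives a change at `changed`
    only if `cached` cannot attend to `changed`."""
    return not mask[cached][changed]


n = 4
print(visible_positions(causal_mask(n), 1))         # [0, 1]
print(visible_positions(bidirectional_mask(n), 1))  # [0, 1, 2, 3]
print(cache_stays_valid(causal_mask(n), cached=1, changed=3))         # True
print(cache_stays_valid(bidirectional_mask(n), cached=1, changed=3))  # False
```

This is a simplification of what a multi-layer transformer does, but it captures the dependency structure that makes a standard KV cache a natural fit for causal attention and an awkward one for bidirectional refinement.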
The numbers so far
The 2026 results are striking. Mercury, a commercial dLLM from Inception Labs, reported its 7B model generating around 1,100 tokens per second while scoring 71.9 on MMLU, against a comparable autoregressive 8B model at roughly 240 tokens per second and 73.1 MMLU. That is about 4.6x faster for a 1.2-point quality difference. The published latency advantage at small batch sizes lands in the 2-10x range.
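The arithmetic behind those figures is easy to check. The sketch below uses only the reported throughput and MMLU numbers quoted above; the 1,000-token response length is an assumption added for illustration, and the timing ignores prompt processing and network overhead.

```python
# Reported figures quoted in the text above.
dllm_tokens_per_s = 1100.0   # Mercury 7B (reported)
ar_tokens_per_s = 240.0      # comparable autoregressive 8B (reported)
dllm_mmlu = 71.9
ar_mmlu = 73.1

speedup = dllm_tokens_per_s / ar_tokens_per_s
mmlu_gap = ar_mmlu - dllm_mmlu

# Assumed response length, for illustration only.
response_tokens = 1000
dllm_seconds = response_tokens / dllm_tokens_per_s
ar_seconds = response_tokens / ar_tokens_per_s

print(f"speedup: {speedup:.1f}x")            # speedup: 4.6x
print(f"MMLU gap: {mmlu_gap:.1f} points")    # MMLU gap: 1.2 points
print(f"dLLM: {dllm_seconds:.2f} s")         # dLLM: 0.91 s
print(f"autoregressive: {ar_seconds:.2f} s") # autoregressive: 4.17 s
```

At these rates, a 1,000-token answer would take under a second of pure generation time on the dLLM versus a little over four seconds on the autoregressive model.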
On the open side, LLaDA-8B checkpoints, both base and instruct, are available to experiment with, and a growing body of research is pushing dLLM quality and speed further.
The differences from a GPT-style transformer are structural, not cosmetic. This is the side-by-side that matters when you are deciding whether a dLLM fits your workload:
| Property | Autoregressive (GPT-style) | Diffusion (dLLM) |
|---|---|---|
| Generation order | One token, left to right | All positions in parallel, refined over steps |
| Attention | Causal (past only) | Bidirectional (both directions) |
| KV cache | Yes, reuses past computation | Usually none |
| Latency at batch size 1 | Baseline | Roughly 2x to 10x lower |
| Big-batch server throughput | Strong, mature tricks | Weaker, less tooling |
| Ecosystem maturity (mid-2026) | Deep | Early, thin |
Tip
Diffusion LLMs shine brightest at small batch sizes, where their parallelism is not competing with the throughput tricks that favor autoregressive models in big-batch server settings. If your bottleneck is per-request latency rather than aggregate throughput, they are worth benchmarking.
Should you use one yet?
For most production systems in mid-2026, the honest answer is not yet as a default, but keep watching. The ecosystem is young: fewer battle-tested open checkpoints, thinner tooling, and less community knowledge than the mature autoregressive stack. Today's serving engines are built around autoregressive architectures, so deployment is less turnkey.
Where dLLMs earn a serious look is latency-critical work (interactive coding assistants, real-time agents), where the parallel-decoding speedup directly improves the experience. Mercury Coder, which targets code generation, is exactly this bet. For everything else, the best open-weight autoregressive LLMs remain the safe choice, and speed tricks such as speculative decoding narrow the latency gap from the autoregressive side.
What to do right now
If you are deciding whether to invest time in dLLMs today, run this short checklist:
- Is per-request latency your bottleneck, not aggregate throughput? If yes, benchmark a dLLM; if no, stay autoregressive.
- Pull LLaDA-8B (base or instruct) for a free, open experiment before committing to a paid API.
- Try Mercury via API if you want a turnkey commercial dLLM, especially Mercury Coder for code generation.
- Benchmark at batch size 1, where the parallel-decoding advantage actually shows up.
- Do not rip out a working autoregressive stack for a default workload yet; the tooling gap is real.
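For the benchmarking items in this checklist, a minimal latency harness could look like the sketch below. Everything here is an assumption for illustration: `measure_latency` and `fake_generate` are hypothetical names, and `fake_generate` merely sleeps to imitate a model. In practice you would swap in a function that sends one prompt (batch size 1) to the model you are testing and returns the generated tokens.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency(
    generate: Callable[[str], List[str]],
    prompt: str,
    runs: int = 5,
    warmup: int = 1,
) -> Dict[str, float]:
    """Time single-request generation and report simple latency statistics."""
    # Warm-up calls are excluded so one-time setup costs do not skew results.
    for _ in range(warmup):
        generate(prompt)

    latencies: List[float] = []
    token_counts: List[int] = []
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        latencies.append(time.perf_counter() - start)
        token_counts.append(len(tokens))

    median_s = statistics.median(latencies)
    mean_tokens = statistics.mean(token_counts)
    return {
        "runs": runs,
        "median_s": median_s,
        "min_s": min(latencies),
        "max_s": max(latencies),
        "tokens_per_s": mean_tokens / median_s if median_s > 0 else float("inf"),
    }


def fake_generate(prompt: str) -> List[str]:
    """Hypothetical stand-in for a model call: sleeps, then returns tokens."""
    time.sleep(0.01)
    return prompt.split() * 10


stats = measure_latency(fake_generate, "write a sorting function")
print(stats)
```

Running the same harness against a dLLM and an autoregressive baseline, with identical prompts and output lengths, gives a like-for-like view of per-request latency. The median is used here because a few slow outliers can distort the mean.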
Frequently asked questions
How do diffusion LLMs differ from diffusion image models?
The core idea is shared (start from noise or masks and denoise iteratively), but the data differs. Image diffusion refines pixels; text diffusion refines tokens, unmasking positions over denoising steps. Text diffusion also has to produce discrete tokens rather than continuous pixel values, which changes the mechanics.
Are diffusion LLMs better than GPT-style models?
Not strictly better, just different. Their advantage is parallel decoding, which gives much lower latency at small batch sizes. They currently trail the best autoregressive models slightly on some quality benchmarks and have a far less mature ecosystem. The trade-off favors them most when per-request speed is the priority.
Why don't diffusion LLMs use a KV cache?
The KV cache is an optimization for autoregressive models that reuse computation from earlier tokens as they extend a sequence left to right. Diffusion models refine the whole sequence at once with bidirectional attention, so there is no left-to-right history to cache in the same way. Their speed comes from parallelism instead.
Can I run a diffusion LLM today?
Yes. On the open side, LLaDA-8B base and instruct checkpoints are published and can be deployed on cloud GPUs, and commercial options like Mercury are available via API. Tooling is thinner than for autoregressive models, so expect more manual setup than when running a mainstream open-weight model.


