Speculative Decoding: 2-4x Faster LLM Inference

How draft-and-verify speculative decoding speeds up LLM token generation 2-4x in 2026 with no loss in output quality.

Sam Carter

Large language models generate text one token at a time, and each token requires a full forward pass through billions of parameters. That sequential bottleneck is why long responses feel slow. Speculative decoding attacks it directly: instead of paying full price for every token, a small model races ahead and guesses several tokens, and the big model checks all the guesses in a single pass. When the guesses are good, you get multiple tokens for roughly the cost of one. By 2026 the technique has gone from research curiosity to production standard.

Quick answer

Speculative decoding speeds up LLM generation 2 to 4x by having a small, fast draft model propose several tokens and a large target model verify them all in one forward pass. Accepted tokens follow the target model's exact probability distribution, so output quality is mathematically identical, not approximated. It is built into vLLM, SGLang, and TensorRT-LLM, so most teams enable it with a config flag. Gains are largest on predictable, structured text, where acceptance rates often exceed 80%.

Key takeaways

  • Speculative decoding uses a small draft model to propose several tokens, then a large target model verifies them in parallel.
  • Accepted tokens follow the exact same probability distribution as normal decoding, so output quality is mathematically identical.
  • Real-world acceptance rates often exceed 80%, delivering 2-4x speedups in production.
  • It is built into vLLM, SGLang, and TensorRT-LLM, so most teams enable it with a config flag.
  • Gains depend on workload predictability; structured or repetitive text accelerates more than highly creative output.

The sequential bottleneck

Autoregressive generation is inherently serial. To produce token 50, the model needs token 49, which needs token 48, and so on. Each step lights up the entire network. The hardware spends most of its time moving weights in and out of memory rather than doing math, so a single token barely uses the GPU's compute capacity. That underutilization is the opening speculative decoding exploits.
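A rough back-of-envelope sketch of that memory-bound regime is below. It assumes every weight is read from GPU memory once per generated token and ignores the KV cache, compute time, and kernel overheads, so it is a lower bound rather than a measurement. The model size, precision, and bandwidth figures are hypothetical, chosen only for illustration.

```python
def decode_ms_per_token(n_params: float, bytes_per_param: float,
                        mem_bandwidth_gb_s: float) -> float:
    """Lower bound on per-token latency when each step must stream all weights once."""
    weight_bytes = n_params * bytes_per_param
    seconds = weight_bytes / (mem_bandwidth_gb_s * 1e9)
    return seconds * 1e3


# Hypothetical setup: 70B parameters, 16-bit weights, 2000 GB/s memory bandwidth.
ms = decode_ms_per_token(70e9, 2, 2000)
print(f"{ms:.0f} ms per token, about {1000 / ms:.0f} tokens/s at batch size 1")
```

Under these assumptions the ceiling is set by memory bandwidth, not arithmetic, which is why checking several tokens in the same pass costs little extra.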

Draft and verify

The mechanism is two models working together:

  1. A small, fast draft model generates a short run of candidate tokens, say four or five, by itself.
  2. The large target model takes those candidates and verifies them all in one forward pass, because checking a known sequence is parallelizable in a way that generating one is not.
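  3. Tokens that match what the target model would have produced are accepted (with sampling, each draft token is instead accepted with a probability that depends on how the two models' probabilities compare). At the first rejection, the rest are discarded and the target model supplies the correct next token itself. If every draft token is accepted, the same pass also yields one extra token from the target.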
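The steps above can be sketched as a toy loop for the greedy-decoding case. This is a minimal illustration, not how any serving framework implements it: the two "models" are hypothetical stand-in functions that map a token list to the next token, and the verification loop emulates what a real engine would do in one batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def speculative_greedy(prefix: List[int], draft_next: NextToken,
                       target_next: NextToken, k: int,
                       max_new: int) -> Tuple[List[int], int]:
    """Greedy draft-and-verify. Returns (tokens, number of target verification passes)."""
    out = list(prefix)
    target_passes = 0
    while len(out) - len(prefix) < max_new:
        # Step 1: the draft model proposes k tokens on its own.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # Step 2: one target pass scores every position (emulated here with a loop).
        target_passes += 1
        emitted: List[int] = []
        for i in range(k):
            t = target_next(out + draft[:i])
            emitted.append(t)          # the target's token is always what gets emitted
            if t != draft[i]:          # Step 3: first mismatch, discard the rest
                break
        else:
            # Every draft token matched, so the same pass yields one bonus token.
            emitted.append(target_next(out + draft))
        out.extend(emitted)
    return out[: len(prefix) + max_new], target_passes


# Hypothetical toy models: the target counts upward mod 10;
# the draft agrees except right after a 7.
def target_next(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def draft_next(seq: List[int]) -> int:
    return 0 if seq[-1] == 7 else (seq[-1] + 1) % 10


fast, passes = speculative_greedy([0], draft_next, target_next, k=4, max_new=12)

slow = [0]
for _ in range(12):
    slow.append(target_next(slow))   # plain decoding: one target pass per token

print(fast == slow, passes)  # prints: True 3
```

In this toy run the output matches plain greedy decoding token for token, while the target is invoked 3 times instead of 12.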

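Because the verification step enforces the target model's own distribution, the output is distributed exactly as if the big model had generated it alone. With greedy decoding that means the same tokens; with sampling, the accept-or-resample rule leaves the sampling distribution unchanged. This is the crucial property: it is a speedup, not an approximation. Quality metrics should match standard decoding, apart from ordinary sampling noise and small numerical differences between batched and single-token passes.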
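For sampled (non-greedy) decoding, the standard acceptance rule from the speculative sampling literature can be sketched for a single token as follows. The draft proposes a token from its distribution q, the proposal is accepted with probability min(1, p/q), and on rejection a token is drawn from the normalized residual max(0, p - q). The two distributions here are made-up toy values; the empirical check at the end is only a sanity test, not a proof.

```python
import random
from collections import Counter


def speculative_sample(p, q, rng):
    """Draw one token distributed according to target p, using a proposal from draft q."""
    tokens = range(len(p))
    x = rng.choices(tokens, weights=q)[0]            # draft proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):          # accept with prob min(1, p/q)
        return x, True
    # Rejected: resample from the residual distribution max(0, p - q), renormalized.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(tokens, weights=residual)[0], False


p = [0.6, 0.3, 0.1]   # hypothetical target distribution over 3 tokens
q = [0.2, 0.5, 0.3]   # hypothetical draft distribution over the same tokens

rng = random.Random(0)
n = 100_000
counts = Counter()
accepted = 0
for _ in range(n):
    token, ok = speculative_sample(p, q, rng)
    counts[token] += 1
    accepted += ok

freqs = [counts[t] / n for t in range(len(p))]
accept_rate = accepted / n
print("empirical:", [round(f, 3) for f in freqs], "target:", p)
print("acceptance rate:", round(accept_rate, 3))
```

The empirical frequencies should land close to p even though every proposal came from q, and the acceptance rate should sit near the overlap of the two distributions (the sum of min(p, q), which is 0.6 for these toy values).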

Where the speedup comes from

The win is amortization. One expensive target-model pass now confirms several tokens instead of one. If the draft model proposes four tokens and three are accepted, the target's correction supplies a fourth, so you get four tokens for one target pass plus four cheap draft passes, where standard decoding would need four target passes. Across a long response those savings compound.

Acceptance rate is the lever that decides everything. The closer the draft model's predictions track the target model, the more tokens survive verification and the bigger the speedup. Predictable text (boilerplate, structured output, common phrasing) gets accepted at high rates. Genuinely novel or creative continuations are harder to guess, so acceptance drops and the gain shrinks.
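To see how strongly acceptance rate drives the result, here is a small sketch of the expected number of tokens emitted per target pass. It assumes, as a simplification, that each draft token is accepted independently with the same probability `alpha`; real acceptance varies by position and by prompt. With `k` draft tokens per pass, the expectation is (1 - alpha^(k+1)) / (1 - alpha).

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens per target pass: accepted draft tokens plus one from the target.

    Assumes each of the k draft tokens is accepted independently with probability alpha.
    """
    if alpha >= 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


for alpha in (0.5, 0.7, 0.8, 0.9):
    tokens = expected_tokens_per_pass(alpha, k=4)
    print(f"acceptance {alpha:.1f}: {tokens:.2f} tokens per target pass")
```

Under this assumption, moving acceptance from 0.5 to 0.8 with four draft tokens raises the yield from about 1.9 to about 3.4 tokens per target pass, before accounting for the draft model's own cost.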

Tip

The draft model should be small enough to run far faster than the target, but accurate enough that its guesses are usually right. A common choice is a model from the same family, an order of magnitude smaller. Too large a draft model erases the speedup; too weak a draft model gets most of its guesses rejected.

Variants worth knowing

The original draft-and-verify scheme spawned a family of refinements:

  • Self-speculation: the target model drafts its own candidates using early layers or a lightweight head, removing the need for a separate draft model.
  • Multi-token prediction heads: extra output heads trained to predict several future tokens at once, used in some recent open-weight models.
  • N-gram and cache-based drafting: guessing the next tokens from recent context or a cache instead of running a model at all, which is nearly free when it works.

Here is how the main variants compare on what they need and where they shine:

| Variant | Needs a second model? | Best for |
| --- | --- | --- |
| Classic draft-and-verify | Yes, a small draft model | General-purpose, well-matched model families |
| Self-speculation | No, uses early layers/head | Avoiding a second model to manage |
| Multi-token prediction heads | No, trained into the model | Models built with MTP heads from the start |
| N-gram / cache drafting | No, no model at all | Repetitive, boilerplate-heavy output |

These approaches all chase the same goal: cheaper, more accurate guesses to push acceptance rates higher.
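As an illustration of the n-gram variant, the sketch below drafts tokens by looking for the most recent earlier occurrence of the last few tokens and proposing whatever followed it. The function name and parameters are hypothetical, and production implementations differ in how they search and what they cache; the proposals would still go through the target model's verification pass.

```python
from typing import List


def ngram_draft(tokens: List[str], n: int = 3, k: int = 4) -> List[str]:
    """Propose up to k tokens by matching the last n tokens against earlier context."""
    if len(tokens) < n:
        return []
    tail = tokens[-n:]
    # Scan backwards over earlier positions, skipping the tail's own position.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == tail:
            return tokens[start + n:start + n + k]   # what followed last time
    return []                                        # no match: nothing to propose


context = "the cat sat on the mat . the cat sat".split()
print(ngram_draft(context, n=2, k=3))  # prints: ['on', 'the', 'mat']
```

No model runs during drafting here, which is why this variant is nearly free when the output repeats earlier context and contributes nothing when it does not.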

Turning it on

You rarely implement this yourself. The major serving frameworks ship it:

  • vLLM exposes speculative decoding through configuration, including n-gram and draft-model modes.
  • TensorRT-LLM and SGLang support it for high-throughput production serving.

Pairing it with the right serving engine matters; if you are still choosing one, "vLLM vs Ollama vs llama.cpp" walks through the trade-offs. And because speculative decoding interacts with model size, the compression techniques in LLM quantization stack neatly on top for even faster inference.

What to do to enable it

If you run your own inference and want the speedup, work through this in order:

  • Confirm your serving framework supports it: vLLM, SGLang, and TensorRT-LLM all do as of 2026.
  • For a quick, zero-extra-model win on repetitive output, try n-gram or prompt-lookup drafting first; it is nearly free.
  • For broader gains, pick a draft model from the same family, roughly an order of magnitude smaller than your target.
  • Measure your acceptance rate on real traffic, not a benchmark; that number predicts your actual speedup.
  • If acceptance is low, your workload may be too creative for a separate draft model; lean on self-speculation or multi-token heads instead.
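For the measurement step in the checklist above, a minimal sketch of the bookkeeping might look like this. The log format is hypothetical: one `(proposed, accepted)` pair per target verification pass. Serving frameworks expose their own metrics, so in practice you would read these counts from whatever your engine reports.

```python
from typing import Dict, List, Tuple


def acceptance_stats(steps: List[Tuple[int, int]]) -> Dict[str, float]:
    """Summarize (proposed, accepted) counts, one pair per target verification pass."""
    proposed = sum(p for p, _ in steps)
    accepted = sum(a for _, a in steps)
    passes = len(steps)
    return {
        "acceptance_rate": accepted / proposed if proposed else 0.0,
        # Each pass emits its accepted draft tokens plus one token from the target.
        "tokens_per_pass": (accepted + passes) / passes if passes else 0.0,
    }


# Hypothetical counts from five verification passes with four draft tokens each.
steps = [(4, 4), (4, 2), (4, 3), (4, 0), (4, 4)]
stats = acceptance_stats(steps)
print(stats)  # prints: {'acceptance_rate': 0.65, 'tokens_per_pass': 3.6}
```

Collecting these counts per route or per prompt type, rather than as a single global average, makes it easier to see which traffic actually benefits.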

A common mistake is reaching for too large a draft model. If the draft is half the size of the target, the draft passes themselves eat the savings even when acceptance is high. The sweet spot is a draft that is fast enough to be almost free yet accurate enough to clear the verification bar most of the time.
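The effect of draft size can be made concrete with a simple cost model. This sketch assumes each draft token is accepted independently with probability `alpha`, that one draft pass costs `cost_ratio` times one target pass, and that verifying `k` tokens costs the same as one ordinary target pass. All three are simplifications, and treating a half-size draft as `cost_ratio = 0.5` is itself an assumption about how cost scales with size.

```python
def estimated_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Estimated speedup over plain decoding for k draft tokens per verification pass.

    cost_ratio = (time of one draft pass) / (time of one target pass).
    """
    if alpha >= 1.0:
        expected_tokens = k + 1.0
    else:
        expected_tokens = (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
    cycle_cost = k * cost_ratio + 1.0   # k draft passes plus one target pass
    return expected_tokens / cycle_cost


for cost_ratio in (0.05, 0.1, 0.5):
    s = estimated_speedup(alpha=0.8, k=4, cost_ratio=cost_ratio)
    print(f"draft cost {cost_ratio:.2f}x target: about {s:.2f}x speedup")
```

With 80% acceptance and four draft tokens, this model gives roughly 2.8x when the draft costs 5% of a target pass and only about 1.1x when it costs 50%, which is the "draft eats the savings" effect described above.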

Where it does not help much

Speculative decoding reduces the per-request latency of generating tokens, but it does not increase raw throughput on a fully saturated server the way batching does, and it adds memory overhead for the draft model. On heavily batched, throughput-bound workloads the win shrinks, because the target model's compute is already well utilized. The technique shines most on latency-sensitive, single-stream or lightly batched serving, exactly where users are waiting on a streaming response.

Frequently asked questions

Does speculative decoding reduce output quality?

No. Accepted tokens are guaranteed to follow the target model's exact probability distribution, so the output distribution is mathematically identical to standard decoding. This is a provable property, not a heuristic trade-off.

How much faster is it really?

Typical production speedups land in the 2-4x range, driven by acceptance rates that often exceed 80%. The actual number depends on how predictable your workload is and how well-matched your draft and target models are.

Do I need a second model to use it?

Usually, but not always. Classic speculative decoding pairs a small draft model with a large target model. Self-speculation and multi-token-prediction variants let a single model draft for itself, avoiding the second model entirely.

Why isn't the speedup the same for every prompt?

Because it depends on acceptance rate. Repetitive or structured text is easy for the draft model to predict, so more tokens get accepted and you go faster. Highly creative or unusual text is harder to guess, so acceptance falls and the gain is smaller.
