Speculative Decoding
A small draft model predicts several tokens ahead. The main model verifies them in one pass. When the draft is right, you skip ahead. When it is wrong, you only wasted one draft step.
How it works
Normal autoregressive decoding generates one token at a time. Each token requires a full forward pass through the model. For a 30B parameter model, that means reading ~30 GB of weights from memory per token. This is slow because inference is memory-bandwidth bound.
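To see why bandwidth is the ceiling, here is a back-of-envelope calculation. The 400 GB/s bandwidth figure below is an assumed number for illustration, not a measurement of any particular GPU:

```python
# Back-of-envelope: decoding speed ceiling when memory-bandwidth bound.
# Both numbers are illustrative assumptions.
weights_gb = 30.0        # ~30B params at 8-bit precision
bandwidth_gb_s = 400.0   # hypothetical GPU memory bandwidth

# Each decoded token must stream all weights from memory once,
# so the best case is bandwidth / model size tokens per second.
max_tok_s = bandwidth_gb_s / weights_gb
print(f"bandwidth-bound ceiling: ~{max_tok_s:.1f} tok/s")
```

No amount of extra compute helps here; the only way past the ceiling is to decode more than one token per weight read, which is exactly what speculative decoding does.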
Speculative decoding breaks this bottleneck by using a small, fast draft model to predict multiple tokens ahead. The main model then verifies all of them in a single forward pass (which is nearly free — the bottleneck is bandwidth, and batch verification reads the weights only once).
```
// Step 1: Draft model predicts N tokens
draft  = ["The", "quick", "brown", "fox"]

// Step 2: Main model verifies all of them in one forward pass
verify = ["The", "quick", "brown", "dog"]

// Step 3: Accept the matching prefix, reject the rest
accepted = ["The", "quick", "brown"]  // 3 tokens in 1 step!

// Step 4: Resample "dog" from the main model, then continue drafting
```
The key insight: verification is cheap (one forward pass regardless of how many tokens you check), and when the draft is correct, you skip multiple decoding steps. The output is mathematically identical to normal decoding — speculative decoding never changes the output distribution.
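The accept/reject step can be sketched in a few lines. This toy version uses exact token matching; real implementations verify against the main model's probability distribution with rejection sampling, which is what preserves the output distribution:

```python
def speculative_step(draft_tokens, main_tokens):
    """Toy verification: accept the longest prefix where the draft
    agrees with the main model, then take the main model's token at
    the first mismatch. Real systems compare distributions, not strings."""
    accepted = []
    for d, m in zip(draft_tokens, main_tokens):
        if d == m:
            accepted.append(d)
        else:
            # First disagreement: keep the main model's token instead.
            accepted.append(m)
            break
    return accepted

# Draft proposed 4 tokens; the verification pass disagrees on the last one.
draft = ["The", "quick", "brown", "fox"]
verify = ["The", "quick", "brown", "dog"]
print(speculative_step(draft, verify))  # 4 tokens from one forward pass
```

Note that even a mismatch is not a total loss: the verification pass always yields at least one correct token, the same one normal decoding would have produced.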
Eagle3: draft heads instead of draft models
Classic speculative decoding uses a separate small model as the draft. Eagle3 takes a different approach: it adds lightweight "draft heads" directly to the main model. These heads reuse the main model's hidden states and predict future tokens with minimal extra computation.
Advantages of draft heads
1. No separate model to load — the draft heads are tiny (a few MB) and share the main model's embeddings
2. Better acceptance rates — the draft heads see the main model's actual hidden states, not an approximation
3. Simpler deployment — one model, one process, no orchestration between draft and main
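Conceptually, a draft head is just a small extra projection that maps the main model's hidden state to next-token logits. The sketch below is a toy illustration with random weights and made-up dimensions, not Eagle3's actual architecture (which chains several predictions and reuses the backbone's embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, vocab = 64, 1000  # toy sizes, far smaller than a real model

# A "draft head": one small linear layer over the main model's hidden state.
W = rng.standard_normal((hidden_dim, vocab)).astype(np.float32)

def draft_head(hidden_state):
    logits = hidden_state @ W          # (vocab,)
    return int(np.argmax(logits))      # predicted next token id

h = rng.standard_normal(hidden_dim).astype(np.float32)  # stand-in hidden state
token = draft_head(h)
print(f"head parameters: {W.size:,} (~{W.size * 4 / 1e6:.2f} MB at fp32)")
```

Even scaled up to real hidden sizes, the head's parameter count is negligible next to a 30B backbone, which is why the memory cost is measured in megabytes rather than gigabytes.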
When it helps (and when it hurts)
Speculative decoding is not a universal win. Whether it helps depends entirely on your serving scenario.
Single-stream / low concurrency
When you have one user waiting for a response, speculative decoding reduces their wall-clock time. The GPU is underutilized anyway, so the extra draft computation is effectively free.
Predictable output patterns
Code generation, structured output (JSON/YAML), and boilerplate text have high acceptance rates because the draft model can predict these patterns accurately.
High concurrency / throughput-optimized
When serving many concurrent requests, the GPU is already fully utilized. Draft computation competes for the same memory bandwidth and compute, reducing total throughput. Per-request latency may look similar, but aggregate throughput drops, so you serve fewer requests overall.
Low acceptance rates
Creative writing, novel reasoning, or domains where the draft model cannot predict the main model's output. Every rejected draft is wasted computation.
Real example: Eagle3 on GB10
We benchmarked Qwen3-coder-30B with Eagle3 draft heads on NVIDIA GB10. The results show the classic speculative decoding tradeoff clearly:
| Config | Single-stream | Parallel (8 req) |
|---|---|---|
| Baseline (no spec) | ~25 tok/s | 316 tok/s |
| Eagle3 | 35 tok/s (+40%) | 270 tok/s (-15%) |
The concurrency tradeoff
Eagle3 gives a massive +40% boost for single-stream use (like a developer waiting for code completion). But at 8 concurrent requests, total throughput drops from 316 to 270 tok/s. If you are running an API server with many users, you may want to disable speculative decoding and maximize aggregate throughput instead.
The acceptance rate metric
Acceptance rate is the percentage of draft tokens that the main model accepts. It is the single most important metric for speculative decoding performance.
80-90%+ acceptance: Excellent. Common for code generation, structured output, and repetitive patterns. You are getting 3-4 tokens per verification step.
60-80% acceptance: Good. Typical for general text. Still a net win for single-stream latency.
Below 60%: Marginal. The draft overhead may negate the speedup. Consider disabling speculative decoding for this workload.
In vLLM, you can monitor acceptance rate in the metrics output. If it consistently drops below 60% for your workload, switch to a baseline config without speculative decoding.
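The payoff from a given acceptance rate compounds with draft length. Under the simplifying assumption that each draft token is accepted independently with probability a, the expected number of tokens per verification step with draft length n is (1 - a^(n+1)) / (1 - a), the standard geometric-series estimate from the speculative sampling literature. A quick check of the thresholds above:

```python
def expected_tokens(accept_rate, draft_len):
    """Expected tokens produced per verification step, assuming each
    draft token is accepted independently with probability accept_rate
    (geometric-series estimate; real acceptances are correlated)."""
    a, n = accept_rate, draft_len
    return (1 - a ** (n + 1)) / (1 - a)

for a in (0.9, 0.8, 0.6, 0.5):
    print(f"accept={a:.0%}: {expected_tokens(a, 4):.2f} tokens/step")
```

At 80-90% acceptance this lands in the 3-4 tokens-per-step range quoted above, while at 50-60% the expected gain shrinks toward 2 tokens per step, which is where draft overhead starts to eat the speedup.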
Configuring speculative decoding in vLLM
For Eagle3 draft heads, the vLLM configuration is straightforward:
```yaml
# serving_card.yaml
speculative_config:
  method: eagle3
  model: yuhuili/Eagle3-Qwen3-30B-A3B
  # num_speculative_tokens controls the draft length.
  # Higher = more aggressive speculation.
  # 4 is a good default; increase to 6-8 for code.
  num_speculative_tokens: 4
```
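Recent vLLM versions also accept an equivalent configuration as a JSON string on the command line. The flag name and model ID below are assumptions; check them against your vLLM release's documentation:

```bash
# Hypothetical launch command; verify flag names for your vLLM version.
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --speculative-config '{"method": "eagle3", "model": "yuhuili/Eagle3-Qwen3-30B-A3B", "num_speculative_tokens": 4}'
```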
Related serving cards
See speculative decoding benchmarks on real hardware: