Speculative Decoding
A small draft model predicts several tokens ahead. The main model verifies them in one pass. When the draft is right, you skip ahead. When it is wrong, you only wasted one draft step.
How it works
Normal autoregressive decoding generates one token at a time. Each token requires a full forward pass through the model. For a 30B parameter model, that means reading ~30 GB of weights from memory per token. This is slow because inference is memory-bandwidth bound.
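To see why bandwidth is the ceiling, here is a back-of-envelope calculation. The 400 GB/s bandwidth figure below is an assumed number for illustration, not a measurement of any particular GPU:

```python
# Back-of-envelope: decoding speed ceiling when memory-bandwidth bound.
# Both numbers are illustrative assumptions.
weights_gb = 30.0        # ~30B params at 8-bit precision
bandwidth_gb_s = 400.0   # hypothetical GPU memory bandwidth

# Each decoded token must stream all weights from memory once,
# so the best case is bandwidth / model size tokens per second.
max_tok_s = bandwidth_gb_s / weights_gb
print(f"bandwidth-bound ceiling: ~{max_tok_s:.1f} tok/s")
```

No amount of extra compute helps here; the only way past the ceiling is to decode more than one token per weight read, which is exactly what speculative decoding does.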
Speculative decoding breaks this bottleneck by using a small, fast draft model to predict multiple tokens ahead. The main model then verifies all of them in a single forward pass (which is nearly free — the bottleneck is bandwidth, and batch verification reads the weights only once).
```
// Step 1: Draft model predicts N tokens
draft  = ["The", "quick", "brown", "fox"]

// Step 2: Main model verifies all of them in one forward pass
verify = ["The", "quick", "brown", "dog"]

// Step 3: Accept the matching prefix, reject the rest
accepted = ["The", "quick", "brown"]  // 3 tokens in 1 step!

// Step 4: Resample "dog" from the main model, then continue drafting
```
The key insight: verification is cheap (one forward pass regardless of how many tokens you check), and when the draft is correct, you skip multiple decoding steps. The output is mathematically identical to normal decoding — speculative decoding never changes the output distribution.
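The accept/reject step can be sketched in a few lines. This toy version uses exact token matching; real implementations verify against the main model's probability distribution with rejection sampling, which is what preserves the output distribution:

```python
def speculative_step(draft_tokens, main_tokens):
    """Toy verification: accept the longest prefix where the draft
    agrees with the main model, then take the main model's token at
    the first mismatch. Real systems compare distributions, not strings."""
    accepted = []
    for d, m in zip(draft_tokens, main_tokens):
        if d == m:
            accepted.append(d)
        else:
            # First disagreement: keep the main model's token instead.
            accepted.append(m)
            break
    return accepted

# Draft proposed 4 tokens; the verification pass disagrees on the last one.
draft = ["The", "quick", "brown", "fox"]
verify = ["The", "quick", "brown", "dog"]
print(speculative_step(draft, verify))  # 4 tokens from one forward pass
```

Note that even a mismatch is not a total loss: the verification pass always yields at least one correct token, the same one normal decoding would have produced.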
Eagle3: draft heads instead of draft models
Classic speculative decoding uses a separate small model as the draft. Eagle3 takes a different approach: it adds lightweight "draft heads" directly to the main model. These heads reuse the main model's hidden states and predict future tokens with minimal extra computation.
Advantages of draft heads
1. No separate model to load — the draft heads are tiny (a few MB) and share the main model's embeddings
2. Better acceptance rates — the draft heads see the main model's actual hidden states, not an approximation
3. Simpler deployment — one model, one process, no orchestration between draft and main
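Conceptually, a draft head is just a small extra projection that maps the main model's hidden state to next-token logits. The sketch below is a toy illustration with random weights and made-up dimensions, not Eagle3's actual architecture (which chains several predictions and reuses the backbone's embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, vocab = 64, 1000  # toy sizes, far smaller than a real model

# A "draft head": one small linear layer over the main model's hidden state.
W = rng.standard_normal((hidden_dim, vocab)).astype(np.float32)

def draft_head(hidden_state):
    logits = hidden_state @ W          # (vocab,)
    return int(np.argmax(logits))      # predicted next token id

h = rng.standard_normal(hidden_dim).astype(np.float32)  # stand-in hidden state
token = draft_head(h)
print(f"head parameters: {W.size:,} (~{W.size * 4 / 1e6:.2f} MB at fp32)")
```

Even scaled up to real hidden sizes, the head's parameter count is negligible next to a 30B backbone, which is why the memory cost is measured in megabytes rather than gigabytes.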
When it helps (and when it hurts)
Speculative decoding is not a universal win. Whether it helps depends entirely on your serving scenario.
Single-stream / low concurrency
When you have one user waiting for a response, speculative decoding reduces their wall-clock time. The GPU is underutilized anyway, so the extra draft computation is effectively free.
Predictable output patterns
Code generation, structured output (JSON/YAML), and boilerplate text have high acceptance rates because the draft model can predict these patterns accurately.
High concurrency / throughput-optimized
When serving many concurrent requests, the GPU is already fully utilized. Draft computation competes for the same memory bandwidth and compute, reducing total throughput. Per-request latency may look similar, but aggregate throughput drops, so you serve fewer requests overall.
Low acceptance rates
Creative writing, novel reasoning, or domains where the draft model cannot predict the main model's output. Every rejected draft is wasted computation.
Real example: Eagle3 on GB10
We benchmarked Qwen3-coder-30B with Eagle3 draft heads on NVIDIA GB10. The results show the classic speculative decoding tradeoff clearly:
| Config | Single-stream | Parallel (8 req) |
|---|---|---|
| Baseline (no spec) | ~25 tok/s | 316 tok/s |
| Eagle3 | 35 tok/s (+40%) | 270 tok/s (-15%) |
The concurrency tradeoff
Eagle3 gives a massive +40% boost for single-stream use (like a developer waiting for code completion). But at 8 concurrent requests, total throughput drops from 316 to 270 tok/s. If you are running an API server with many users, you may want to disable speculative decoding and maximize aggregate throughput instead.
The acceptance rate metric
Acceptance rate is the percentage of draft tokens that the main model accepts. It is the single most important metric for speculative decoding performance.
80-90%+ acceptance: Excellent. Common for code generation, structured output, and repetitive patterns. You are getting 3-4 tokens per verification step.
60-80% acceptance: Good. Typical for general text. Still a net win for single-stream latency.
Below 60%: Marginal. The draft overhead may negate the speedup. Consider disabling speculative decoding for this workload.
In vLLM, you can monitor acceptance rate in the metrics output. If it consistently drops below 60% for your workload, switch to a baseline config without speculative decoding.
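The payoff from a given acceptance rate compounds with draft length. Under the simplifying assumption that each draft token is accepted independently with probability a, the expected number of tokens per verification step with draft length n is (1 - a^(n+1)) / (1 - a), the standard geometric-series estimate from the speculative sampling literature. A quick check of the thresholds above:

```python
def expected_tokens(accept_rate, draft_len):
    """Expected tokens produced per verification step, assuming each
    draft token is accepted independently with probability accept_rate
    (geometric-series estimate; real acceptances are correlated)."""
    a, n = accept_rate, draft_len
    return (1 - a ** (n + 1)) / (1 - a)

for a in (0.9, 0.8, 0.6, 0.5):
    print(f"accept={a:.0%}: {expected_tokens(a, 4):.2f} tokens/step")
```

At 80-90% acceptance this lands in the 3-4 tokens-per-step range quoted above, while at 50-60% the expected gain shrinks toward 2 tokens per step, which is where draft overhead starts to eat the speedup.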
Configuring speculative decoding in vLLM
For Eagle3 draft heads, the vLLM configuration is straightforward:
```yaml
# serving_card.yaml
speculative_config:
  method: eagle3
  model: yuhuili/Eagle3-Qwen3-30B-A3B
  # num_speculative_tokens controls the draft length.
  # Higher = more aggressive speculation.
  # 4 is a good default; increase to 6-8 for code.
  num_speculative_tokens: 4
```
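Recent vLLM versions also accept an equivalent configuration as a JSON string on the command line. The flag name and model ID below are assumptions; check them against your vLLM release's documentation:

```bash
# Hypothetical launch command; verify flag names for your vLLM version.
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --speculative-config '{"method": "eagle3", "model": "yuhuili/Eagle3-Qwen3-30B-A3B", "num_speculative_tokens": 4}'
```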
Related serving cards
See speculative decoding benchmarks on real hardware: