Speculative Decoding
Speed: 2-3x for autoregressive generation
TL;DR
Use a small draft model to predict tokens, then verify them with the large model. 2-3x faster with identical outputs.
Use when
- Latency-sensitive applications
- Large-model inference
- Spare VRAM for the draft model
Skip when
- Already memory-constrained
- High batch sizes (less benefit)
Speculative decoding uses a small, fast model to generate draft tokens, then verifies them with the large model in a single forward pass. Accepted tokens are free; rejected ones are regenerated.
How It Works
1. Draft model generates K tokens quickly
2. Target model verifies all K tokens in one pass
3. Accept matching tokens, regenerate from the first mismatch
4. Repeat
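The loop above can be sketched in Python. This is a greedy toy version with stand-in `draft_next`/`target_next` functions (hypothetical, not a real API); production implementations score all K positions in one batched forward pass and use rejection sampling so that sampled outputs match the target model's distribution exactly.

```python
def speculative_step(draft_next, target_next, prefix, k):
    """One round of greedy speculative decoding.

    draft_next / target_next are stand-ins for the small and large
    models: each maps a token sequence to its predicted next token.
    """
    # 1. Draft model proposes k tokens autoregressively.
    draft, seq = [], list(prefix)
    for _ in range(k):
        t = draft_next(seq)
        draft.append(t)
        seq.append(t)

    # 2. Target model "verifies" each drafted position. A real
    #    implementation checks all k positions in a single forward
    #    pass; here we query the stand-in position by position.
    accepted, seq = [], list(prefix)
    for t in draft:
        if target_next(seq) == t:
            accepted.append(t)  # match: token accepted for free
            seq.append(t)
        else:
            break               # first mismatch: stop accepting

    # 3. The target's own prediction replaces the first rejected token
    #    (or adds a bonus token if all k matched), so every round
    #    emits at least one token.
    accepted.append(target_next(seq))
    return prefix + accepted
```

With toy models that agree on the first two continuations and then diverge, one step accepts the two matching draft tokens and appends the target's correction, emitting three tokens for a single (simulated) target pass.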
Key Benefits
- **Speed**: 2-3x faster generation
- **Quality**: Mathematically identical outputs
- **Memory**: Small draft model overhead
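The speedup can be estimated with a simple acceptance model from the speculative decoding literature: if each draft token is accepted independently with probability α, the expected number of tokens emitted per target-model pass with K draft tokens is (1 − α^(K+1)) / (1 − α). A quick sketch (the independence assumption is a simplification):

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model forward pass, assuming
    each draft token is accepted independently with probability alpha.
    """
    if alpha == 0.0:
        return 1.0  # no drafts accepted: one target token per pass
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# With an 80% acceptance rate and 5 draft tokens, each large-model
# pass yields roughly 3.7 tokens instead of 1.
print(round(expected_tokens_per_pass(0.8, 5), 2))  # → 3.69
```

This is why the technique helps most when the draft model tracks the target closely (high α) and why gains shrink as acceptance drops.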
Code Examples
Speculative decoding with vLLM (argument names vary across vLLM versions; check your installed version's docs):

```python
from vllm import LLM, SamplingParams

# The target model verifies; the smaller draft model proposes tokens.
model = LLM(
    model="meta-llama/Llama-2-70b-hf",
    speculative_model="meta-llama/Llama-2-7b-hf",  # draft model
    num_speculative_tokens=5,  # tokens drafted per verification step
)

outputs = model.generate(["Once upon a time"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```