Speculative Decoding
Speed: 2-3x for autoregressive generation
TL;DR
Use a small draft model to predict tokens, then verify them with the large model. 2-3x faster with identical outputs.
Use when
- Latency-sensitive applications
- Large-model inference
- Spare VRAM for the draft model
Skip when
- Already memory-constrained
- High batch sizes (less benefit)
Speculative decoding uses a small, fast model to generate draft tokens, then verifies them with the large model in a single forward pass. Accepted tokens are free; rejected ones are regenerated.
How It Works
1. Draft model generates K tokens quickly
2. Target model verifies all K tokens in one pass
3. Accept matching tokens, regenerate from the first mismatch
4. Repeat
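The loop above can be sketched in Python. This is a greedy toy version with stand-in `draft_next`/`target_next` functions (hypothetical, not a real API); production implementations score all K positions in one batched forward pass and use rejection sampling so that sampled outputs match the target model's distribution exactly.

```python
def speculative_step(draft_next, target_next, prefix, k):
    """One round of greedy speculative decoding.

    draft_next / target_next are stand-ins for the small and large
    models: each maps a token sequence to its predicted next token.
    """
    # 1. Draft model proposes k tokens autoregressively.
    draft, seq = [], list(prefix)
    for _ in range(k):
        t = draft_next(seq)
        draft.append(t)
        seq.append(t)

    # 2. Target model "verifies" each drafted position. A real
    #    implementation checks all k positions in a single forward
    #    pass; here we query the stand-in position by position.
    accepted, seq = [], list(prefix)
    for t in draft:
        if target_next(seq) == t:
            accepted.append(t)  # match: token accepted for free
            seq.append(t)
        else:
            break               # first mismatch: stop accepting

    # 3. The target's own prediction replaces the first rejected token
    #    (or adds a bonus token if all k matched), so every round
    #    emits at least one token.
    accepted.append(target_next(seq))
    return prefix + accepted
```

With toy models that agree on the first two continuations and then diverge, one step accepts the two matching draft tokens and appends the target's correction, emitting three tokens for a single (simulated) target pass.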
Key Benefits
- **Speed**: 2-3x faster generation
- **Quality**: Mathematically identical outputs
- **Memory**: Small draft model overhead
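The speedup can be estimated with a simple acceptance model from the speculative decoding literature: if each draft token is accepted independently with probability α, the expected number of tokens emitted per target-model pass with K draft tokens is (1 − α^(K+1)) / (1 − α). A quick sketch (the independence assumption is a simplification):

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model forward pass, assuming
    each draft token is accepted independently with probability alpha.
    """
    if alpha == 0.0:
        return 1.0  # no drafts accepted: one target token per pass
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# With an 80% acceptance rate and 5 draft tokens, each large-model
# pass yields roughly 3.7 tokens instead of 1.
print(round(expected_tokens_per_pass(0.8, 5), 2))  # → 3.69
```

This is why the technique helps most when the draft model tracks the target closely (high α) and why gains shrink as acceptance drops.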
Code Examples
Speculative decoding with vLLM (argument names vary across vLLM versions; check your installed version's docs):

```python
from vllm import LLM, SamplingParams

# The target model verifies; the smaller draft model proposes tokens.
model = LLM(
    model="meta-llama/Llama-2-70b-hf",
    speculative_model="meta-llama/Llama-2-7b-hf",  # draft model
    num_speculative_tokens=5,  # tokens drafted per verification step
)

outputs = model.generate(["Once upon a time"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```