Chunked Prefill

Speed: 2-5x latency reduction for short requests

TL;DR

Split long prompts into chunks and interleave with decode steps. Reduces latency for short requests in mixed workloads.

Use when

  • Mixed long/short request workloads
  • Latency-sensitive applications
  • Using vLLM or SGLang

Skip when

  • Uniform request lengths
  • Throughput-only optimization

Chunked Prefill breaks long prompt processing into smaller chunks, interleaving them with decode steps from other requests. This prevents long prompts from blocking short requests.

How It Works

Instead of processing an entire prompt at once (which can take seconds for long contexts), the prefill is split into fixed-size chunks:

1. Process a chunk of the long prompt
2. Run decode steps for other requests
3. Process the next chunk
4. Repeat until the prompt is fully prefilled
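The interleaving loop above can be sketched as a toy scheduler. This is an illustrative simulation, not vLLM's actual scheduler; `CHUNK_SIZE`, `schedule`, and the trace format are made up for this example.

```python
from collections import deque

CHUNK_SIZE = 4  # tokens of prefill processed per scheduler step (illustrative)

def schedule(prompt_tokens, decode_queue):
    """Interleave prefill chunks with decode steps; return the step trace."""
    trace = []
    remaining = deque(prompt_tokens)
    while remaining:
        # 1. Process one chunk of the long prompt's prefill.
        chunk = [remaining.popleft() for _ in range(min(CHUNK_SIZE, len(remaining)))]
        trace.append(("prefill", len(chunk)))
        # 2. Run one decode step per active request before the next chunk,
        #    so short requests keep making progress.
        for req in decode_queue:
            trace.append(("decode", req))
    return trace

trace = schedule(list(range(10)), decode_queue=["req-A", "req-B"])
print(trace)
```

With a 10-token prompt and a chunk size of 4, the prefill is split into three chunks (4 + 4 + 2), and both decoding requests get a step between each chunk instead of waiting for the whole prompt.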

Key Benefits

- **Fairness**: Short requests don't wait behind long prompts
- **Latency**: More predictable response times
- **Throughput**: Better GPU utilization in mixed workloads

Code Examples

Enable chunked prefill in vLLM:

```python
from vllm import LLM

# enable_chunked_prefill splits long-prompt prefills into chunks;
# max_num_batched_tokens caps the tokens processed per scheduler step,
# which effectively sets the chunk size.
model = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,  # chunk size
)
```
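
SGLang exposes a similar knob on its server launcher. The flag below reflects our understanding of the SGLang CLI and may differ by version; check `--help` for your install.

```shell
# Launch an SGLang server with a 2048-token prefill chunk size
# (--chunked-prefill-size is assumed; verify against your SGLang version).
python -m sglang.launch_server \
    --model-path meta-llama/Llama-2-7b-hf \
    --chunked-prefill-size 2048
```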