Paged Attention
TL;DR
Virtual memory for the KV-cache: attention keys and values are stored in fixed-size blocks instead of one contiguous buffer. Enables efficient batching and eliminates memory fragmentation. The core innovation in vLLM.
Use when
- High-throughput serving
- Variable-length batches
- Using vLLM
Skip when
- Single request at a time
- Not using vLLM-based serving
Paged Attention manages KV-cache memory like an operating system manages virtual memory. It eliminates fragmentation and enables efficient memory sharing across requests.
How It Works
Instead of allocating the KV-cache contiguously, Paged Attention stores the cache in fixed-size blocks. A block table maps logical cache positions to physical memory, enabling:

- **No Fragmentation**: Only allocate the blocks that are actually needed
- **Memory Sharing**: Reuse blocks for common prefixes across requests
- **Dynamic Allocation**: Grow the cache block by block as generation proceeds
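The block-table idea can be illustrated with a minimal sketch. This is not vLLM's actual implementation; the class names, the `BLOCK_SIZE` value, and the allocator are all simplified assumptions, but the logical-to-physical mapping works the same way:

```python
# Hypothetical sketch of a Paged-Attention-style block table: logical
# KV-cache positions map to fixed-size physical blocks allocated on demand.

BLOCK_SIZE = 16  # tokens per block (illustrative value)

class Allocator:
    """Hands out physical block ids; freed blocks are reused, so there
    is no external fragmentation."""
    def __init__(self):
        self.next_id = 0
        self.free = []

    def alloc(self):
        if self.free:
            return self.free.pop()
        self.next_id += 1
        return self.next_id - 1

class BlockTable:
    """Per-sequence mapping from logical token positions to physical blocks."""
    def __init__(self):
        self.blocks = []      # physical block ids, in logical order
        self.num_tokens = 0

    def append_token(self, allocator):
        # Allocate a new physical block only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(allocator.alloc())
        self.num_tokens += 1

    def physical_slot(self, logical_pos):
        # Translate a logical position to (physical block, offset in block).
        return self.blocks[logical_pos // BLOCK_SIZE], logical_pos % BLOCK_SIZE

table = BlockTable()
allocator = Allocator()
for _ in range(40):               # cache 40 tokens
    table.append_token(allocator)

print(len(table.blocks))          # 3 blocks: ceil(40 / 16)
print(table.physical_slot(17))    # (1, 1): second block, offset 1
```

Because allocation happens one fixed-size block at a time, at most one partially filled block per sequence is ever wasted, instead of a whole over-provisioned contiguous region.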
Key Benefits
- **Throughput**: 2-4x larger batch sizes
- **Efficiency**: Near-zero memory waste
- **Flexibility**: Variable sequence lengths
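The memory-sharing benefit comes from letting multiple sequences point at the same physical blocks, tracked with a reference count. A hedged sketch, with illustrative names only (not vLLM's API):

```python
# Hypothetical sketch of prefix sharing: two requests with the same prompt
# reuse the prompt's physical blocks instead of copying the KV-cache.

from collections import defaultdict

refcount = defaultdict(int)  # physical block id -> number of users

def share_prefix(src_blocks):
    # A new sequence's block table starts by referencing, not copying,
    # the blocks that hold the shared prompt.
    for block in src_blocks:
        refcount[block] += 1
    return list(src_blocks)

prompt_blocks = [0, 1]        # prompt KV-cache occupies 2 physical blocks
for block in prompt_blocks:
    refcount[block] = 1

seq_a = share_prefix(prompt_blocks)
seq_b = share_prefix(prompt_blocks)

print(refcount[0])            # 3: the prompt plus two sharing sequences
```

When a sequence later writes into a shared block, a copy-on-write step would give it a private copy first; blocks whose refcount drops to zero return to the free pool.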
Code Examples
```python
from vllm import LLM, SamplingParams

# Paged Attention is enabled automatically; no extra configuration needed.
llm = LLM(model="meta-llama/Llama-2-7b-hf")

outputs = llm.generate(
    ["Hello, ", "How are "],
    SamplingParams(max_tokens=100),
)

for output in outputs:
    print(output.outputs[0].text)
```