Paged Attention

Speed: 2-4x throughput vs naive batching

TL;DR

Virtual memory for KV-cache. Enables efficient batching and eliminates memory fragmentation. Core innovation in vLLM.

Use when

  • High-throughput serving
  • Variable-length batches
  • Using vLLM

Skip when

  • Single request at a time
  • Not using vLLM-based serving

Paged Attention manages KV-cache memory like an operating system manages virtual memory. It eliminates fragmentation and enables efficient memory sharing across requests.

How It Works

Instead of allocating the KV-cache contiguously, Paged Attention stores it in fixed-size blocks. A block table maps logical cache positions to physical memory, enabling:

- **No Fragmentation**: Only allocate what's needed
- **Memory Sharing**: Reuse blocks for common prefixes
- **Dynamic Allocation**: Grow the cache as generation proceeds
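The block-table indirection can be sketched in a few lines of Python. This is a hypothetical illustration of the bookkeeping, not vLLM's actual implementation (which lives in optimized CUDA/C++ kernels); names like `BlockAllocator` and the block size of 16 are assumptions:

```python
BLOCK_SIZE = 16  # tokens per physical block (hypothetical value)

class BlockAllocator:
    """Pool of fixed-size physical KV-cache blocks."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop()

    def release(self, block_id):
        self.free.append(block_id)

class SequenceCache:
    """Per-request block table: logical block index -> physical block id."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one is full,
        # so a sequence never wastes more than one partial block.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def physical_slot(self, token_idx):
        # Translate a logical token position to (physical block, offset).
        block = self.block_table[token_idx // BLOCK_SIZE]
        return block, token_idx % BLOCK_SIZE

allocator = BlockAllocator(num_blocks=64)
seq = SequenceCache(allocator)
for _ in range(40):
    seq.append_token()
# 40 tokens at block size 16 occupy 3 blocks; only the last is partly empty.
```

Because the attention kernel looks tokens up through the block table, physical blocks can live anywhere in GPU memory, which is what removes the need for large contiguous allocations.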

Key Benefits

- **Throughput**: 2-4x higher batch sizes
- **Efficiency**: Near-zero memory waste
- **Flexibility**: Variable sequence lengths
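The memory-sharing benefit comes from letting several requests reference the same physical blocks for a common prompt prefix, tracked with reference counts. A minimal sketch, assuming invented names (`SharedBlockPool`, `share`, `release`) rather than vLLM's real API:

```python
class SharedBlockPool:
    """Fixed-size blocks with reference counting for prefix sharing.

    Hypothetical illustration, not vLLM's implementation.
    """
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def alloc(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        # A second sequence references the same physical block: no copy.
        self.refcount[block] += 1
        return block

    def release(self, block):
        # Block returns to the free pool only when no sequence uses it.
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)

pool = SharedBlockPool(num_blocks=8)

# Two requests with the same system prompt share its cached blocks.
prefix_blocks = [pool.alloc(), pool.alloc()]       # cached prompt prefix
seq_a = list(prefix_blocks)                        # request A owns them
seq_b = [pool.share(b) for b in prefix_blocks]     # request B shares, no copy

# When A finishes, its references are dropped; B still holds the blocks.
for b in seq_a:
    pool.release(b)
```

Sharing a long system prompt this way means its KV-cache is stored once per batch rather than once per request, which is where much of the batch-size headroom comes from.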

Code Examples

vLLM uses Paged Attention by default:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")
# Paged Attention is enabled automatically

outputs = llm.generate(
    ["Hello, ", "How are "],
    SamplingParams(max_tokens=100),
)
```