Prefix Caching

Speed: 2-10x for repeated prefixes

TL;DR

Cache and reuse KV-cache for shared prompt prefixes. Eliminates redundant computation for repeated context.

Use when

  • RAG pipelines
  • Shared system prompts
  • Multi-turn conversations

Skip when

  • Every request has unique context
  • Single-turn inference

Prefix caching stores the KV-cache for common prompt prefixes and reuses it across requests, so the shared tokens are prefilled once instead of once per request. This is particularly effective for RAG pipelines and shared system prompts, where many requests begin with identical tokens.
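The bookkeeping can be illustrated with a minimal sketch. This is not vLLM's implementation (real engines cache KV tensors in fixed-size blocks on the GPU); the `PrefixCache` class and `kv(...)` placeholder entries below are hypothetical, standing in for per-token KV tensors:

```python
# Illustrative sketch of prefix caching: KV state is keyed by the token
# prefix, and each request reuses the longest cached prefix it shares
# with earlier requests, "computing" only the uncached suffix.

class PrefixCache:
    def __init__(self):
        self._cache = {}   # tuple of prefix tokens -> list of KV entries
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, tokens):
        """Return the KV state for `tokens`, reusing cached prefixes."""
        # Find the longest cached prefix of the request.
        for end in range(len(tokens), 0, -1):
            cached = self._cache.get(tuple(tokens[:end]))
            if cached is not None:
                self.hits += 1
                break
        else:
            end, cached = 0, []
            self.misses += 1

        state = list(cached)
        # "Compute" KV entries only for the uncached suffix, caching each
        # intermediate prefix so later requests can reuse it too.
        for i in range(end, len(tokens)):
            state.append(f"kv({tokens[i]})")
            self._cache[tuple(tokens[: i + 1])] = list(state)
        return state
```

Two requests sharing a system prompt and retrieved document then differ only in the final question, and the second request recomputes only that suffix:

```python
cache = PrefixCache()
cache.get_or_compute(["sys", "doc", "q1"])   # full prefill, cache miss
cache.get_or_compute(["sys", "doc", "q2"])   # reuses ["sys", "doc"] prefix
```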

Use Cases

- **System Prompts**: Cache the system message
- **RAG**: Cache retrieved documents
- **Few-shot**: Cache example prompts

Key Benefits

- **Latency**: Skip recomputing the prefix
- **Throughput**: More capacity for unique content
- **Cost**: Reduce compute per request

Code Examples

Enable prefix caching in vLLM:

```python
from vllm import LLM

# With enable_prefix_caching=True, vLLM automatically detects shared
# prompt prefixes across requests and reuses their KV-cache blocks.
model = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_prefix_caching=True,
)
```