# Prefix Caching

**Speed:** 2-10x for repeated prefixes
## TL;DR

Cache and reuse the KV-cache for shared prompt prefixes, eliminating redundant prefill computation for repeated context.
## Use when

- RAG pipelines
- Shared system prompts
- Multi-turn conversations
## Skip when

- Every request has unique context
- Single-turn inference
Prefix caching stores the KV-cache computed for common prompt prefixes and reuses it across requests, so the expensive prefill over the shared portion runs only once. This is particularly effective for RAG pipelines and shared system prompts.
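The mechanism can be illustrated with a toy sketch (not vLLM's actual implementation): prefill results are keyed by the prefix tokens, so later requests with the same prefix pay only for their unique suffix. All names here (`PrefixCache`, `prefill_calls`) are illustrative.

```python
class PrefixCache:
    """Toy model of prefix caching: prefill once per distinct prefix."""

    def __init__(self):
        self._cache = {}        # prefix tokens -> simulated KV-cache
        self.prefill_calls = 0  # total tokens prefilled (proxy for compute)

    def _prefill(self, tokens):
        # Stand-in for attention prefill; cost scales with len(tokens).
        self.prefill_calls += len(tokens)
        return [("kv", t) for t in tokens]

    def run(self, prefix, suffix):
        key = tuple(prefix)
        if key not in self._cache:
            # Cache miss: pay the full prefill cost for the prefix once.
            self._cache[key] = self._prefill(prefix)
        # Only the request-specific suffix needs fresh prefill work.
        return self._cache[key] + self._prefill(suffix)


cache = PrefixCache()
system = list(range(100))      # shared 100-token system prompt
cache.run(system, [900, 901])  # first request prefills prefix + suffix
cache.run(system, [902, 903])  # second request reuses the cached prefix
# Total prefilled tokens: 100 + 2 + 2 = 104, instead of 204 without caching.
```

The second request touches only its two unique tokens, which is exactly the saving prefix caching provides at scale.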
## Use Cases

- **System Prompts**: Cache the system message
- **RAG**: Cache retrieved documents
- **Few-shot**: Cache example prompts
## Key Benefits

- **Latency**: Skip recomputing the prefix
- **Throughput**: More capacity for unique content
- **Cost**: Reduce compute per request
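The latency and cost benefits scale with how much of each request is shared prefix. A back-of-envelope calculation (with hypothetical token counts) makes this concrete:

```python
def prefill_savings(prefix_tokens: int, unique_tokens: int) -> float:
    """Fraction of prefill work skipped when the prefix is already cached."""
    total = prefix_tokens + unique_tokens
    return prefix_tokens / total

# Hypothetical request: 2000-token system prompt + 100-token user question.
savings = prefill_savings(2000, 100)
# About 95% of prefill tokens are served from cache on a hit.
```

The closer requests are to "large shared prefix, small unique suffix", the closer the prefill savings get to 100%.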
## Code Examples

Enable prefix caching in vLLM:

```python
from vllm import LLM

# Automatically reuse KV-cache blocks for shared prompt prefixes
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_prefix_caching=True,
)
```
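Under the hood, engines like vLLM match prefixes at the granularity of fixed-size KV-cache blocks, hashing each block together with the hash of the block before it so that a hash match implies the entire preceding prefix matches. The sketch below illustrates that chained-hash scheme; the block size and function names are illustrative, not vLLM's API.

```python
BLOCK = 4  # tokens per block (illustrative; real block sizes are larger)

def block_hashes(tokens):
    """Hash each full block chained with its predecessor's hash."""
    hashes = []
    prev = 0
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        prev = hash((prev, tuple(tokens[i:i + BLOCK])))
        hashes.append(prev)
    return hashes

def shared_blocks(a, b):
    """Count leading KV-cache blocks reusable between two token sequences."""
    count = 0
    for x, y in zip(block_hashes(a), block_hashes(b)):
        if x != y:
            break
        count += 1
    return count

# Two prompts sharing the first 8 tokens share 2 cached blocks.
req_a = list(range(12))
req_b = list(range(8)) + [99, 99, 99, 99]
```

Because each block's hash folds in the previous hash, a single lookup per block is enough to prove the whole prefix up to that block is identical, which keeps matching cheap at serving time.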