# Prefix Caching

**Speed:** 2-10x for repeated prefixes
## TL;DR

Cache and reuse the KV-cache for shared prompt prefixes, eliminating redundant prefill computation for repeated context.
## Use when

- RAG pipelines
- Shared system prompts
- Multi-turn conversations
## Skip when

- Every request has unique context
- Single-turn inference
Prefix caching stores the KV-cache computed for common prompt prefixes and reuses it across requests, so the expensive prefill over the shared portion runs only once. This is particularly effective for RAG pipelines and shared system prompts.
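The mechanism can be illustrated with a toy sketch (not vLLM's actual implementation): prefill results are keyed by the prefix tokens, so later requests with the same prefix pay only for their unique suffix. All names here (`PrefixCache`, `prefill_calls`) are illustrative.

```python
class PrefixCache:
    """Toy model of prefix caching: prefill once per distinct prefix."""

    def __init__(self):
        self._cache = {}        # prefix tokens -> simulated KV-cache
        self.prefill_calls = 0  # total tokens prefilled (proxy for compute)

    def _prefill(self, tokens):
        # Stand-in for attention prefill; cost scales with len(tokens).
        self.prefill_calls += len(tokens)
        return [("kv", t) for t in tokens]

    def run(self, prefix, suffix):
        key = tuple(prefix)
        if key not in self._cache:
            # Cache miss: pay the full prefill cost for the prefix once.
            self._cache[key] = self._prefill(prefix)
        # Only the request-specific suffix needs fresh prefill work.
        return self._cache[key] + self._prefill(suffix)


cache = PrefixCache()
system = list(range(100))      # shared 100-token system prompt
cache.run(system, [900, 901])  # first request prefills prefix + suffix
cache.run(system, [902, 903])  # second request reuses the cached prefix
# Total prefilled tokens: 100 + 2 + 2 = 104, instead of 204 without caching.
```

The second request touches only its two unique tokens, which is exactly the saving prefix caching provides at scale.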
## Use Cases

- **System Prompts**: Cache the system message
- **RAG**: Cache retrieved documents
- **Few-shot**: Cache example prompts
## Key Benefits

- **Latency**: Skip recomputing the prefix
- **Throughput**: More capacity for unique content
- **Cost**: Reduce compute per request
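The latency and cost benefits scale with how much of each request is shared prefix. A back-of-envelope calculation (with hypothetical token counts) makes this concrete:

```python
def prefill_savings(prefix_tokens: int, unique_tokens: int) -> float:
    """Fraction of prefill work skipped when the prefix is already cached."""
    total = prefix_tokens + unique_tokens
    return prefix_tokens / total

# Hypothetical request: 2000-token system prompt + 100-token user question.
savings = prefill_savings(2000, 100)
# About 95% of prefill tokens are served from cache on a hit.
```

The closer requests are to "large shared prefix, small unique suffix", the closer the prefill savings get to 100%.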
## Code Examples

Enable prefix caching in vLLM:

```python
from vllm import LLM

# Automatically reuse KV-cache blocks for shared prompt prefixes
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_prefix_caching=True,
)
```
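Under the hood, engines like vLLM match prefixes at the granularity of fixed-size KV-cache blocks, hashing each block together with the hash of the block before it so that a hash match implies the entire preceding prefix matches. The sketch below illustrates that chained-hash scheme; the block size and function names are illustrative, not vLLM's API.

```python
BLOCK = 4  # tokens per block (illustrative; real block sizes are larger)

def block_hashes(tokens):
    """Hash each full block chained with its predecessor's hash."""
    hashes = []
    prev = 0
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        prev = hash((prev, tuple(tokens[i:i + BLOCK])))
        hashes.append(prev)
    return hashes

def shared_blocks(a, b):
    """Count leading KV-cache blocks reusable between two token sequences."""
    count = 0
    for x, y in zip(block_hashes(a), block_hashes(b)):
        if x != y:
            break
        count += 1
    return count

# Two prompts sharing the first 8 tokens share 2 cached blocks.
req_a = list(range(12))
req_b = list(range(8)) + [99, 99, 99, 99]
```

Because each block's hash folds in the previous hash, a single lookup per block is enough to prove the whole prefix up to that block is identical, which keeps matching cheap at serving time.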