# KV-Cache Quantization
VRAM: ~50% reduction for the KV-cache
## TL;DR
Quantize cached keys/values to FP8 or INT8. Reduces KV memory by 50% with minimal quality impact.
## Use when
- Long-context inference
- Memory-limited deployment
- Using vLLM
## Skip when
- Maximum quality required
- Very short contexts
KV-cache quantization reduces the memory footprint of cached key-value pairs by storing them in lower precision formats.
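The savings are easy to estimate from the cache-size formula. A minimal sketch, assuming a Llama-2-7B-like configuration (32 layers, 32 KV heads, head dimension 128 are assumed values, not taken from the text above):

```python
# Back-of-envelope KV-cache size: keys AND values (factor of 2) are stored
# for every layer, head, and token position.
def kv_cache_bytes(seq_len, batch, n_layers=32, n_heads=32,
                   head_dim=128, bytes_per_elem=2):
    return 2 * n_layers * n_heads * head_dim * seq_len * batch * bytes_per_elem

fp16 = kv_cache_bytes(seq_len=4096, batch=8)                    # FP16 baseline
fp8 = kv_cache_bytes(seq_len=4096, batch=8, bytes_per_elem=1)   # 8-bit cache

print(fp16 / 2**30, "GiB ->", fp8 / 2**30, "GiB")  # footprint halves
```

Going from 2 bytes to 1 byte per element halves the cache regardless of sequence length or batch size, which is where the ~50% figure comes from.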
## Formats
- **FP8**: Recommended; minimal quality loss
- **INT8**: Same 8-bit footprint; slightly more quality loss on some workloads
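To make the INT8 case concrete, here is a toy per-tensor absmax quantizer for a K/V tensor. This is an illustrative sketch only, not the per-head/per-channel kernels real inference engines use:

```python
import numpy as np

def quantize_int8(x):
    # Per-tensor absmax scaling: map the largest magnitude to +/-127.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original values.
    return q.astype(np.float32) * scale

k = np.random.randn(8, 128).astype(np.float32)   # a toy key tensor
q, s = quantize_int8(k)
err = np.abs(dequantize(q, s) - k).max()

print(q.nbytes, "bytes vs", k.nbytes, "bytes")   # 1 byte vs 4 per element
```

The round-trip error is bounded by half the scale, which is why quality impact stays small when value magnitudes are well behaved.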
## Key Benefits
- **Memory**: ~50% reduction in KV-cache size
- **Throughput**: Larger batches fit in the same VRAM
- **Quality**: <1% impact on most benchmarks
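The throughput bullet follows directly from the memory one: halving bytes per cached element doubles how many sequences fit in a fixed cache budget. A hedged sketch with illustrative (not measured) numbers, again assuming a 32-layer, 32-head, head-dim-128 model:

```python
# How many sequences fit in a fixed VRAM budget reserved for the KV-cache?
def max_batch(vram_budget_bytes, per_seq_kv_bytes):
    return vram_budget_bytes // per_seq_kv_bytes

# Per-sequence cache at 4096 tokens, FP16 (2 bytes/elem), K+V (factor of 2):
per_seq_fp16 = 2 * 32 * 32 * 128 * 4096 * 2   # = 2 GiB per sequence
budget = 24 * 2**30                            # assume 24 GiB left for cache

print(max_batch(budget, per_seq_fp16))         # FP16 batch size
print(max_batch(budget, per_seq_fp16 // 2))    # 8-bit cache: twice as many
```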
## Code Examples
Enable KV-cache quantization in vLLM:

```python
from vllm import LLM

model = LLM(
    model="meta-llama/Llama-2-7b-hf",
    kv_cache_dtype="fp8",  # vLLM also accepts "fp8_e5m2" / "fp8_e4m3"
)
```