KV-Cache Quantization

VRAM: ~50% savings on the KV-cache

TL;DR

Quantize cached keys/values to FP8 or INT8. Reduces KV memory by 50% with minimal quality impact.

Use when

  • +Long context inference
  • +Memory-limited deployment
  • +Using vLLM

Skip when

  • -Maximum quality required
  • -Very short contexts

KV-cache quantization reduces the memory footprint of cached key-value pairs by storing them in lower-precision formats (8-bit instead of the usual 16-bit), halving the per-token cache cost.
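To make the savings concrete, here is a back-of-envelope calculation of KV-cache size for Llama-2-7B (32 layers, 32 KV heads, head dim 128, standard multi-head attention with no GQA). The 2-bytes-per-element FP16 baseline and 1-byte FP8 figure are assumptions for illustration:

```python
# KV-cache size per token = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem
layers, kv_heads, head_dim = 32, 32, 128  # Llama-2-7B config

bytes_per_token_fp16 = 2 * layers * kv_heads * head_dim * 2  # FP16: 2 B/elem
bytes_per_token_fp8 = bytes_per_token_fp16 // 2              # FP8: 1 B/elem

ctx = 4096  # full Llama-2 context window, in tokens
gib = 1024 ** 3
print(f"FP16 KV-cache @ {ctx} tokens: {ctx * bytes_per_token_fp16 / gib:.1f} GiB")  # 2.0 GiB
print(f"FP8  KV-cache @ {ctx} tokens: {ctx * bytes_per_token_fp8 / gib:.1f} GiB")   # 1.0 GiB
```

At a full 4K context that single gigabyte saved is memory that can go straight into larger batches.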

Formats

- **FP8**: Recommended; minimal quality loss
- **INT8**: Same 8-bit footprint; slightly more quality loss, and support varies by framework

Key Benefits

- **Memory**: ~50% reduction in KV-cache size
- **Throughput**: Larger batches fit in the freed memory
- **Quality**: <1% impact on most benchmarks
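The quality claim above can be sanity-checked with a toy round trip. This is a minimal pure-Python sketch of symmetric per-tensor INT8 quantization applied to a fake KV slice; the data and scaling scheme are illustrative, not vLLM's actual kernel:

```python
import random

# Fake KV slice: 512 roughly-Gaussian activations, like one head's row.
random.seed(0)
kv = [random.gauss(0.0, 1.0) for _ in range(512)]

amax = max(abs(v) for v in kv)
scale = amax / 127.0                                        # one scale per tensor
q = [max(-127, min(127, round(v / scale))) for v in kv]     # int8 codes, 1 B each
dq = [c * scale for c in q]                                 # dequantized on read

# Rounding error is at most scale/2, i.e. ~0.4% of the tensor's max value.
rel_err = max(abs(a - b) for a, b in zip(kv, dq)) / amax
print(f"max relative error: {rel_err:.4%}")
```

The worst-case rounding error of scale/2 works out to under half a percent of the tensor's dynamic range, which is why 8-bit KV storage barely moves benchmark scores.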

Code Examples

Enable KV-cache quantization in vLLM:

```python
from vllm import LLM

model = LLM(
    model="meta-llama/Llama-2-7b-hf",
    # "fp8" lets vLLM pick the FP8 variant; "fp8_e5m2" and "fp8_e4m3"
    # select one explicitly. INT8 KV-cache support is framework-dependent.
    kv_cache_dtype="fp8",
)
```