# KV-Cache Quantization
VRAM: ~50% reduction for the KV-cache
## TL;DR
Quantize cached keys/values to FP8 or INT8. Reduces KV memory by 50% with minimal quality impact.
## Use when
- Long-context inference
- Memory-limited deployment
- Using vLLM
## Skip when
- Maximum quality required
- Very short contexts
KV-cache quantization reduces the memory footprint of cached key-value pairs by storing them in lower precision formats.
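The savings are easy to estimate from the cache-size formula. A minimal sketch, assuming a Llama-2-7B-like configuration (32 layers, 32 KV heads, head dimension 128 are assumed values, not taken from the text above):

```python
# Back-of-envelope KV-cache size: keys AND values (factor of 2) are stored
# for every layer, head, and token position.
def kv_cache_bytes(seq_len, batch, n_layers=32, n_heads=32,
                   head_dim=128, bytes_per_elem=2):
    return 2 * n_layers * n_heads * head_dim * seq_len * batch * bytes_per_elem

fp16 = kv_cache_bytes(seq_len=4096, batch=8)                    # FP16 baseline
fp8 = kv_cache_bytes(seq_len=4096, batch=8, bytes_per_elem=1)   # 8-bit cache

print(fp16 / 2**30, "GiB ->", fp8 / 2**30, "GiB")  # footprint halves
```

Going from 2 bytes to 1 byte per element halves the cache regardless of sequence length or batch size, which is where the ~50% figure comes from.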
## Formats
- **FP8**: Recommended; minimal quality loss
- **INT8**: Same 8-bit footprint; slightly more quality loss on some workloads
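To make the INT8 case concrete, here is a toy per-tensor absmax quantizer for a K/V tensor. This is an illustrative sketch only, not the per-head/per-channel kernels real inference engines use:

```python
import numpy as np

def quantize_int8(x):
    # Per-tensor absmax scaling: map the largest magnitude to +/-127.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original values.
    return q.astype(np.float32) * scale

k = np.random.randn(8, 128).astype(np.float32)   # a toy key tensor
q, s = quantize_int8(k)
err = np.abs(dequantize(q, s) - k).max()

print(q.nbytes, "bytes vs", k.nbytes, "bytes")   # 1 byte vs 4 per element
```

The round-trip error is bounded by half the scale, which is why quality impact stays small when value magnitudes are well behaved.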
## Key Benefits
- **Memory**: ~50% reduction in KV-cache size
- **Throughput**: Larger batches fit in the same VRAM
- **Quality**: <1% impact on most benchmarks
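The throughput bullet follows directly from the memory one: halving bytes per cached element doubles how many sequences fit in a fixed cache budget. A hedged sketch with illustrative (not measured) numbers, again assuming a 32-layer, 32-head, head-dim-128 model:

```python
# How many sequences fit in a fixed VRAM budget reserved for the KV-cache?
def max_batch(vram_budget_bytes, per_seq_kv_bytes):
    return vram_budget_bytes // per_seq_kv_bytes

# Per-sequence cache at 4096 tokens, FP16 (2 bytes/elem), K+V (factor of 2):
per_seq_fp16 = 2 * 32 * 32 * 128 * 4096 * 2   # = 2 GiB per sequence
budget = 24 * 2**30                            # assume 24 GiB left for cache

print(max_batch(budget, per_seq_fp16))         # FP16 batch size
print(max_batch(budget, per_seq_fp16 // 2))    # 8-bit cache: twice as many
```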
## Code Examples
Enable KV-cache quantization in vLLM:

```python
from vllm import LLM

model = LLM(
    model="meta-llama/Llama-2-7b-hf",
    kv_cache_dtype="fp8",  # vLLM also accepts "fp8_e5m2" / "fp8_e4m3"
)
```