KV-Cache Optimization
Techniques to reduce the memory footprint of key-value caches
The KV-cache stores attention keys and values from previous tokens, growing linearly with sequence length and batch size. For long-context applications, the KV-cache can consume more VRAM than the model weights themselves. Techniques such as Multi-Query Attention, Grouped-Query Attention, and KV-cache quantization dramatically reduce this overhead while maintaining generation quality.
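A quick back-of-envelope calculation makes the linear growth concrete. The model shape below (32 layers, 32 KV heads, head dimension 128) is an illustrative 7B-class configuration, not a figure from this page:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    """Total KV-cache size: 2 tensors (K and V) per layer, per head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative 7B-class model at 32k context, batch 8, FP16 (2 bytes/element):
total = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                       seq_len=32_768, batch=8)
print(f"{total / 1e9:.1f} GB")  # ~137 GB, far above the ~14 GB of FP16 weights
```

At these settings the cache alone is roughly ten times the size of the weights, which is why the techniques below matter.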
3 Techniques
KV-Cache Quantization
Quantize cached keys/values to FP8 or INT8, reducing KV-cache memory by ~50% relative to FP16 with minimal quality impact.
VRAM: ~50% reduction for the KV-cache
NVIDIA
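A minimal sketch of the idea using symmetric per-tensor INT8 quantization; real implementations typically quantize per-channel or per-head with calibrated scales, and the function names here are illustrative:

```python
import numpy as np

def quantize_kv_int8(kv: np.ndarray) -> tuple[np.ndarray, float]:
    # Symmetric per-tensor quantization: map the largest |value| to 127.
    scale = float(np.abs(kv).max()) / 127.0
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_kv(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# FP16 -> INT8 halves the cache footprint (2 bytes -> 1 byte per element).
keys = np.random.default_rng(0).standard_normal((8, 128)).astype(np.float16)
q, scale = quantize_kv_int8(keys.astype(np.float32))
error = np.abs(dequantize_kv(q, scale) - keys.astype(np.float32)).max()
```

The worst-case rounding error is half the quantization step, which is why quality impact stays small when the value distribution is well behaved.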
Prefix Caching
Cache and reuse KV-cache for shared prompt prefixes. Eliminates redundant computation for repeated context.
Speed: 2-10x for repeated prefixes
NVIDIA, AMD
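A toy illustration of the lookup logic, keyed on exact token prefixes. The class and method names are hypothetical; production systems such as vLLM hash fixed-size blocks rather than whole prefixes:

```python
class PrefixKVCache:
    """Toy prefix cache: maps a token-id prefix to its precomputed KV tensors."""
    def __init__(self):
        self._store = {}  # tuple of token ids -> cached KV blob

    def insert(self, tokens, kv_blob):
        self._store[tuple(tokens)] = kv_blob

    def longest_prefix(self, tokens):
        # Find the longest cached prefix; only the suffix needs prefill compute.
        for end in range(len(tokens), 0, -1):
            kv = self._store.get(tuple(tokens[:end]))
            if kv is not None:
                return end, kv
        return 0, None

cache = PrefixKVCache()
cache.insert([101, 7, 7, 42], "kv-for-system-prompt")  # cache shared prefix once
hit_len, kv = cache.longest_prefix([101, 7, 7, 42, 5, 9])
# hit_len == 4: only the two new tokens require fresh attention computation
```

The speedup scales with how much of each request is shared prefix, which is why long system prompts benefit the most.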
Sliding Window Attention
Limit attention to the most recent tokens only, enabling arbitrarily long contexts with fixed memory. Used in Mistral 7B.
VRAM: bounded regardless of context length
NVIDIA, AMD, Apple +1
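The fixed-memory property follows from restricting each query to a window of recent keys, so the cache can evict anything older than the window. A minimal mask construction, with the window size of 3 chosen arbitrarily for display:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    # True where query i may attend key j: causal AND within the last `window` keys.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=6, window=3)
# Row 5 attends only positions 3, 4, 5 -> older KV entries can be evicted,
# so memory stays O(window) no matter how long generation runs.
```

In practice the cache is often implemented as a rolling buffer of `window` slots that new tokens overwrite in place.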