KV-Cache Optimization
Techniques to reduce the memory footprint of key-value caches
The KV-cache stores attention keys and values from previous tokens, growing linearly with sequence length and batch size. For long-context applications, the KV-cache can consume more VRAM than the model weights themselves. Techniques such as Multi-Query Attention, Grouped-Query Attention, and KV-cache quantization dramatically reduce this overhead while maintaining generation quality.
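A quick back-of-envelope calculation makes the linear growth concrete. The model shape below (32 layers, 32 KV heads, head dimension 128) is an illustrative 7B-class configuration, not a figure from this page:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    """Total KV-cache size: 2 tensors (K and V) per layer, per head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative 7B-class model at 32k context, batch 8, FP16 (2 bytes/element):
total = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                       seq_len=32_768, batch=8)
print(f"{total / 1e9:.1f} GB")  # ~137 GB, far above the ~14 GB of FP16 weights
```

At these settings the cache alone is roughly ten times the size of the weights, which is why the techniques below matter.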
3 Techniques
KV-Cache Quantization
Quantize cached keys/values to FP8 or INT8, reducing KV-cache memory by ~50% relative to FP16 with minimal quality impact.
VRAM: ~50% reduction for the KV-cache
NVIDIA
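A minimal sketch of the idea using symmetric per-tensor INT8 quantization; real implementations typically quantize per-channel or per-head with calibrated scales, and the function names here are illustrative:

```python
import numpy as np

def quantize_kv_int8(kv: np.ndarray) -> tuple[np.ndarray, float]:
    # Symmetric per-tensor quantization: map the largest |value| to 127.
    scale = float(np.abs(kv).max()) / 127.0
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_kv(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# FP16 -> INT8 halves the cache footprint (2 bytes -> 1 byte per element).
keys = np.random.default_rng(0).standard_normal((8, 128)).astype(np.float16)
q, scale = quantize_kv_int8(keys.astype(np.float32))
error = np.abs(dequantize_kv(q, scale) - keys.astype(np.float32)).max()
```

The worst-case rounding error is half the quantization step, which is why quality impact stays small when the value distribution is well behaved.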
Prefix Caching
Cache and reuse KV-cache for shared prompt prefixes. Eliminates redundant computation for repeated context.
Speed: 2-10x for repeated prefixes
NVIDIA, AMD
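A toy illustration of the lookup logic, keyed on exact token prefixes. The class and method names are hypothetical; production systems such as vLLM hash fixed-size blocks rather than whole prefixes:

```python
class PrefixKVCache:
    """Toy prefix cache: maps a token-id prefix to its precomputed KV tensors."""
    def __init__(self):
        self._store = {}  # tuple of token ids -> cached KV blob

    def insert(self, tokens, kv_blob):
        self._store[tuple(tokens)] = kv_blob

    def longest_prefix(self, tokens):
        # Find the longest cached prefix; only the suffix needs prefill compute.
        for end in range(len(tokens), 0, -1):
            kv = self._store.get(tuple(tokens[:end]))
            if kv is not None:
                return end, kv
        return 0, None

cache = PrefixKVCache()
cache.insert([101, 7, 7, 42], "kv-for-system-prompt")  # cache shared prefix once
hit_len, kv = cache.longest_prefix([101, 7, 7, 42, 5, 9])
# hit_len == 4: only the two new tokens require fresh attention computation
```

The speedup scales with how much of each request is shared prefix, which is why long system prompts benefit the most.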
Sliding Window Attention
Limit attention to the most recent tokens only, enabling arbitrarily long contexts with fixed memory. Used in Mistral 7B.
VRAM: bounded regardless of context length
NVIDIA, AMD, Apple +1
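The fixed-memory property follows from restricting each query to a window of recent keys, so the cache can evict anything older than the window. A minimal mask construction, with the window size of 3 chosen arbitrarily for display:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    # True where query i may attend key j: causal AND within the last `window` keys.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=6, window=3)
# Row 5 attends only positions 3, 4, 5 -> older KV entries can be evicted,
# so memory stays O(window) no matter how long generation runs.
```

In practice the cache is often implemented as a rolling buffer of `window` slots that new tokens overwrite in place.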