💾

KV-Cache Optimization

Techniques to reduce memory footprint of key-value caches

The KV-cache stores the attention keys and values computed for previous tokens so they are not recomputed at every decoding step. It grows linearly with both sequence length and batch size, and for long-context workloads it can consume more VRAM than the model weights themselves. Techniques such as Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and KV-cache quantization dramatically reduce this overhead while largely preserving generation quality.
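The linear growth above can be made concrete with a back-of-the-envelope size estimate. This is a minimal sketch, assuming a hypothetical Llama-2-7B-like configuration (32 layers, 32 query heads, head dimension 128); the function name and all numbers are illustrative, not taken from any specific implementation.

```python
def kv_cache_bytes(seq_len, batch, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # 2x for keys and values; one (head_dim)-vector per layer,
    # per KV head, per token, per sequence in the batch.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len * batch

# Full multi-head attention in fp16: each of the 32 query heads
# has its own K/V head.
mha = kv_cache_bytes(seq_len=4096, batch=8, n_layers=32,
                     n_kv_heads=32, head_dim=128, bytes_per_elem=2)

# Grouped-Query Attention: 32 query heads share 8 KV heads,
# shrinking the cache 4x.
gqa = kv_cache_bytes(seq_len=4096, batch=8, n_layers=32,
                     n_kv_heads=8, head_dim=128, bytes_per_elem=2)

# Int8 KV-cache quantization halves the storage again.
gqa_int8 = kv_cache_bytes(seq_len=4096, batch=8, n_layers=32,
                          n_kv_heads=8, head_dim=128, bytes_per_elem=1)

for name, b in [("MHA fp16", mha), ("GQA fp16", gqa), ("GQA int8", gqa_int8)]:
    print(f"{name}: {b / 2**30:.1f} GiB")  # 16.0 / 4.0 / 2.0 GiB
```

Under these assumed dimensions, the fp16 MHA cache alone reaches 16 GiB at a 4K context and batch size 8, which is why reducing KV heads or precision matters more than it might first appear.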

3 Techniques