# Grouped-Query Attention (GQA)

**VRAM:** 4-8x smaller KV-cache vs MHA

## TL;DR
Compromise between MHA and MQA. Groups of query heads share KV heads. Used in Llama 2 70B, Mistral.
## Use when

- Models already use GQA (Llama 2 70B, Mistral)

## Skip when

- Architecture is fixed at training time
Grouped-Query Attention divides the query heads into groups, with each group sharing a single KV head. It trades between the two extremes: MHA's per-head KV quality and MQA's single-KV-head efficiency.
## How It Works

For 32 query heads with 8 groups:

- 4 query heads share each KV head
- Total: 8 KV heads instead of 32
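The sharing scheme above can be sketched in a few lines of NumPy. This is a minimal, illustrative single-sequence forward pass (no masking, no batching, and hypothetical shapes); the key step is repeating each KV head across its group of query heads:

```python
import numpy as np

def gqa_attention(q, k, v):
    """Minimal grouped-query attention sketch.

    q: (n_q_heads, seq, d)    k, v: (n_kv_heads, seq, d)
    Each group of n_q_heads // n_kv_heads query heads shares one KV head.
    """
    n_q_heads, seq, d = q.shape
    n_kv_heads = k.shape[0]
    assert n_q_heads % n_kv_heads == 0
    group = n_q_heads // n_kv_heads  # query heads per KV head

    # Repeat each KV head so it lines up with its group of query heads.
    k_rep = np.repeat(k, group, axis=0)  # (n_q_heads, seq, d)
    v_rep = np.repeat(v, group, axis=0)

    # Standard scaled dot-product attention per query head.
    scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(d)  # (n_q_heads, seq, seq)
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v_rep  # (n_q_heads, seq, d)

# 32 query heads sharing 8 KV heads -> groups of 4, as in the example above.
rng = np.random.default_rng(0)
q = rng.normal(size=(32, 16, 64))
k = rng.normal(size=(8, 16, 64))
v = rng.normal(size=(8, 16, 64))
out = gqa_attention(q, k, v)
print(out.shape)  # (32, 16, 64)
```

In production kernels the repeat is avoided (the smaller K/V tensors are indexed directly), but the stored KV-cache is only `n_kv_heads` wide either way, which is where the memory saving comes from.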
## Key Benefits

- **Quality**: Better than MQA, close to MHA
- **Efficiency**: 4-8x smaller KV-cache than MHA
- **Adoption**: Standard in modern LLMs
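The efficiency claim is simple arithmetic: the KV-cache scales linearly with the number of KV heads. A quick back-of-the-envelope check, using Llama 2 70B's real dimensions (80 layers, 64 query heads, 8 KV heads, head dim 128) and an assumed fp16 cache:

```python
def kv_cache_bytes(n_kv_heads, n_layers=80, seq=4096, head_dim=128, dtype_bytes=2):
    # 2x for storing both K and V at every layer and position.
    return 2 * n_layers * n_kv_heads * seq * head_dim * dtype_bytes

mha = kv_cache_bytes(n_kv_heads=64)  # MHA: one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=8)   # GQA as used in Llama 2 70B
print(mha / gqa)                     # 8.0 -> the 8x end of the 4-8x range
```

Models with fewer query heads per group (e.g. 4 queries per KV head) land at the 4x end of the range.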