Grouped-Query Attention (GQA)

VRAM: 4-8x smaller KV-cache vs MHA

TL;DR

A compromise between MHA and MQA: groups of query heads share a single KV head. Used in Llama 2 70B and Mistral.

Use when

  • +Models already use GQA (Llama 2 70B, Mistral)

Skip when

  • -Architecture is fixed at training time

Grouped-Query Attention divides the query heads into groups, with each group sharing a single KV head. It retains most of MHA's quality while approaching MQA's efficiency.

How It Works

For 32 query heads with 8 groups:

- 4 query heads share each KV head
- Total: 8 KV heads instead of 32
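The grouping above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: each KV head is repeated `group_size` times so that every query head in a group attends to the same K/V, then standard scaled dot-product attention is applied per head.

```python
import numpy as np

def gqa_attention(q, k, v, n_kv_heads):
    """Grouped-query attention sketch.

    q: (n_q_heads, seq_len, head_dim)
    k, v: (n_kv_heads, seq_len, head_dim)
    """
    n_q_heads, seq_len, head_dim = q.shape
    group_size = n_q_heads // n_kv_heads  # query heads per KV head

    # Repeat each KV head so consecutive query heads in a group share it
    k = np.repeat(k, group_size, axis=0)  # -> (n_q_heads, seq_len, head_dim)
    v = np.repeat(v, group_size, axis=0)

    # Standard scaled dot-product attention, per head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# 32 query heads, 8 KV heads -> groups of 4, as in the example above
rng = np.random.default_rng(0)
q = rng.standard_normal((32, 16, 64))
k = rng.standard_normal((8, 16, 64))
v = rng.standard_normal((8, 16, 64))
out = gqa_attention(q, k, v, n_kv_heads=8)
print(out.shape)  # (32, 16, 64)
```

Note that the KV-cache only ever stores the 8 original KV heads; the repeat happens at compute time (real implementations broadcast instead of materializing copies).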

Key Benefits

- **Quality**: Better than MQA, close to MHA
- **Efficiency**: 4-8x smaller KV-cache than MHA
- **Adoption**: Standard in modern LLMs
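The efficiency claim is simple arithmetic: KV-cache size scales linearly with the number of KV heads. A back-of-the-envelope sketch (using Llama 2 70B's published configuration of 80 layers, 64 query heads, 8 KV heads, and head dimension 128; fp16 assumed):

```python
def kv_cache_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    # 2x for storing both K and V per layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# MHA baseline: one KV head per query head (64 KV heads)
mha = kv_cache_bytes_per_token(layers=80, kv_heads=64, head_dim=128)
# GQA as used in Llama 2 70B: 8 KV heads
gqa = kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128)

print(mha // gqa)  # 8 -> an 8x reduction, the upper end of the 4-8x range
```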