Attention Mechanisms
Efficient attention implementations for faster inference
Attention computation dominates LLM inference time, especially for long sequences. Optimized attention mechanisms like Flash Attention 2 and Paged Attention can deliver 2-4x speedups by using memory-efficient algorithms and GPU-optimized kernels. These techniques reduce memory bandwidth bottlenecks and enable longer context windows without proportionally increasing latency.
Flash Attention 2
Memory-efficient attention with IO-aware tiling. 2-4x faster than standard attention; memory scales linearly with sequence length instead of quadratically, because the full attention matrix is never materialized.
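The core idea can be illustrated with a toy numpy sketch (not the real CUDA kernel): an online softmax accumulates the output one KV tile at a time, so only an (n, block)-sized score tile exists at any moment. Function names here are illustrative, not FlashAttention's API.

```python
import numpy as np

def naive_attention(q, k, v):
    # Standard attention: materializes the full (n, n) score matrix.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def flash_attention_sketch(q, k, v, block=4):
    # Blockwise attention with online softmax: scores exist only as
    # (n, block) tiles, so extra memory is linear in sequence length.
    n, d = q.shape
    out = np.zeros_like(v, dtype=np.float64)
    m = np.full(n, -np.inf)   # running row-wise max (for stable softmax)
    l = np.zeros(n)           # running softmax denominator
    for start in range(0, n, block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T / np.sqrt(d)                 # one score tile
        m_new = np.maximum(m, s.max(axis=-1))
        p = np.exp(s - m_new[:, None])
        scale = np.exp(m - m_new)                 # rescale old partial sums
        l = l * scale + p.sum(axis=-1)
        out = out * scale[:, None] + p @ vb
        m = m_new
    return out / l[:, None]
```

The rescaling by exp(m - m_new) is what lets the softmax be computed incrementally without ever seeing all scores at once; the real kernel additionally fuses this loop into on-chip SRAM to cut HBM traffic.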
Paged Attention
Virtual memory for KV-cache. Enables efficient batching and eliminates memory fragmentation. Core innovation in vLLM.
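A toy allocator sketch (class and method names are hypothetical, not vLLM's API) shows the mechanism: each sequence's KV-cache lives in fixed-size physical blocks, and a per-sequence block table maps logical token positions to physical slots, exactly like page tables in virtual memory.

```python
class PagedKVCache:
    """Toy KV-cache allocator: sequences map logical token positions to
    fixed-size physical blocks via a block table (PagedAttention idea)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}  # seq_id -> number of tokens stored

    def append(self, seq_id):
        """Reserve cache space for one new token; allocate a block on demand."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # current block full (or first token)
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def slot(self, seq_id, pos):
        """Physical (block, offset) holding logical token position `pos`."""
        table = self.tables[seq_id]
        return table[pos // self.block_size], pos % self.block_size

    def release(self, seq_id):
        """Return all blocks to the pool; uniform block size means
        freed memory is immediately reusable (no fragmentation)."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks need not be contiguous, many requests of different lengths can share one GPU memory pool, and wasted space is bounded by less than one block per sequence.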
Multi-Query Attention (MQA)
Single KV head shared across all query heads. Shrinks the KV-cache by a factor equal to the head count (typically 8-32x). Used in Falcon, PaLM.
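A minimal numpy sketch (illustrative, not any library's API): queries keep h heads, but a single K and V tensor is broadcast across all of them, so the cache stores one head's worth of K/V instead of h.

```python
import numpy as np

def mqa_attention(q, k, v):
    # q: (h, n, d) -- one query tensor per head.
    # k, v: (n, d) -- a single KV head shared by all h query heads,
    # so the KV-cache is h times smaller than full multi-head attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])        # broadcasts to (h, n, n)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v                                   # (h, n, d)
```

At decode time the saving is what matters: the per-token cache is 2*n*d values instead of 2*h*n*d, which directly raises the feasible batch size and context length.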
Grouped-Query Attention (GQA)
Compromise between MHA and MQA. Groups of query heads share KV heads. Used in Llama 2 70B, Mistral.
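The grouping can be sketched in numpy (illustrative only): with g KV heads, each is shared by h/g consecutive query heads, so g=1 recovers MQA and g=h recovers standard multi-head attention.

```python
import numpy as np

def gqa_attention(q, k, v, groups):
    # q: (h, n, d); k, v: (groups, n, d) with h % groups == 0.
    # Each KV head serves h // groups query heads, cutting the
    # KV-cache by h / groups versus full multi-head attention.
    h, n, d = q.shape
    per_group = h // groups
    kk = np.repeat(k, per_group, axis=0)  # expand KV heads to match queries
    vv = np.repeat(v, per_group, axis=0)
    scores = q @ kk.transpose(0, 2, 1) / np.sqrt(d)  # (h, n, n)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ vv                                     # (h, n, d)
```

Llama 2 70B uses 64 query heads with 8 KV heads (an 8x cache reduction) while staying close to full multi-head quality; the np.repeat here is conceptual, as real kernels index the shared heads directly rather than copying them.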