Attention Mechanisms
Efficient attention implementations for faster inference
Attention computation dominates LLM inference time, especially for long sequences. Optimized attention mechanisms like Flash Attention 2 and Paged Attention can deliver 2-4x speedups by using memory-efficient algorithms and GPU-optimized kernels. These techniques reduce memory bandwidth bottlenecks and enable longer context windows without proportionally increasing latency.
Flash Attention 2
Memory-efficient attention with IO-aware tiling. 2-4x faster than standard attention; memory scales linearly with sequence length instead of quadratically, because the full attention matrix is never materialized.
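The core idea can be illustrated with a toy numpy sketch (not the real CUDA kernel): an online softmax accumulates the output one KV tile at a time, so only an (n, block)-sized score tile exists at any moment. Function names here are illustrative, not FlashAttention's API.

```python
import numpy as np

def naive_attention(q, k, v):
    # Standard attention: materializes the full (n, n) score matrix.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def flash_attention_sketch(q, k, v, block=4):
    # Blockwise attention with online softmax: scores exist only as
    # (n, block) tiles, so extra memory is linear in sequence length.
    n, d = q.shape
    out = np.zeros_like(v, dtype=np.float64)
    m = np.full(n, -np.inf)   # running row-wise max (for stable softmax)
    l = np.zeros(n)           # running softmax denominator
    for start in range(0, n, block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T / np.sqrt(d)                 # one score tile
        m_new = np.maximum(m, s.max(axis=-1))
        p = np.exp(s - m_new[:, None])
        scale = np.exp(m - m_new)                 # rescale old partial sums
        l = l * scale + p.sum(axis=-1)
        out = out * scale[:, None] + p @ vb
        m = m_new
    return out / l[:, None]
```

The rescaling by exp(m - m_new) is what lets the softmax be computed incrementally without ever seeing all scores at once; the real kernel additionally fuses this loop into on-chip SRAM to cut HBM traffic.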
Paged Attention
Virtual memory for KV-cache. Enables efficient batching and eliminates memory fragmentation. Core innovation in vLLM.
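A toy allocator sketch (class and method names are hypothetical, not vLLM's API) shows the mechanism: each sequence's KV-cache lives in fixed-size physical blocks, and a per-sequence block table maps logical token positions to physical slots, exactly like page tables in virtual memory.

```python
class PagedKVCache:
    """Toy KV-cache allocator: sequences map logical token positions to
    fixed-size physical blocks via a block table (PagedAttention idea)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}  # seq_id -> number of tokens stored

    def append(self, seq_id):
        """Reserve cache space for one new token; allocate a block on demand."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # current block full (or first token)
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def slot(self, seq_id, pos):
        """Physical (block, offset) holding logical token position `pos`."""
        table = self.tables[seq_id]
        return table[pos // self.block_size], pos % self.block_size

    def release(self, seq_id):
        """Return all blocks to the pool; uniform block size means
        freed memory is immediately reusable (no fragmentation)."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks need not be contiguous, many requests of different lengths can share one GPU memory pool, and wasted space is bounded by less than one block per sequence.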
Multi-Query Attention (MQA)
Single KV head shared across all query heads. Shrinks the KV-cache by a factor equal to the head count (typically 8-32x). Used in Falcon, PaLM.
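A minimal numpy sketch (illustrative, not any library's API): queries keep h heads, but a single K and V tensor is broadcast across all of them, so the cache stores one head's worth of K/V instead of h.

```python
import numpy as np

def mqa_attention(q, k, v):
    # q: (h, n, d) -- one query tensor per head.
    # k, v: (n, d) -- a single KV head shared by all h query heads,
    # so the KV-cache is h times smaller than full multi-head attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])        # broadcasts to (h, n, n)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v                                   # (h, n, d)
```

At decode time the saving is what matters: the per-token cache is 2*n*d values instead of 2*h*n*d, which directly raises the feasible batch size and context length.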
Grouped-Query Attention (GQA)
Compromise between MHA and MQA. Groups of query heads share KV heads. Used in Llama 2 70B, Mistral.
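The grouping can be sketched in numpy (illustrative only): with g KV heads, each is shared by h/g consecutive query heads, so g=1 recovers MQA and g=h recovers standard multi-head attention.

```python
import numpy as np

def gqa_attention(q, k, v, groups):
    # q: (h, n, d); k, v: (groups, n, d) with h % groups == 0.
    # Each KV head serves h // groups query heads, cutting the
    # KV-cache by h / groups versus full multi-head attention.
    h, n, d = q.shape
    per_group = h // groups
    kk = np.repeat(k, per_group, axis=0)  # expand KV heads to match queries
    vv = np.repeat(v, per_group, axis=0)
    scores = q @ kk.transpose(0, 2, 1) / np.sqrt(d)  # (h, n, n)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ vv                                     # (h, n, d)
```

Llama 2 70B uses 64 query heads with 8 KV heads (an 8x cache reduction) while staying close to full multi-head quality; the np.repeat here is conceptual, as real kernels index the shared heads directly rather than copying them.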