Sliding Window Attention
VRAM: bounded regardless of context length
TL;DR
Limit each token's attention to the most recent tokens only. Bounds KV-cache memory for arbitrarily long sequences. Used in Mistral 7B.
Use when
- Model uses sliding window attention (e.g., Mistral)
Skip when
- Architecture is fixed at training time; a full-attention model cannot be switched to a sliding window after the fact
Sliding Window Attention restricts each token to attend only to the previous W tokens, where W is the window size. This bounds KV-cache memory regardless of sequence length.
How It Works
- Each layer attends to a window of the previous W tokens
- Information still propagates further through depth: with L layers and window W, the effective receptive field is L × W
- Mistral 7B: W = 4096, L = 32 → effective context ≈ 128K tokens
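The per-layer window can be sketched as an attention mask. This is a minimal NumPy illustration (function name and sizes are my own, not from any particular library): query token i may attend to key token j only if j is causal (j ≤ i) and within the window (i − j < W).

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    # True where query i may attend to key j:
    # causal (j <= i) and within the window (i - j < window).
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

mask = sliding_window_mask(seq_len=8, window=3)
# Token 5 attends only to tokens 3, 4, 5 (itself + 2 previous).
print(mask[5].astype(int))  # [0 0 0 1 1 1 0 0]
```

Each row of the mask has at most W true entries, which is why the KV-cache only ever needs to hold the last W tokens per layer.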
Key Benefits
- **Memory**: KV-cache size is fixed at W tokens per layer
- **Long context**: sequence length is no longer bounded by cache memory (though information older than L × W tokens is lost)
- **Speed**: attention cost drops from O(n²) to O(n · W) for sequence length n