Sliding Window Attention

VRAM: Bounded regardless of context length

TL;DR

Limit each token's attention to the most recent tokens only. Enables arbitrarily long context with fixed KV-cache memory. Used in Mistral.

Use when

  • +Model uses sliding window (Mistral)

Skip when

  • -Architecture is fixed at training time

Sliding Window Attention restricts each token to attending only to the W most recent tokens, where W is the window size. This bounds KV-cache memory regardless of sequence length.
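A minimal sketch of the sliding-window attention mask in NumPy. The function name and shapes are illustrative, not from any particular library; the point is that position i may only attend to positions in (i − W, i].

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask: position i may attend to position j iff i - window < j <= i.

    Causal (no attending to the future) and windowed (no attending
    further back than `window` tokens).
    """
    i = np.arange(seq_len)[:, None]  # query positions, column vector
    j = np.arange(seq_len)[None, :]  # key positions, row vector
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(8, 3)
# row 5 attends to positions 3, 4, 5 only
```

In a real attention layer this mask would be applied by setting disallowed scores to -inf before the softmax.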

How It Works

- Each layer attends to a window of W tokens
- With L layers, information propagates one window per layer, so the effective context is L × W
- Mistral: W = 4096, L = 32 → effective context of 128K tokens
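The effective-context arithmetic above works out as follows (the numbers are Mistral's published configuration; the variable names are just for illustration):

```python
window = 4096   # W: sliding window size per layer
layers = 32     # L: number of transformer layers

# Each layer can pull in information from one window further back,
# so after L layers a token can be influenced by up to L * W positions.
effective_context = layers * window
print(effective_context)  # 131072, i.e. 128K tokens
```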

Key Benefits

- **Memory**: Fixed KV-cache size
- **Infinite Context**: No memory limit on sequence length
- **Speed**: Faster attention for long sequences
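The fixed KV-cache size is typically realized with a rolling buffer: once the buffer is full, the oldest entry is overwritten. A minimal sketch, assuming a single head and a hypothetical `RollingKVCache` class (not any library's actual API):

```python
import numpy as np

class RollingKVCache:
    """Fixed-size KV cache: keeps only the last `window` key/value vectors."""

    def __init__(self, window: int, d_head: int):
        self.window = window
        self.k = np.zeros((window, d_head))
        self.v = np.zeros((window, d_head))
        self.pos = 0  # total tokens seen so far

    def append(self, k_new: np.ndarray, v_new: np.ndarray) -> None:
        slot = self.pos % self.window  # wrap around: overwrite the oldest slot
        self.k[slot] = k_new
        self.v[slot] = v_new
        self.pos += 1

    def size(self) -> int:
        """Number of valid entries: grows until the window fills, then stays fixed."""
        return min(self.pos, self.window)

cache = RollingKVCache(window=4096, d_head=64)
for t in range(10_000):
    cache.append(np.random.randn(64), np.random.randn(64))
print(cache.size())  # stays at 4096 no matter how many tokens stream in
```

Memory is allocated once at construction time, which is why VRAM stays bounded however long the sequence runs.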