CPU Offloading
TL;DR
Store weights in RAM, load to GPU layer-by-layer. Run large models on limited VRAM.
Use when
- Model larger than VRAM
- Testing large models
- Non-production use
Skip when
- Production latency requirements
- Have sufficient VRAM
CPU offloading stores model weights in system RAM and transfers them to GPU as needed for computation. This enables running models larger than GPU memory.
How It Works
- Weights stored in pinned CPU memory
- Transferred to GPU layer-by-layer during inference
- Overlapped with computation when possible
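The streaming loop above can be sketched in plain Python. This is a minimal simulation, not real GPU code: the layer count and hidden size are made up for illustration, NumPy arrays stand in for pinned host memory, and `np.copyto` into a single preallocated buffer stands in for the host-to-device transfer (in a real system, a copy like `cudaMemcpyAsync` that can overlap with compute on another stream).

```python
import numpy as np

# Hypothetical model dimensions, chosen only for this sketch.
HIDDEN = 8
N_LAYERS = 4

rng = np.random.default_rng(0)

# "Pinned CPU memory": every layer's weights stay resident in host RAM.
cpu_weights = [rng.standard_normal((HIDDEN, HIDDEN)) for _ in range(N_LAYERS)]

# "GPU": a single preallocated buffer large enough for one layer at a time.
gpu_buffer = np.empty((HIDDEN, HIDDEN))

def forward(x):
    """Run inference by streaming each layer into the GPU buffer in turn."""
    for layer in cpu_weights:
        # Host-to-device copy; in a real pipeline the copy of layer i+1
        # would be overlapped with the compute of layer i.
        np.copyto(gpu_buffer, layer)
        # Compute using only the weights currently resident "on the GPU".
        x = np.tanh(x @ gpu_buffer)
    return x

x = rng.standard_normal(HIDDEN)
y = forward(x)
```

The key point the sketch shows: peak "GPU" memory is one layer (`gpu_buffer`), not the whole model, which is exactly why speed drops — every layer crosses the bus on every forward pass.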
Trade-offs
- **Pro**: Run any model size with enough RAM
- **Con**: Significant speed reduction
- **Best for**: Testing, development, memory-constrained deployment
Code Examples
CPU offloading with llama.cpp:

```bash
# Keep 12 layers on GPU; all remaining layers run on the CPU
./llama-cli -m model.gguf \
  --n-gpu-layers 12 \
  -p "Hello"
```
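To pick a value for `--n-gpu-layers`, a rough sizing rule is free VRAM minus some headroom, divided by per-layer weight size. A hedged helper, where every number (per-layer size, headroom) is an illustrative assumption rather than a measurement:

```python
def gpu_layers_that_fit(vram_free_gib, per_layer_gib, overhead_gib=1.0):
    """Estimate how many layers fit in free VRAM, reserving headroom
    for the KV cache, activations, and CUDA context (all assumed sizes)."""
    usable = vram_free_gib - overhead_gib
    if usable <= 0 or per_layer_gib <= 0:
        return 0
    return int(usable // per_layer_gib)

# Example: 8 GiB free, ~0.5 GiB per layer (hypothetical quantized model).
n = gpu_layers_that_fit(8.0, 0.5)
```

In practice, start from an estimate like this, then adjust: if the run hits out-of-memory, lower the layer count; if VRAM is left over, raise it.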