CPU Offloading

TL;DR

Store weights in RAM, load to GPU layer-by-layer. Run large models on limited VRAM.

Use when

  • + Model larger than VRAM
  • + Testing large models
  • + Non-production use

Skip when

  • - Production latency requirements
  • - Sufficient VRAM already available

CPU offloading stores model weights in system RAM and transfers them to GPU as needed for computation. This enables running models larger than GPU memory.

How It Works

- Weights stored in pinned (page-locked) CPU memory
- Transferred to GPU layer-by-layer during inference
- Transfers overlapped with computation when possible

Trade-offs

- **Pro**: Run any model size with enough system RAM
- **Con**: Significant speed reduction (transfers over the PCIe bus dominate)
- **Best for**: Testing, development, memory-constrained deployment
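The speed cost can be bounded from bus bandwidth: if every generated token must restream the offloaded weights from RAM to GPU, decode speed cannot exceed bandwidth divided by weight size. A rough estimate (the 14 GB fp16 weight size for a 7B model and the ~25 GB/s PCIe 4.0 x16 figure are illustrative assumptions):

```python
def tokens_per_second(weight_bytes, bus_bytes_per_sec):
    """Upper bound on decode speed when every token must restream
    all offloaded weights over the CPU->GPU bus."""
    return bus_bytes_per_sec / weight_bytes

# e.g. a 7B model in fp16: ~14e9 bytes of weights, PCIe 4.0 x16 ~ 25e9 B/s
rate = tokens_per_second(14e9, 25e9)
print(round(rate, 2))  # ~1.79 tokens/s: transfer-bound, far below VRAM-resident speed
```

Overlap with computation hides some of this, but the bus remains the ceiling, which is why the technique suits testing rather than latency-sensitive production.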

Code Examples

CPU offloading with llama.cpp:

```bash
# Keep 12 layers on the GPU; the remaining layers run on the CPU.
# (Comments must not follow a trailing backslash, or the line
# continuation breaks.)
./llama-cli -m model.gguf \
  --n-gpu-layers 12 \
  -p "Hello"
```