🧠 Memory Management
Distribute and offload model weights across devices
Large models often exceed single-GPU memory capacity. Tensor parallelism shards each layer's weights across multiple GPUs so they compute in parallel, speeding up inference. Pipeline parallelism assigns contiguous groups of layers to different GPUs, enabling serving on smaller GPUs by processing stages in sequence. CPU offloading lets you run models larger than your total GPU VRAM by streaming weights from system RAM on demand.
3 Techniques
Tensor Parallelism
Split each layer's weights across GPUs. Per-GPU memory scales down linearly for models too large for a single GPU.
VRAM: linear with GPU count (2 GPUs ≈ 50% each)
NVIDIA · AMD
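The column split behind tensor parallelism can be shown with a toy sketch in pure Python, assuming a linear layer `y = x @ W`: each "GPU" holds a column slice of `W`, computes its slice of `y`, and concatenating the slices reconstructs the full output. All names here are illustrative, not a real framework API.

```python
# Toy sketch of tensor (column) parallelism for a linear layer y = x @ W.
# Each "GPU" stores only its column shard of W, so per-device memory
# shrinks linearly with the number of shards.

def matmul(x, w):
    """Multiply a vector x (list) by a matrix w (list of rows)."""
    cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(cols)]

def split_columns(w, num_gpus):
    """Shard a weight matrix column-wise across num_gpus devices."""
    per = len(w[0]) // num_gpus
    return [[row[g * per:(g + 1) * per] for row in w] for g in range(num_gpus)]

x = [1.0, 2.0]
W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

shards = split_columns(W, num_gpus=2)              # each shard is 2x2
partials = [matmul(x, shard) for shard in shards]  # one partial per "GPU"
y = [v for p in partials for v in p]               # concatenate along columns
assert y == matmul(x, W)                           # matches unsharded result
print(y)  # [11.0, 14.0, 17.0, 20.0]
```

In a real system the concatenation is an all-gather over a fast interconnect, which is why tensor parallelism favors NVLink-class bandwidth.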
Pipeline Parallelism
Split the model by layers across GPUs. Better suited to slow interconnects than tensor parallelism.
NVIDIA · AMD
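The stage layout can be sketched in pure Python: layers are grouped into contiguous stages (one stage per "GPU"), and micro-batches flow through the stages in order. The layer functions and stage assignment here are illustrative stand-ins, not a real framework API.

```python
# Toy sketch of pipeline parallelism: 4 layers split into 2 stages.
# Each stage only needs to hold its own layers' weights.

def make_layer(k):
    return lambda v: v + k  # stand-in for a transformer layer

layers = [make_layer(k) for k in range(1, 5)]  # 4 layers adding 1..4
num_stages = 2
per_stage = len(layers) // num_stages
stages = [layers[s * per_stage:(s + 1) * per_stage] for s in range(num_stages)]

def run_stage(stage, v):
    for layer in stage:
        v = layer(v)
    return v

def pipeline_forward(microbatches):
    # Simplest possible schedule: each micro-batch traverses every stage.
    # A real scheduler overlaps micro-batches so stages work concurrently.
    outputs = []
    for mb in microbatches:
        v = mb
        for stage in stages:
            v = run_stage(stage, v)  # in practice: send activations to next GPU
        outputs.append(v)
    return outputs

print(pipeline_forward([0, 10]))  # [10, 20]
```

Only the (small) activations cross the GPU boundary between stages, which is why pipeline parallelism tolerates slower interconnects than tensor parallelism.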
CPU Offloading
Store weights in RAM, load them to the GPU layer by layer. Run large models on limited VRAM.
NVIDIA · AMD · Apple
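The streaming pattern can be sketched in pure Python: weights live in "RAM" (a dict) and only one layer at a time is resident in "VRAM". The `load_to_gpu` helper here just moves entries between dicts; a real implementation copies tensors across the PCIe bus, which is the main cost of this technique. All names are illustrative.

```python
# Toy sketch of CPU offloading: 6 layers of weights stay in system RAM,
# and each layer is streamed into a 1-layer "VRAM" slot just before use.

cpu_ram = {f"layer{i}": [float(i)] * 4 for i in range(6)}  # all weights in RAM
gpu_vram = {}   # limited "VRAM": holds at most one layer at a time
transfers = 0   # count host-to-device copies (the bottleneck in practice)

def load_to_gpu(name):
    global transfers
    gpu_vram.clear()                # evict whatever layer was resident
    gpu_vram[name] = cpu_ram[name]  # stream this layer's weights in
    transfers += 1

def forward(x):
    for i in range(6):
        name = f"layer{i}"
        load_to_gpu(name)             # weights arrive just in time
        x = x + sum(gpu_vram[name])   # stand-in for the layer's compute
    return x

out = forward(0.0)
print(out, transfers)  # 60.0 6
```

Peak "VRAM" use is a single layer regardless of model size; the trade-off is one weight transfer per layer per forward pass.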