🧠 Memory Management
Distribute and offload model weights across devices
Large models often exceed single-GPU memory capacity. Tensor parallelism shards each layer's weights across multiple GPUs so they compute in parallel, speeding up inference. Pipeline parallelism assigns contiguous groups of layers to different GPUs, enabling serving on smaller GPUs by processing stages in sequence. CPU offloading lets you run models larger than your total GPU VRAM by streaming weights from system RAM on demand.
3 Techniques
Tensor Parallelism
Split each layer's weights across GPUs. Per-GPU memory scales down linearly for models too large for a single GPU.
VRAM: linear with GPU count (2 GPUs ≈ 50% each)
NVIDIA · AMD
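The column split behind tensor parallelism can be shown with a toy sketch in pure Python, assuming a linear layer `y = x @ W`: each "GPU" holds a column slice of `W`, computes its slice of `y`, and concatenating the slices reconstructs the full output. All names here are illustrative, not a real framework API.

```python
# Toy sketch of tensor (column) parallelism for a linear layer y = x @ W.
# Each "GPU" stores only its column shard of W, so per-device memory
# shrinks linearly with the number of shards.

def matmul(x, w):
    """Multiply a vector x (list) by a matrix w (list of rows)."""
    cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(cols)]

def split_columns(w, num_gpus):
    """Shard a weight matrix column-wise across num_gpus devices."""
    per = len(w[0]) // num_gpus
    return [[row[g * per:(g + 1) * per] for row in w] for g in range(num_gpus)]

x = [1.0, 2.0]
W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

shards = split_columns(W, num_gpus=2)              # each shard is 2x2
partials = [matmul(x, shard) for shard in shards]  # one partial per "GPU"
y = [v for p in partials for v in p]               # concatenate along columns
assert y == matmul(x, W)                           # matches unsharded result
print(y)  # [11.0, 14.0, 17.0, 20.0]
```

In a real system the concatenation is an all-gather over a fast interconnect, which is why tensor parallelism favors NVLink-class bandwidth.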
Pipeline Parallelism
Split the model by layers across GPUs. Better suited to slow interconnects than tensor parallelism.
NVIDIA · AMD
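The stage layout can be sketched in pure Python: layers are grouped into contiguous stages (one stage per "GPU"), and micro-batches flow through the stages in order. The layer functions and stage assignment here are illustrative stand-ins, not a real framework API.

```python
# Toy sketch of pipeline parallelism: 4 layers split into 2 stages.
# Each stage only needs to hold its own layers' weights.

def make_layer(k):
    return lambda v: v + k  # stand-in for a transformer layer

layers = [make_layer(k) for k in range(1, 5)]  # 4 layers adding 1..4
num_stages = 2
per_stage = len(layers) // num_stages
stages = [layers[s * per_stage:(s + 1) * per_stage] for s in range(num_stages)]

def run_stage(stage, v):
    for layer in stage:
        v = layer(v)
    return v

def pipeline_forward(microbatches):
    # Simplest possible schedule: each micro-batch traverses every stage.
    # A real scheduler overlaps micro-batches so stages work concurrently.
    outputs = []
    for mb in microbatches:
        v = mb
        for stage in stages:
            v = run_stage(stage, v)  # in practice: send activations to next GPU
        outputs.append(v)
    return outputs

print(pipeline_forward([0, 10]))  # [10, 20]
```

Only the (small) activations cross the GPU boundary between stages, which is why pipeline parallelism tolerates slower interconnects than tensor parallelism.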
CPU Offloading
Store weights in RAM, load them to the GPU layer by layer. Run large models on limited VRAM.
NVIDIA · AMD · Apple
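The streaming pattern can be sketched in pure Python: weights live in "RAM" (a dict) and only one layer at a time is resident in "VRAM". The `load_to_gpu` helper here just moves entries between dicts; a real implementation copies tensors across the PCIe bus, which is the main cost of this technique. All names are illustrative.

```python
# Toy sketch of CPU offloading: 6 layers of weights stay in system RAM,
# and each layer is streamed into a 1-layer "VRAM" slot just before use.

cpu_ram = {f"layer{i}": [float(i)] * 4 for i in range(6)}  # all weights in RAM
gpu_vram = {}   # limited "VRAM": holds at most one layer at a time
transfers = 0   # count host-to-device copies (the bottleneck in practice)

def load_to_gpu(name):
    global transfers
    gpu_vram.clear()                # evict whatever layer was resident
    gpu_vram[name] = cpu_ram[name]  # stream this layer's weights in
    transfers += 1

def forward(x):
    for i in range(6):
        name = f"layer{i}"
        load_to_gpu(name)             # weights arrive just in time
        x = x + sum(gpu_vram[name])   # stand-in for the layer's compute
    return x

out = forward(0.0)
print(out, transfers)  # 60.0 6
```

Peak "VRAM" use is a single layer regardless of model size; the trade-off is one weight transfer per layer per forward pass.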