🧠

Memory Management

Distribute and offload model weights across devices

Large models often exceed the memory capacity of a single GPU. Tensor parallelism shards the weight matrices within each layer across multiple GPUs, speeding up inference. Pipeline parallelism assigns contiguous groups of layers to different GPUs, enabling serving on smaller GPUs by processing the stages in sequence. CPU offloading lets you run models larger than your total GPU VRAM by keeping weights in host memory and streaming them to the GPU on demand.
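The streaming idea behind CPU offloading can be sketched in a few lines. This is a minimal pure-Python illustration, not a real framework API: `run_with_offload`, `gpu_capacity`, and the multiply stand-in for layer compute are all hypothetical names chosen for this sketch. Weights live on the "CPU" side and are copied into a bounded "GPU" cache only for the layer currently executing, with older layers evicted to stay within the VRAM budget.

```python
# Hypothetical sketch of on-demand weight streaming (CPU offloading).
# The "GPU" is simulated as a small dict with a fixed capacity; real
# systems would issue host-to-device copies (often overlapped with
# compute) instead of Python dict operations.

def run_with_offload(layers, x, gpu_capacity=1):
    """Run `x` through `layers`, keeping at most `gpu_capacity` layers resident."""
    gpu_cache = {}  # layer index -> weights currently resident "on GPU"
    for i, weights in enumerate(layers):
        if i not in gpu_cache:
            if len(gpu_cache) >= gpu_capacity:
                # Evict the oldest resident layer to free "VRAM".
                gpu_cache.pop(next(iter(gpu_cache)))
            gpu_cache[i] = weights  # simulated host-to-device copy
        w = gpu_cache[i]
        x = [xi * w for xi in x]  # stand-in for the layer's actual compute
    return x
```

Even with `gpu_capacity=1`, the whole model runs, because each layer only needs to be resident while it executes; the trade-off is the transfer latency paid on every layer.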

3 Techniques