Pipeline Parallelism
TL;DR
Split model by layers across GPUs. Better for slow interconnects than tensor parallelism.
Use when
- +Slow GPU interconnect (PCIe)
- +Very large models
- +Training (less common for inference)
Skip when
- -Have NVLink/NVSwitch
- -Can use tensor parallelism
Pipeline Parallelism (PP) assigns different layers to different GPUs. Activations are passed sequentially between GPUs.
How It Works
- Model is split into stages, each a group of consecutive layers
- Each GPU holds one stage
- Activations are passed between GPUs sequentially
- The input batch is typically split into microbatches so stages can overlap work
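The staging scheme above can be sketched in a few lines of plain Python. This is illustrative only: the toy "layers" and the two-stage split stand in for real GPU-resident layer groups, and the stage hand-off stands in for a GPU-to-GPU activation transfer.

```python
# Minimal sketch of pipeline parallelism (illustrative, CPU-only):
# the model's layers are partitioned into consecutive "stages", each of
# which would live on its own GPU; activations flow through in order.

def layer(scale):
    # A toy "layer": multiplies its input by a constant.
    return lambda x: x * scale

# Eight layers split into two stages of four consecutive layers each,
# as if stage 0 lived on GPU 0 and stage 1 on GPU 1.
stages = [
    [layer(2), layer(2), layer(2), layer(2)],  # stage 0 ("GPU 0")
    [layer(3), layer(3), layer(3), layer(3)],  # stage 1 ("GPU 1")
]

def forward(x):
    # Activations pass sequentially from stage to stage; in a real setup
    # the hand-off between stages is the inter-GPU communication step.
    for stage in stages:
        for f in stage:
            x = f(x)
    return x

print(forward(1))  # 2**4 * 3**4 = 1296
```

Note that only the activations cross the stage boundary (one tensor per hand-off), which is why PP needs far less communication bandwidth than tensor parallelism.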
Trade-offs
- **Pro**: Less communication than tensor parallelism
- **Con**: Pipeline bubbles (stages idling at the start and end of each batch) reduce utilization
- **Best for**: Slow GPU interconnects
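The bubble cost can be estimated with a back-of-envelope formula. For a GPipe-style schedule with `p` stages and `m` microbatches, a common approximation of the idle fraction is `(p - 1) / (m + p - 1)`; the exact numbers here are from that approximation, not a measurement.

```python
# Estimate the pipeline "bubble" (idle fraction) for a GPipe-style
# schedule: (p - 1) / (m + p - 1) for p stages and m microbatches.
# More microbatches shrink the bubble; more stages grow it.

def bubble_fraction(stages: int, microbatches: int) -> float:
    return (stages - 1) / (microbatches + stages - 1)

print(round(bubble_fraction(4, 4), 3))   # 0.429 -> ~43% idle
print(round(bubble_fraction(4, 32), 3))  # 0.086 -> ~9% idle
```

This is why PP is used with many microbatches per step, and why it suits training better than latency-sensitive inference, where small batches leave the bubble large.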