Pipeline Parallelism

TL;DR

Split model by layers across GPUs. Better for slow interconnects than tensor parallelism.

Use when

  • Slow GPU interconnect (PCIe)
  • Very large models
  • Training (less common for inference)

Skip when

  • Have NVLink/NVSwitch
  • Can use tensor parallelism

Pipeline Parallelism (PP) assigns different layers to different GPUs. Activations are passed sequentially between GPUs.

How It Works

- Model split into stages by layer groups
- Each GPU holds consecutive layers
- Activations passed between GPUs sequentially
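The staged forward pass can be sketched with simulated devices. This is a minimal illustration, not a framework API: the toy layers, the two-way split, and the `run_stage` helper are all invented for this example; real systems (e.g. DeepSpeed, Megatron-LM) add microbatching and asynchronous communication on top.

```python
def make_layer(w):
    # Toy "layer": scale the activation by a weight.
    return lambda x: x * w

# An 8-layer "model" split into 2 stages of 4 consecutive layers each.
layers = [make_layer(w) for w in [1, 2, 1, 2, 1, 2, 1, 2]]
stage0, stage1 = layers[:4], layers[4:]

def run_stage(stage, activation):
    # One GPU runs only its own consecutive layer group.
    for layer in stage:
        activation = layer(activation)
    return activation

def forward(x):
    a = run_stage(stage0, x)      # computed on GPU 0
    # <- activation crosses the interconnect here ->
    return run_stage(stage1, a)   # computed on GPU 1

print(forward(1.0))  # 16.0 (input scaled by 2 four times)
```

Note that only one activation tensor per boundary crosses the interconnect per step, which is why this scheme tolerates slow links better than tensor parallelism's per-layer all-reduces.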

Trade-offs

- **Pro**: Less communication than tensor parallelism
- **Con**: Pipeline bubbles (stages idling while waiting for upstream activations) reduce utilization
- **Best for**: Slow GPU interconnects
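The bubble cost can be quantified. For a naive GPipe-style schedule with p stages and m microbatches, the idle fraction is (p - 1) / (m + p - 1) — the standard estimate; `bubble_fraction` below is just an illustrative helper name.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    # Idle fraction of a naive (GPipe-style) pipeline schedule:
    # (p - 1) / (m + p - 1), where p = stages, m = microbatches.
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

print(bubble_fraction(4, 4))   # ≈ 0.43: nearly half the time idle
print(bubble_fraction(4, 32))  # ≈ 0.086: more microbatches shrink bubbles
```

This is why pipeline-parallel training splits each batch into many microbatches: the bubble shrinks as m grows relative to p.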