# Tensor Parallelism

**VRAM**: Linear with GPU count (2 GPUs = 50% each)
## TL;DR

Split each layer's weight matrices across GPUs. Near-linear memory scaling for models too large for a single GPU.
## Use when

- Model doesn't fit on a single GPU
- You have multiple GPUs with a fast interconnect (e.g. NVLink)

## Skip when

- Model fits on one GPU
- GPUs have a slow interconnect (e.g. PCIe across nodes)
Tensor Parallelism (TP) splits the weight matrices of individual layers across multiple GPUs, so models larger than a single GPU's memory can still run.
## How It Works
- Weight matrices are split column-wise or row-wise
- Each GPU computes its partition
- Results are combined via all-reduce
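The steps above can be sketched with NumPy on a single machine. This is an illustrative simulation, not vLLM's internals: it shows the Megatron-style pattern of a column-parallel linear layer followed by a row-parallel one, with the all-reduce modeled as a plain sum.

```python
import numpy as np

# Simulate 2 "GPUs"; each holds only its shard of each weight matrix.
rng = np.random.default_rng(0)
n_gpus = 2
x = rng.standard_normal((1, 8))        # input activation (replicated)
W1 = rng.standard_normal((8, 16))      # first weight, split column-wise
W2 = rng.standard_normal((16, 8))      # second weight, split row-wise

# Column-parallel: each GPU produces a slice of the output features.
w1_shards = np.split(W1, n_gpus, axis=1)
h_shards = [x @ w for w in w1_shards]  # no communication needed yet

# Row-parallel: each GPU consumes its local slice; partial results
# are combined via all-reduce (here, a plain sum).
w2_shards = np.split(W2, n_gpus, axis=0)
partials = [h @ w for h, w in zip(h_shards, w2_shards)]
y = sum(partials)                      # the all-reduce step

# The sharded computation matches the unsharded reference.
assert np.allclose(y, x @ W1 @ W2)
```

Pairing column-parallel with row-parallel layers is what keeps communication down to one all-reduce per layer pair rather than one per matrix multiply.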
## Key Benefits
- **Memory**: Linear reduction with GPU count
- **Latency**: Lower than pipeline parallelism
- **Utilization**: All GPUs active simultaneously
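A rough back-of-the-envelope check of the memory claim, counting only fp16 weights (the 70B figure and the omission of KV cache and activations are simplifying assumptions):

```python
# Per-GPU weight memory for a 70B-parameter model in fp16/bf16,
# ignoring KV cache and activation overhead (illustrative only).
params = 70e9
bytes_per_param = 2  # fp16/bf16

for tp in (1, 2, 4, 8):
    gib = params * bytes_per_param / tp / 2**30
    print(f"tensor_parallel_size={tp}: ~{gib:.0f} GiB per GPU")
```

At `tensor_parallel_size=4`, weights alone drop to roughly 33 GiB per GPU, which is why a 70B model that cannot load on one 80 GiB card fits comfortably across four.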
## Code Examples
Enable tensor parallelism in vLLM:

```python
from vllm import LLM

# Shard Llama 2 70B across 4 GPUs
model = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=4,
)
```