# Tensor Parallelism

**VRAM**: Linear with GPU count (2 GPUs = 50% each)
## TL;DR

Split each layer's weight matrices across GPUs. Near-linear memory scaling for models too large for a single GPU.
## Use when

- Model doesn't fit on a single GPU
- You have multiple GPUs with a fast interconnect (e.g. NVLink)

## Skip when

- Model fits on one GPU
- GPUs have a slow interconnect (e.g. PCIe across nodes)
Tensor Parallelism (TP) splits the weight matrices of individual layers across multiple GPUs, so models larger than a single GPU's memory can still run.
## How It Works
- Weight matrices are split column-wise or row-wise
- Each GPU computes its partition
- Results are combined via all-reduce
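The steps above can be sketched with NumPy on a single machine. This is an illustrative simulation, not vLLM's internals: it shows the Megatron-style pattern of a column-parallel linear layer followed by a row-parallel one, with the all-reduce modeled as a plain sum.

```python
import numpy as np

# Simulate 2 "GPUs"; each holds only its shard of each weight matrix.
rng = np.random.default_rng(0)
n_gpus = 2
x = rng.standard_normal((1, 8))        # input activation (replicated)
W1 = rng.standard_normal((8, 16))      # first weight, split column-wise
W2 = rng.standard_normal((16, 8))      # second weight, split row-wise

# Column-parallel: each GPU produces a slice of the output features.
w1_shards = np.split(W1, n_gpus, axis=1)
h_shards = [x @ w for w in w1_shards]  # no communication needed yet

# Row-parallel: each GPU consumes its local slice; partial results
# are combined via all-reduce (here, a plain sum).
w2_shards = np.split(W2, n_gpus, axis=0)
partials = [h @ w for h, w in zip(h_shards, w2_shards)]
y = sum(partials)                      # the all-reduce step

# The sharded computation matches the unsharded reference.
assert np.allclose(y, x @ W1 @ W2)
```

Pairing column-parallel with row-parallel layers is what keeps communication down to one all-reduce per layer pair rather than one per matrix multiply.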
## Key Benefits
- **Memory**: Linear reduction with GPU count
- **Latency**: Lower than pipeline parallelism
- **Utilization**: All GPUs active simultaneously
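A rough back-of-the-envelope check of the memory claim, counting only fp16 weights (the 70B figure and the omission of KV cache and activations are simplifying assumptions):

```python
# Per-GPU weight memory for a 70B-parameter model in fp16/bf16,
# ignoring KV cache and activation overhead (illustrative only).
params = 70e9
bytes_per_param = 2  # fp16/bf16

for tp in (1, 2, 4, 8):
    gib = params * bytes_per_param / tp / 2**30
    print(f"tensor_parallel_size={tp}: ~{gib:.0f} GiB per GPU")
```

At `tensor_parallel_size=4`, weights alone drop to roughly 33 GiB per GPU, which is why a 70B model that cannot load on one 80 GiB card fits comfortably across four.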
## Code Examples
Enable tensor parallelism in vLLM:

```python
from vllm import LLM

# Shard Llama 2 70B across 4 GPUs
model = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=4,
)
```