FP8 Quantization
VRAM: ~50% vs FP16
Speed: ~2x on Hopper/Ada
TL;DR
8-bit floating point format with hardware acceleration on Hopper/Ada GPUs. Near-FP16 quality at ~2x throughput.
Use when
- Have H100 or RTX 40xx GPUs
- Need near-FP16 quality
- Maximum throughput required
Skip when
- Older GPU architecture
- Need 4-bit compression
FP8 (8-bit floating point) is a hardware-accelerated quantization format supported on NVIDIA Hopper (H100) and Ada Lovelace (RTX 40xx) GPUs. It provides significant speedup with minimal quality loss.
Formats
- **E4M3**: 4-bit exponent, 3-bit mantissa (higher precision; recommended for weights)
- **E5M2**: 5-bit exponent, 2-bit mantissa (larger dynamic range; typically used for activations)
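The trade-off between the two formats becomes concrete if you enumerate the values each one can represent. A minimal pure-Python sketch, ignoring the sign bit and the reserved NaN/Inf encodings (`fp8_values` is an illustrative helper, not a real API):

```python
def fp8_values(exp_bits, man_bits, bias):
    """Enumerate the positive finite values of a generic FP8 format
    (sign bit and reserved NaN/Inf encodings ignored for simplicity)."""
    vals = set()
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            if e == 0:                                   # subnormal range
                v = (m / 2 ** man_bits) * 2.0 ** (1 - bias)
            else:                                        # normal range
                v = (1 + m / 2 ** man_bits) * 2.0 ** (e - bias)
            vals.add(v)
    return sorted(vals)

e4m3 = fp8_values(exp_bits=4, man_bits=3, bias=7)
e5m2 = fp8_values(exp_bits=5, man_bits=2, bias=15)

# E5M2 covers a far wider range; E4M3 packs more points into its range:
print(max(e4m3), max(e5m2))                  # 480.0 114688.0
print(len([v for v in e4m3 if 1 <= v < 2]))  # 8 steps per octave (3 mantissa bits)
print(len([v for v in e5m2 if 1 <= v < 2]))  # 4 steps per octave (2 mantissa bits)
```

Note that the hardware E4M3 variant (`e4m3fn`) tops out at 448 rather than 480, because the all-ones mantissa at the top exponent is reserved for NaN.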
Key Benefits
- **Hardware Acceleration**: Native Tensor Core support
- **Quality**: Much better than INT8 for LLMs
- **Performance**: ~2x throughput vs FP16
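The quality gap versus INT8 comes from the integer grid's uniform step size: with per-tensor scaling, a single large outlier stretches the grid until small values round to zero, whereas FP8's exponent bits keep the relative error roughly constant across magnitudes. A hedged sketch of the INT8 failure mode (`int8_round` is an illustrative helper, not a library call):

```python
def int8_round(x, max_abs):
    """Symmetric per-tensor INT8 quantization: one uniform step
    (max_abs / 127) across the entire value range."""
    step = max_abs / 127
    return round(x / step) * step

# With a well-behaved tensor, small values survive:
int8_round(0.3, max_abs=1.0)     # ~0.299

# One outlier of magnitude 100 stretches the grid, and the same value
# collapses to zero (100% error). An E4M3 value near 0.3 would still
# carry 3 mantissa bits, i.e. ~6% worst-case relative rounding error.
int8_round(0.3, max_abs=100.0)   # 0.0
```

This is why outlier-heavy LLM activations quantize much more gracefully to FP8 than to INT8.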
Code Examples
Enable FP8 in vLLM

```python
from vllm import LLM

# FP8 requires hardware support: NVIDIA Hopper (H100) or Ada Lovelace (RTX 40xx).
model = LLM(
    model="meta-llama/Llama-2-7b-hf",
    quantization="fp8",
)
```