FP8 Quantization

VRAM: ~50% of FP16
Speed: ~2x on Hopper/Ada

TL;DR

An 8-bit floating-point format with hardware acceleration on NVIDIA Hopper and Ada GPUs. Near-FP16 quality at roughly 2x the throughput.

Use when

  • Have H100 or RTX 40xx GPUs
  • Need near-FP16 quality
  • Maximum throughput required

Skip when

  • Older GPU architecture
  • Need 4-bit compression

FP8 (8-bit floating point) is a hardware-accelerated quantization format supported on NVIDIA Hopper (H100) and Ada Lovelace (RTX 40xx) GPUs. It provides significant speedup with minimal quality loss.
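Most FP8 schemes pair the 8-bit values with a per-tensor scale factor that maps the tensor's largest magnitude onto the format's largest representable value. A minimal sketch of that scaling step (the function name and sample weights here are illustrative, not vLLM's internals):

```python
# Per-tensor scaling sketch for FP8 (E4M3), whose largest finite value is 448.
E4M3_MAX = 448.0

def fp8_scale(values):
    """Scale factor so that values / scale fits within [-E4M3_MAX, E4M3_MAX]."""
    amax = max(abs(v) for v in values)
    return amax / E4M3_MAX if amax > 0 else 1.0

weights = [0.12, -3.5, 0.9, 2.2]   # toy example tensor
s = fp8_scale(weights)
scaled = [v / s for v in weights]  # ready to round to the FP8 grid
print(s, max(abs(v) for v in scaled))  # -> 0.0078125 448.0
```

The dequantization step simply multiplies back by the stored scale, so the quality cost is only the rounding to FP8's coarse grid.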

Formats

- **E4M3**: 4-bit exponent, 3-bit mantissa (higher precision; recommended for weights)
- **E5M2**: 5-bit exponent, 2-bit mantissa (wider dynamic range; used for activations)
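The precision-vs-range trade-off between the two formats is easy to see by computing each format's largest finite value from its bit layout. A small sketch, following the OCP FP8 convention that E5M2 reserves its top exponent for inf/NaN (IEEE-style) while E4M3 only reserves the all-ones mantissa at the top exponent:

```python
def max_finite(exp_bits: int, man_bits: int, bias: int, ieee_special_exp: bool) -> float:
    """Largest finite value of a simple IEEE-like float format.

    ieee_special_exp=True: the whole top exponent encodes inf/NaN (E5M2).
    ieee_special_exp=False: only the all-ones mantissa at the top exponent
    is NaN, so the next-largest mantissa is still a finite value (E4M3).
    """
    max_exp = (2 ** exp_bits - 1) - bias        # unbiased all-ones exponent
    if ieee_special_exp:
        max_exp -= 1                            # top exponent reserved entirely
        mantissa = 2 - 2 ** (-man_bits)         # all-ones mantissa is usable
    else:
        mantissa = 2 - 2 ** (-man_bits + 1)     # all-ones mantissa is NaN
    return mantissa * 2 ** max_exp

print(max_finite(4, 3, 7, ieee_special_exp=False))  # E4M3 -> 448.0
print(max_finite(5, 2, 15, ieee_special_exp=True))  # E5M2 -> 57344.0
```

E5M2's maximum (57344) is 128x larger than E4M3's (448), which is why the wider-range format suits activations while the higher-precision one suits weights.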

Key Benefits

- **Hardware Acceleration**: Native Tensor Core support
- **Quality**: Much better than INT8 for LLMs
- **Performance**: ~2x throughput vs FP16
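The ~50% VRAM figure follows directly from storing one byte per weight instead of two. A back-of-the-envelope check for a 7B-parameter model (weights only; KV cache and activations add to this):

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

print(weight_memory_gb(7e9, 2))  # FP16: 14.0 GB
print(weight_memory_gb(7e9, 1))  # FP8:   7.0 GB
```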

Code Examples

Enable FP8 in vLLM

```python
from vllm import LLM

# FP8 requires an H100 (Hopper) or RTX 40xx (Ada) GPU
model = LLM(
    model="meta-llama/Llama-2-7b-hf",
    quantization="fp8",
)
```