Quantization
Reduce model precision to save VRAM and increase throughput
Quantization is one of the most impactful optimizations for reducing LLM memory requirements. By converting model weights from 16-bit floating point to 8-bit or 4-bit representations, you can cut weight memory by roughly 2-4x with minimal quality loss. Modern quantization methods like GPTQ, AWQ, and GGUF preserve model accuracy while dramatically reducing hardware costs. Whether you're deploying on consumer GPUs or scaling in the cloud, quantization makes configurations viable that would otherwise not fit in memory.
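As a back-of-the-envelope check, weight memory scales linearly with bits per parameter. The sketch below is plain arithmetic, assuming a hypothetical 7B-parameter model and ignoring activations, KV cache, and quantization overhead such as scales and zero-points.

```python
# Rough weight-memory estimate: bytes = parameters * bits_per_weight / 8.
# Hypothetical 7B-parameter model; activations, KV cache, and quantization
# metadata (scales, zero-points) are ignored for simplicity.
params = 7_000_000_000

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gib = params * bits / 8 / 1024**3
    print(f"{label}: ~{gib:.1f} GiB of weights")

# Prints roughly 13.0, 6.5, and 3.3 GiB -- the 2-4x reduction described above.
```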
GPTQ
Post-training quantization using approximate second-order information. Enables 4-bit inference with minimal quality loss.
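As a rough illustration, Hugging Face Transformers exposes GPTQ through GPTQConfig. The sketch below assumes transformers, optimum, and a GPTQ backend such as auto-gptq are installed, and uses facebook/opt-125m purely as a small placeholder model.

```python
# Sketch: 4-bit GPTQ quantization of a small model via Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ needs a small calibration set to estimate the second-order statistics
# it uses when rounding weights; "c4" samples calibration text from C4.
quant_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quant_config,  # quantization happens during loading
)
model.save_pretrained("opt-125m-gptq-4bit")
```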
AWQ
Activation-aware weight quantization that protects salient weights. Often faster than GPTQ with similar quality.
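A minimal sketch with the AutoAWQ library, assuming autoawq and transformers are installed; the model ID, output path, and quantization settings shown are typical defaults, not requirements.

```python
# Sketch: 4-bit AWQ quantization with AutoAWQ.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "facebook/opt-125m"   # placeholder source model
quant_path = "opt-125m-awq-4bit"   # placeholder output directory

# Common AWQ settings: 4-bit weights, group size 128, zero-point enabled.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Calibration pass: AWQ inspects activation magnitudes to decide which
# weight channels to protect (scale) before quantizing the rest.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```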
GGUF / llama.cpp Quantization
Quantization format used by llama.cpp, with precision options ranging from Q2 to Q8 (e.g. Q4_K_M, Q5_K_M, Q8_0). Runs on CPUs and, via layer offload, on GPUs, so it works on virtually any hardware.
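Once a GGUF file has been produced (for example with llama.cpp's conversion and quantization tools), it can be run from Python via llama-cpp-python. The sketch below assumes llama-cpp-python is installed; the Q4_K_M file path is a placeholder for a model you have already downloaded.

```python
# Sketch: running a pre-quantized GGUF model with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",  # placeholder 4-bit K-quant file
    n_ctx=2048,        # context window
    n_gpu_layers=-1,   # offload all layers to a GPU if present; 0 = CPU only
)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```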
bitsandbytes (QLoRA)
On-the-fly 8-bit and 4-bit quantization applied at model load time, usable for both inference and training. Enables 4-bit QLoRA fine-tuning on consumer GPUs.
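A minimal QLoRA-style setup, assuming transformers, bitsandbytes, and peft are installed; the model ID and LoRA hyperparameters are illustrative placeholders.

```python
# Sketch: load a base model in 4-bit NF4 with bitsandbytes, then attach
# LoRA adapters so only small adapter matrices are trained (QLoRA).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the QLoRA data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",                    # placeholder base model
    quantization_config=bnb_config,
    device_map="auto",
)

# Freeze the 4-bit base weights and train only the LoRA matrices.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```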
FP8 Quantization
8-bit floating point format with hardware acceleration on Hopper and Ada Lovelace GPUs. Near-FP16 quality with up to roughly 2x throughput.
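One way to try FP8 is vLLM's on-the-fly weight quantization; the sketch below assumes vLLM is installed on an FP8-capable GPU and uses a placeholder model ID.

```python
# Sketch: serving with dynamic FP8 weight quantization in vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model ID
    quantization="fp8",                        # quantize weights to FP8 at load time
)

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["What does FP8 quantization change?"], params)
print(outputs[0].outputs[0].text)
```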