Quantization
Reduce model precision to save VRAM and increase throughput
Quantization is one of the most impactful optimizations for reducing LLM memory requirements. By converting model weights from 16-bit floating point to 8-bit or 4-bit representations, you can cut weight memory by roughly 2-4x with minimal quality loss. Modern quantization methods like GPTQ, AWQ, and GGUF preserve model accuracy while dramatically reducing hardware costs. Whether you're deploying on consumer GPUs or scaling in the cloud, quantization makes configurations viable that would otherwise not fit in memory.
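As a back-of-the-envelope check, weight memory scales linearly with bits per parameter. The sketch below is plain arithmetic, assuming a hypothetical 7B-parameter model and ignoring activations, KV cache, and quantization overhead such as scales and zero-points.

```python
# Rough weight-memory estimate: bytes = parameters * bits_per_weight / 8.
# Hypothetical 7B-parameter model; activations, KV cache, and quantization
# metadata (scales, zero-points) are ignored for simplicity.
params = 7_000_000_000

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gib = params * bits / 8 / 1024**3
    print(f"{label}: ~{gib:.1f} GiB of weights")

# Prints roughly 13.0, 6.5, and 3.3 GiB -- the 2-4x reduction described above.
```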
GPTQ
Post-training quantization using approximate second-order information. Enables 4-bit inference with minimal quality loss.
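As a rough illustration, Hugging Face Transformers exposes GPTQ through GPTQConfig. The sketch below assumes transformers, optimum, and a GPTQ backend such as auto-gptq are installed, and uses facebook/opt-125m purely as a small placeholder model.

```python
# Sketch: 4-bit GPTQ quantization of a small model via Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ needs a small calibration set to estimate the second-order statistics
# it uses when rounding weights; "c4" samples calibration text from C4.
quant_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quant_config,  # quantization happens during loading
)
model.save_pretrained("opt-125m-gptq-4bit")
```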
AWQ
Activation-aware weight quantization that protects salient weights. Often faster than GPTQ with similar quality.
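A minimal sketch with the AutoAWQ library, assuming autoawq and transformers are installed; the model ID, output path, and quantization settings shown are typical defaults, not requirements.

```python
# Sketch: 4-bit AWQ quantization with AutoAWQ.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "facebook/opt-125m"   # placeholder source model
quant_path = "opt-125m-awq-4bit"   # placeholder output directory

# Common AWQ settings: 4-bit weights, group size 128, zero-point enabled.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Calibration pass: AWQ inspects activation magnitudes to decide which
# weight channels to protect (scale) before quantizing the rest.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```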
GGUF / llama.cpp Quantization
Quantization format used by llama.cpp, with precision options ranging from Q2 to Q8 (e.g. Q4_K_M, Q5_K_M, Q8_0). Runs on CPUs and, via layer offload, on GPUs, so it works on virtually any hardware.
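Once a GGUF file has been produced (for example with llama.cpp's conversion and quantization tools), it can be run from Python via llama-cpp-python. The sketch below assumes llama-cpp-python is installed; the Q4_K_M file path is a placeholder for a model you have already downloaded.

```python
# Sketch: running a pre-quantized GGUF model with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",  # placeholder 4-bit K-quant file
    n_ctx=2048,        # context window
    n_gpu_layers=-1,   # offload all layers to a GPU if present; 0 = CPU only
)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```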
bitsandbytes (QLoRA)
On-the-fly 8-bit and 4-bit quantization applied at model load time, usable for both inference and training. Enables 4-bit QLoRA fine-tuning on consumer GPUs.
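A minimal QLoRA-style setup, assuming transformers, bitsandbytes, and peft are installed; the model ID and LoRA hyperparameters are illustrative placeholders.

```python
# Sketch: load a base model in 4-bit NF4 with bitsandbytes, then attach
# LoRA adapters so only small adapter matrices are trained (QLoRA).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the QLoRA data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",                    # placeholder base model
    quantization_config=bnb_config,
    device_map="auto",
)

# Freeze the 4-bit base weights and train only the LoRA matrices.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```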
FP8 Quantization
8-bit floating point format with hardware acceleration on Hopper and Ada Lovelace GPUs. Near-FP16 quality with up to roughly 2x throughput.
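One way to try FP8 is vLLM's on-the-fly weight quantization; the sketch below assumes vLLM is installed on an FP8-capable GPU and uses a placeholder model ID.

```python
# Sketch: serving with dynamic FP8 weight quantization in vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model ID
    quantization="fp8",                        # quantize weights to FP8 at load time
)

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["What does FP8 quantization change?"], params)
print(outputs[0].outputs[0].text)
```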