GPTQ
TL;DR
Post-training quantization using approximate second-order information. Enables 4-bit inference with minimal quality loss.
Use when
- Running on NVIDIA GPUs
- Need 4-bit quantization
- Using vLLM or ExLlama
Skip when
- Need CPU inference
- Model not available in GPTQ format
GPTQ (Generative Pre-trained Transformer Quantization) is a one-shot weight quantization method based on approximate second-order information. It quantizes weights to 3-4 bits while maintaining accuracy close to the full-precision model.
How It Works
GPTQ quantizes weights layer by layer, working through each weight matrix column by column. Using the inverse Hessian of the layer's reconstruction error (estimated from a small calibration set), it rounds each weight and then redistributes the resulting rounding error onto the not-yet-quantized weights, so later columns compensate for earlier quantization error.
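The quantize-and-compensate loop can be sketched for a single weight row as below. This is a simplified illustration with hypothetical helper names: real GPTQ updates the inverse Hessian as it goes (via a Cholesky factorization), works in blocks, and uses per-group scales.

```python
import numpy as np

def gptq_quantize_row(w, H_inv, bits=4):
    """Greedily quantize one weight row, spreading each rounding error
    onto the not-yet-quantized weights via the inverse Hessian.
    Simplified sketch of the GPTQ inner loop."""
    w = w.astype(np.float64).copy()
    # one symmetric uniform scale per row (illustrative choice)
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    q = np.zeros_like(w)
    for i in range(w.size):
        q[i] = np.round(w[i] / scale) * scale      # quantize weight i
        err = (w[i] - q[i]) / H_inv[i, i]          # scaled rounding error
        w[i + 1:] -= err * H_inv[i, i + 1:]        # compensate later weights
    return q
```

Compared with plain round-to-nearest, the compensation step keeps the layer output `X @ q` closer to `X @ w` on the calibration data, which is exactly the objective GPTQ minimizes.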
Key Benefits
- **Memory Reduction**: 4-bit weights reduce model size by ~75% compared to FP16
- **Speed**: Faster inference due to reduced memory bandwidth
- **Quality**: Maintains perplexity close to the FP16 baseline
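The memory claim is easy to check with back-of-envelope arithmetic. The group size of 128 and fp16 scales below are typical GPTQ settings, assumed here for illustration:

```python
params = 7e9                  # roughly the weight count of Llama-2-7B
fp16_bytes = params * 2       # 2 bytes per weight
int4_bytes = params * 0.5     # two 4-bit weights per byte
# per-group metadata: one fp16 scale + one 4-bit zero point per 128 weights
meta_bytes = (params / 128) * 2.5
ratio = (int4_bytes + meta_bytes) / fp16_bytes
print(f"{fp16_bytes / 2**30:.1f} GiB -> "
      f"{(int4_bytes + meta_bytes) / 2**30:.1f} GiB ({ratio:.0%})")
```

So even with quantization metadata included, a 4-bit checkpoint lands around a quarter of the fp16 size.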
Code Examples
```python
from vllm import LLM

# Load a GPTQ-quantized checkpoint with vLLM
model = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
    dtype="float16",
)
```
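GPTQ checkpoints store the quantized weights packed, conventionally eight 4-bit values per 32-bit word. A self-contained sketch of that packing scheme (illustrative only; real loaders also handle scales, zero points, and group indices):

```python
import numpy as np

def pack_int4(q):
    """Pack 4-bit values (0..15) into 32-bit words, eight per word."""
    q = np.asarray(q, dtype=np.uint32)
    assert q.size % 8 == 0, "pad to a multiple of 8 first"
    words = q.reshape(-1, 8)
    packed = np.zeros(len(words), dtype=np.uint32)
    for j in range(8):
        packed |= words[:, j] << (4 * j)   # value j goes into bits 4j..4j+3
    return packed

def unpack_int4(packed):
    """Inverse of pack_int4."""
    out = np.empty(packed.size * 8, dtype=np.uint8)
    for j in range(8):
        out[j::8] = (packed >> (4 * j)) & 0xF
    return out
```

This 8x packing is why GPTQ weight tensors in a checkpoint appear with one-eighth the expected first dimension.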