GPTQ

VRAM: ~75% reduction vs FP16
Speed: 1.5-3x (memory-bound)
Quality: <1% perplexity increase

TL;DR

Post-training quantization using approximate second-order information. Enables 4-bit inference with minimal quality loss.

Use when

  • Running on NVIDIA GPUs
  • Need 4-bit quantization
  • Using vLLM or ExLlama

Skip when

  • Need CPU inference
  • Model not available in GPTQ format

GPTQ (Generative Pre-trained Transformer Quantization) is a one-shot weight quantization method based on approximate second-order information. It quantizes weights to 3-4 bits while maintaining accuracy close to the full-precision model.

How It Works

GPTQ quantizes weights layer by layer, and within each layer one weight column at a time. It uses the inverse Hessian of the layer's inputs (estimated from a small calibration set) to decide how to round each column, then updates the remaining unquantized weights to compensate for the error just introduced.
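The per-layer procedure can be sketched in plain NumPy. This is a deliberately simplified illustration, not the real implementation: it quantizes columns in natural order, skips GPTQ's Cholesky-based updates of the inverse Hessian and its lazy batched weight updates, and the function name `quantize_layer` and the damping constant are assumptions made for this sketch.

```python
import numpy as np

def quantize_layer(W, X, bits=4, damp=0.01):
    """Quantize W (out_features x in_features) column by column,
    propagating each column's quantization error to the remaining
    columns via the inverse Hessian H^-1, where H = X X^T is built
    from calibration activations X (in_features x n_samples)."""
    W = W.astype(np.float64).copy()
    H = X @ X.T
    # Dampen the diagonal so H is well-conditioned and invertible.
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])
    Hinv = np.linalg.inv(H)

    qmax = 2 ** (bits - 1) - 1
    # Per-row symmetric scale (real GPTQ typically uses per-group scales).
    scale = np.abs(W).max(axis=1) / qmax
    Q = np.zeros_like(W)

    for j in range(W.shape[1]):
        w = W[:, j]
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)
        Q[:, j] = q
        # Error-compensation step: shift not-yet-quantized columns
        # to absorb the rounding error of column j.
        err = (w - q * scale) / Hinv[j, j]
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q, scale
```

Dequantizing with `Q * scale[:, None]` recovers an approximation of the original weights; the compensation step is what lets GPTQ track the layer's *output* rather than just each weight in isolation.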

Key Benefits

- **Memory Reduction**: 4-bit weights reduce model size by ~75% compared to FP16
- **Speed**: Faster inference due to reduced memory bandwidth
- **Quality**: Maintains perplexity close to the FP16 baseline
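As a back-of-the-envelope check of the ~75% figure (the 7B parameter count is illustrative, and the arithmetic ignores per-group scale/zero-point overhead, which adds a few percent in practice):

```python
params = 7e9                 # a 7B-parameter model (illustrative)
fp16_bytes = params * 2      # 16 bits = 2 bytes per weight
int4_bytes = params * 0.5    # 4 bits = 0.5 bytes per weight

print(f"FP16: {fp16_bytes / 2**30:.1f} GiB")            # ~13.0 GiB
print(f"INT4: {int4_bytes / 2**30:.1f} GiB")            # ~3.3 GiB
print(f"Reduction: {1 - int4_bytes / fp16_bytes:.0%}")  # 75%
```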

Code Examples

Load a GPTQ model with vLLM:

```python
from vllm import LLM

model = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
    dtype="float16",
)
```