AWQ

VRAM: ~25% of FP16 (roughly 75% savings on weights)
Speed: 1.5-4x vs FP16

TL;DR

Activation-aware weight quantization that protects salient weights. Often faster than GPTQ with similar quality.

Use when

  • Need fast 4-bit inference on NVIDIA GPUs
  • Using vLLM or TensorRT-LLM

Skip when

  • Need CPU inference
  • Running on Apple Silicon

AWQ (Activation-aware Weight Quantization) observes activation patterns during calibration to identify and protect important weights. It achieves 4-bit quantization with better inference speed than GPTQ.

How It Works

AWQ identifies salient weight channels by analyzing activation magnitudes during calibration. Important weights are scaled to reduce quantization error, then the model is quantized to 4 bits.
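The scale-then-quantize step described above can be sketched with a toy example. This is an illustrative simplification, not the paper's exact algorithm: the names (`W`, `X`, `alpha`, `s`) are made up here, and a real AWQ implementation searches per layer for the scaling exponent that minimizes output error rather than fixing it.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))                               # [out, in] weight matrix
X = rng.normal(size=(64, 16)) * np.linspace(0.1, 4.0, 16)  # calibration activations

# 1. Salience: mean absolute activation magnitude per input channel
act_mag = np.abs(X).mean(axis=0)

# 2. Scale salient channels up before quantization
#    (alpha is a tunable exponent; AWQ searches over values like this)
alpha = 0.5
s = act_mag ** alpha
W_scaled = W * s                                           # scale input channels

# 3. Quantize scaled weights to 4-bit, symmetric per output channel
qmax = 7
step = np.abs(W_scaled).max(axis=1, keepdims=True) / qmax
Wq = np.clip(np.round(W_scaled / step), -8, 7)

# 4. Dequantize and fold the scales back out
#    (at inference, s can be fused into the preceding operation)
W_hat = (Wq * step) / s

# Salient (high-activation) channels now see less quantization error,
# because scaling them up gave them finer effective resolution
err = np.abs(W_hat - W).mean()
```

The key idea this illustrates: multiplying a channel's weights by `s` before quantization (and dividing after) shrinks its relative rounding error, at the cost of slightly more error on low-salience channels.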

Key Benefits

- **Speed**: Typically faster than GPTQ due to simpler dequantization
- **Quality**: Comparable or better than GPTQ for most models
- **Compatibility**: Wide framework support
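The VRAM figure above follows from simple arithmetic: 4-bit weights occupy a quarter of the space of 16-bit weights. A back-of-envelope check for a 7B-parameter model (weights only; activations, KV cache, and AWQ's group-wise scale overhead are not counted, so real checkpoints are slightly larger):

```python
# Weight-only memory estimate for a 7B-parameter model
params = 7_000_000_000

fp16_gb = params * 2 / 1e9    # FP16: 2 bytes per weight
awq_gb = params * 0.5 / 1e9   # 4-bit: 0.5 bytes per weight
savings = 1 - awq_gb / fp16_gb

print(f"FP16: {fp16_gb:.1f} GB, AWQ 4-bit: {awq_gb:.1f} GB, savings: {savings:.0%}")
# prints: FP16: 14.0 GB, AWQ 4-bit: 3.5 GB, savings: 75%
```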

Code Examples

Load an AWQ model with vLLM:

```python
from vllm import LLM

# vLLM auto-detects AWQ checkpoints; passing quantization="awq" makes it explicit
model = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
)
```