AWQ
VRAM: ~75% reduction vs FP16
Speed: 1.5-4x vs FP16
TL;DR
Activation-aware weight quantization that protects salient weights. Often faster than GPTQ with similar quality.
Use when
- Need fast 4-bit inference on NVIDIA GPUs
- Using vLLM or TensorRT-LLM
Skip when
- Need CPU inference
- Running on Apple Silicon
AWQ (Activation-aware Weight Quantization) observes activation patterns on a calibration set to identify and protect the weights that matter most. It achieves 4-bit quantization with quality comparable to GPTQ and typically faster inference.
How It Works
AWQ identifies salient weight channels by analyzing activation magnitudes on a small calibration set. Those channels are scaled up before quantization, with the inverse scale folded into the preceding operation, which reduces their quantization error; the model is then quantized to 4 bits with simple round-to-nearest.
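The idea above can be sketched in a few lines of plain Python. This is an illustrative toy, not the official AWQ implementation: the helper names are hypothetical, the per-channel scale uses a fixed exponent `alpha` where real AWQ searches for the best value, and real implementations quantize in groups with packed 4-bit storage.

```python
# Toy sketch of activation-aware scaling before 4-bit quantization.
# Hypothetical helpers for illustration only -- not the AWQ library API.

def quantize_rtn(w, n_bits=4):
    """Symmetric round-to-nearest quantization of one row of weights."""
    qmax = 2 ** (n_bits - 1) - 1              # 7 levels each side for 4-bit
    scale = max(abs(v) for v in w) / qmax or 1.0
    return [round(v / scale) * scale for v in w]

def awq_quantize(weights, act_magnitudes, alpha=0.5, n_bits=4):
    """weights: rows = output channels, columns = input channels.
    act_magnitudes: mean |activation| per input channel, from calibration.
    Salient channels (large activations) get scaled up so rounding
    preserves them; the inverse scale is applied after quantization
    (in practice it is folded into the preceding op)."""
    s = [m ** alpha for m in act_magnitudes]   # per-channel scale s_j
    scaled = [[w * s[j] for j, w in enumerate(row)] for row in weights]
    q = [quantize_rtn(row, n_bits) for row in scaled]
    return [[w / s[j] for j, w in enumerate(row)] for row in q]
```

With a small weight on a high-activation channel, plain round-to-nearest rounds it to zero, while the activation-aware scaling preserves it, which is exactly the failure mode AWQ targets.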
Key Benefits
- **Speed**: Typically faster than GPTQ due to simpler dequantization
- **Quality**: Comparable or better than GPTQ for most models
- **Compatibility**: Wide framework support
Code Examples
Load AWQ model with vLLM:

```python
from vllm import LLM, SamplingParams

# Load a pre-quantized AWQ checkpoint; vLLM can detect the
# quantization method from the model config, but passing
# quantization="awq" makes it explicit.
model = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
)

outputs = model.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```