GGUF / llama.cpp Quantization
VRAM savings: 50-87%, depending on quant level
TL;DR
Quantized model file format for llama.cpp with multiple precision options (Q2 through Q8). Runs on CPU, NVIDIA and AMD GPUs, and Apple Silicon.
Use when
- Running on CPU
- Need Apple Silicon support
- Using Ollama or llama.cpp
Skip when
- Need maximum GPU throughput
- Using vLLM in production
GGUF is the successor to GGML: a flexible single-file format for quantized models, designed for llama.cpp. It supports quantization levels ranging from 2-bit to 8-bit.
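The single-file layout starts with a small fixed header: the magic bytes `GGUF`, a little-endian version number, and tensor/metadata counts. A minimal sketch of parsing that header (the counts in the synthetic example are illustrative, not from a real model):

```python
import struct

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header: 4-byte magic, uint32 version,
    uint64 tensor count, uint64 metadata key-value count (little-endian)."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensor_count": n_tensors,
            "metadata_kv_count": n_kv}

# Synthetic header for illustration: version 3, 291 tensors, 24 metadata keys.
header = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(read_gguf_header(header))
```

Everything after this header (metadata key-value pairs, then tensor info and tensor data) is self-describing, which is what lets one `.gguf` file carry the model, tokenizer, and hyperparameters together.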
Quantization Levels
- **Q2_K, Q3_K**: Extreme compression, noticeable quality loss
- **Q4_K_M**: Good balance of size and quality (recommended)
- **Q5_K_M**: Higher quality, larger size
- **Q6_K, Q8_0**: Near-original quality
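To pick a level, it helps to estimate file size as parameters × bits-per-weight / 8. A rough sketch; the bits-per-weight figures below are approximations I am assuming (K-quants mix block types, so effective bpw exceeds the nominal bit width), not exact values from llama.cpp:

```python
# Approximate effective bits per weight for common GGUF quant levels.
# These numbers are illustrative assumptions, not exact llama.cpp figures.
BPW = {
    "Q2_K": 2.6, "Q3_K_M": 3.9, "Q4_K_M": 4.8,
    "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.0,
}

def estimated_size_gb(n_params: float, quant: str) -> float:
    """Estimated model file size in GB: params * bits-per-weight / 8."""
    return n_params * BPW[quant] / 8 / 1e9

for q in ("F16", "Q8_0", "Q4_K_M", "Q2_K"):
    print(f"7B @ {q}: ~{estimated_size_gb(7e9, q):.1f} GB")
```

Real files also include metadata and some higher-precision tensors, so treat the result as a lower bound when checking whether a model fits in RAM or VRAM.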
Key Benefits
- **Universal**: Works on CPU, NVIDIA, AMD, Apple Silicon
- **Flexible**: Choose quality/size tradeoff
- **No GPU Required**: Full CPU inference support
Code Examples
Run GGUF with llama.cpp

```bash
./llama-cli -m llama-2-7b.Q4_K_M.gguf \
  -p "Hello, world" \
  -n 128 --ctx-size 4096
```