GGUF / llama.cpp Quantization
VRAM savings: 50-87%, depending on quant level
TL;DR
Quantized model file format for llama.cpp with multiple precision options (Q2 through Q8). Runs on CPU, NVIDIA and AMD GPUs, and Apple Silicon.
Use when
- Running on CPU
- Need Apple Silicon support
- Using Ollama or llama.cpp
Skip when
- Need maximum GPU throughput
- Using vLLM in production
GGUF is the successor to GGML: a flexible single-file format for quantized models, designed for llama.cpp. It supports quantization levels ranging from 2-bit to 8-bit.
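The single-file layout starts with a small fixed header: the magic bytes `GGUF`, a little-endian version number, and tensor/metadata counts. A minimal sketch of parsing that header (the counts in the synthetic example are illustrative, not from a real model):

```python
import struct

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header: 4-byte magic, uint32 version,
    uint64 tensor count, uint64 metadata key-value count (little-endian)."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensor_count": n_tensors,
            "metadata_kv_count": n_kv}

# Synthetic header for illustration: version 3, 291 tensors, 24 metadata keys.
header = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(read_gguf_header(header))
```

Everything after this header (metadata key-value pairs, then tensor info and tensor data) is self-describing, which is what lets one `.gguf` file carry the model, tokenizer, and hyperparameters together.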
Quantization Levels
- **Q2_K, Q3_K**: Extreme compression, noticeable quality loss
- **Q4_K_M**: Good balance of size and quality (recommended)
- **Q5_K_M**: Higher quality, larger size
- **Q6_K, Q8_0**: Near-original quality
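To pick a level, it helps to estimate file size as parameters × bits-per-weight / 8. A rough sketch; the bits-per-weight figures below are approximations I am assuming (K-quants mix block types, so effective bpw exceeds the nominal bit width), not exact values from llama.cpp:

```python
# Approximate effective bits per weight for common GGUF quant levels.
# These numbers are illustrative assumptions, not exact llama.cpp figures.
BPW = {
    "Q2_K": 2.6, "Q3_K_M": 3.9, "Q4_K_M": 4.8,
    "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.0,
}

def estimated_size_gb(n_params: float, quant: str) -> float:
    """Estimated model file size in GB: params * bits-per-weight / 8."""
    return n_params * BPW[quant] / 8 / 1e9

for q in ("F16", "Q8_0", "Q4_K_M", "Q2_K"):
    print(f"7B @ {q}: ~{estimated_size_gb(7e9, q):.1f} GB")
```

Real files also include metadata and some higher-precision tensors, so treat the result as a lower bound when checking whether a model fits in RAM or VRAM.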
Key Benefits
- **Universal**: Works on CPU, NVIDIA, AMD, Apple Silicon
- **Flexible**: Choose quality/size tradeoff
- **No GPU Required**: Full CPU inference support
Code Examples
Run GGUF with llama.cpp

```bash
./llama-cli -m llama-2-7b.Q4_K_M.gguf \
  -p "Hello, world" \
  -n 128 --ctx-size 4096
```