# llama.cpp

## TL;DR

CPU-first inference with the GGUF model format. Runs anywhere: CPU, NVIDIA, AMD, Apple Silicon.
## Use when

- Local deployment
- CPU inference
- Apple Silicon
- Edge devices
## Skip when

- Maximum GPU throughput
- Production serving at scale
llama.cpp is a C/C++ implementation of LLM inference optimized for CPU and edge deployment. It supports multiple hardware backends and a range of GGUF quantization formats.
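Building from source is a standard CMake workflow; a minimal sketch (the backend flags shown are optional and depend on your hardware):

```bash
# Clone and build (CPU-only by default)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Optional: enable a GPU backend at configure time, e.g.
#   cmake -B build -DGGML_CUDA=ON    # NVIDIA CUDA
#   cmake -B build -DGGML_VULKAN=ON  # Vulkan
# (Metal is enabled by default on Apple Silicon)
```

The resulting binaries (`llama-cli`, `llama-server`, `llama-quantize`, etc.) land under `build/bin/`.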
## Features

- **Universal**: CPU, CUDA, ROCm, Metal, Vulkan
- **GGUF format**: Flexible quantization (Q2–Q8)
- **Low memory**: Efficient on consumer hardware
- **CLI & server**: Multiple interfaces
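Quantization is where the low-memory claim comes from: a Q4_K_M 7B model needs roughly 4 GB versus ~13 GB for FP16. A sketch of producing a quantized GGUF, assuming a built llama.cpp tree and a local Hugging Face checkpoint directory (the paths are illustrative):

```bash
# Convert a Hugging Face checkpoint to GGUF (FP16)
python convert_hf_to_gguf.py ./llama-2-7b --outfile llama-2-7b-f16.gguf

# Quantize the FP16 GGUF down to Q4_K_M
./llama-quantize llama-2-7b-f16.gguf llama-2-7b.Q4_K_M.gguf Q4_K_M
```

Lower quant levels (Q2, Q3) shrink the file further at a visible quality cost; Q4_K_M and Q5_K_M are common middle grounds.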
## Best For

- Local/edge deployment
- Consumer hardware
- Apple Silicon
## Code Examples

Run inference with llama.cpp:

```bash
# Interactive mode
./llama-cli -m llama-2-7b.Q4_K_M.gguf \
  -p "Hello" -n 100 -i

# Server mode
./llama-server -m model.gguf --port 8080
```
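`llama-server` exposes an OpenAI-compatible HTTP API, so a server started as above can be queried with plain `curl` (the prompt here is illustrative):

```bash
# Chat completion against a running llama-server on port 8080
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello"}
    ],
    "max_tokens": 100
  }'
```

Because the API is OpenAI-compatible, existing OpenAI client libraries can also point at the server by overriding the base URL.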