llama.cpp

TL;DR

CPU-first LLM inference using the GGUF model format. Runs on CPUs, NVIDIA and AMD GPUs, and Apple Silicon.

Use when

  • +Local deployment
  • +CPU inference
  • +Apple Silicon
  • +Edge devices

Skip when

  • -Maximum GPU throughput
  • -Production serving at scale

llama.cpp is a C++ implementation of LLM inference optimized for CPU and edge deployment. It supports multiple backends (CUDA, ROCm, Metal, Vulkan, in addition to plain CPU) and GGUF quantization levels from Q2 to Q8.

Features

- **Universal**: CPU, CUDA, ROCm, Metal, Vulkan
- **GGUF Format**: Flexible quantization (Q2-Q8)
- **Low Memory**: Efficient for consumer hardware
- **CLI & Server**: Multiple interfaces
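
To make the quantization and low-memory points concrete, here is a rough sketch of how weight size scales with bits per weight. It assumes the legacy Q4_0 and Q8_0 block layouts (32 weights plus one 16-bit scale per block, which works out to 4.5 and 8.5 bits per weight). The helper name `estimate_gib` and the 7-billion parameter count are illustrative only. Real GGUF files will differ somewhat, since K-quant types mix precisions and files carry metadata, and the KV cache and runtime buffers come on top.

```bash
#!/usr/bin/env bash
# Rough weight-size estimate: parameters * bits-per-weight / 8, reported in GiB.
# estimate_gib is a hypothetical helper, not part of llama.cpp.
estimate_gib() {
    local params="$1" bits_per_weight="$2"
    awk -v p="$params" -v b="$bits_per_weight" \
        'BEGIN { printf "%.2f\n", p * b / 8 / (1024 * 1024 * 1024) }'
}

# Assumed 7-billion-parameter model; bits per weight follow from the block layouts above.
echo "Q4_0: $(estimate_gib 7000000000 4.5) GiB"
echo "Q8_0: $(estimate_gib 7000000000 8.5) GiB"
```

Under these assumptions the 4-bit variant needs a bit more than half the memory of the 8-bit one, which is why lower quantization levels are the usual choice on consumer hardware.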

Best For

- Local/edge deployment
- Consumer hardware
- Apple Silicon

Code Examples

Run inference with llama.cpp (bash):
# Interactive mode
./llama-cli -m llama-2-7b.Q4_K_M.gguf \
    -p "Hello" -n 100 -i

# Server mode
./llama-server -m model.gguf --port 8080
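
Once the server is running, it can be queried over HTTP. The following is a minimal sketch, assuming the server was started on port 8080 as in the command above and that `curl` is installed. It uses the server's `/completion` and `/health` endpoints; the helper names `completion_payload` and `request_completion` and the `BASE_URL` variable are hypothetical and exist only for this example.

```bash
#!/usr/bin/env bash
# Assumes llama-server is already listening on localhost:8080.
BASE_URL="${BASE_URL:-http://localhost:8080}"

# Build the JSON body for the /completion endpoint.
# Deliberately simple: the prompt must not contain quotes or backslashes,
# because no JSON escaping is performed here.
completion_payload() {
    printf '{"prompt": "%s", "n_predict": %d}' "$1" "$2"
}

# Send a completion request and print the raw JSON response.
request_completion() {
    curl -s "$BASE_URL/completion" \
        -H "Content-Type: application/json" \
        -d "$(completion_payload "$1" "$2")"
}

# Usage (requires a running server):
#   curl -s "$BASE_URL/health"        # check that the server is up
#   request_completion "Hello" 100    # generate up to 100 tokens
```

For prompts with arbitrary text, a JSON-aware tool such as `jq` would be a safer way to build the request body than `printf`.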