llama.cpp

TL;DR

CPU-first LLM inference using the GGUF model format. Runs on CPU, NVIDIA and AMD GPUs, and Apple Silicon.

Use when

+Local deployment
+CPU inference
+Apple Silicon
+Edge devices

Skip when

-Maximum GPU throughput
-Production serving at scale

llama.cpp is a C++ implementation of LLM inference optimized for CPU and edge deployment. It supports multiple backends and quantization formats.

Features

- **Universal**: CPU, CUDA, ROCm, Metal, Vulkan
- **GGUF Format**: Flexible quantization (Q2-Q8)
- **Low Memory**: Efficient for consumer hardware
- **CLI & Server**: Multiple interfaces
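
To make the low-memory point concrete, the sketch below estimates a rough lower bound on weight size as parameters × bits / 8. The 7B parameter count and the nominal bit widths are illustrative assumptions; real GGUF quantization types also store per-block scale data, so actual files are somewhat larger than these figures.

```shell
#!/usr/bin/env bash
# Rough lower bound on weight size in GB: (params in billions) * bits / 8.
# estimate_gb is a hypothetical helper, not part of llama.cpp.
estimate_gb() {
    # $1 = parameter count in billions, $2 = nominal bits per weight
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.2f", p * b / 8 }'
}

# Compare an unquantized 16-bit model with nominal 8-, 4-, and 2-bit quantization.
for bits in 16 8 4 2; do
    echo "7B model at ${bits}-bit: ~$(estimate_gb 7 "$bits") GB"
done
```

By this estimate, going from 16-bit to 4-bit weights cuts a 7B model from about 14 GB to about 3.5 GB, which is what brings it within reach of typical consumer RAM.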

Best For

- Local/edge deployment
- Consumer hardware
- Apple Silicon

Code Examples

Run inference with llama.cpp (bash)

# Interactive mode
./llama-cli -m llama-2-7b.Q4_K_M.gguf \
    -p "Hello" -n 100 -i

# Server mode
./llama-server -m model.gguf --port 8080
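
Once the server is running, it can be queried over HTTP. The following is a minimal sketch, assuming the server listens on localhost:8080 and serves its `/health` and `/completion` endpoints; the `LLAMA_URL` variable and the `build_payload` helper are illustrative names, not part of llama.cpp.

```shell
#!/usr/bin/env bash
# Assumption: llama-server was started with --port 8080 on this machine.
BASE_URL="${LLAMA_URL:-http://localhost:8080}"

# Build the JSON body for the /completion endpoint.
# No JSON escaping is done, so keep the prompt free of quotes and backslashes.
build_payload() {
    # $1 = prompt text, $2 = maximum number of tokens to generate
    printf '{"prompt": "%s", "n_predict": %d}' "$1" "$2"
}

PAYLOAD="$(build_payload "Hello" 100)"

# Send the request only if the server answers its health check.
if curl -sf --max-time 2 "$BASE_URL/health" > /dev/null; then
    curl -s "$BASE_URL/completion" \
        -H "Content-Type: application/json" \
        -d "$PAYLOAD"
else
    echo "llama-server not reachable at $BASE_URL" >&2
fi
```

The health check keeps the script from hanging or printing a raw connection error when the model is still loading or the server is not started.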

References

- llama.cpp GitHub repository