# vLLM

**Speed:** 10-24x throughput vs. HuggingFace Transformers
## TL;DR
High-throughput LLM serving with PagedAttention. De-facto standard for production inference.
## Use when

- Production LLM serving
- High throughput needed
- NVIDIA/AMD GPUs
## Skip when

- CPU-only deployment
- Apple Silicon
vLLM is a fast and easy-to-use library for LLM inference and serving. It pioneered PagedAttention and helped popularize continuous batching for high-throughput inference.
## Features

- **PagedAttention**: Efficient KV-cache management
- **Continuous Batching**: Dynamic request handling
- **Quantization**: GPTQ, AWQ, FP8 support
- **Tensor Parallelism**: Multi-GPU inference
- **OpenAI-compatible API**: Drop-in replacement
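
A minimal sketch of how the quantization and tensor-parallelism options are passed to the `LLM` constructor; the checkpoint name and GPU count below are placeholders, so substitute your own AWQ/GPTQ model and hardware layout:

```python
from vllm import LLM, SamplingParams

# Load an AWQ-quantized checkpoint sharded across 2 GPUs.
# "TheBloke/Llama-2-7B-AWQ" is an example checkpoint; any
# AWQ-quantized model on the Hugging Face Hub works the same way.
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",        # also: "gptq", "fp8"
    tensor_parallel_size=2,    # shard weights across 2 GPUs
)

outputs = llm.generate(
    ["Explain PagedAttention in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```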
## Code Examples
**Basic vLLM usage**

```python
from vllm import LLM, SamplingParams

# Load the model (weights are downloaded from the Hugging Face Hub on first use)
llm = LLM(model="meta-llama/Llama-2-7b-hf")

# Generate a completion with sampling parameters
outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(temperature=0.8, max_tokens=100),
)
print(outputs[0].outputs[0].text)
```

**Start an OpenAI-compatible server**

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --port 8000
```
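
Once the server is running, any OpenAI-compatible client can talk to it. A minimal sketch using the official `openai` Python client; the base URL, placeholder API key, and prompt are assumptions, so adjust them to your deployment:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
# vLLM does not require a real API key; any non-empty string works here.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Llama-2-7b-hf",  # must match the --model the server was started with
    prompt="Hello, my name is",
    max_tokens=100,
    temperature=0.8,
)
print(completion.choices[0].text)
```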