vLLM

Speed: 10-24x higher throughput than HuggingFace Transformers

TL;DR

High-throughput LLM serving with PagedAttention. De facto standard for production inference.

Use when

  • Production LLM serving
  • High throughput needed
  • NVIDIA/AMD GPUs

Skip when

  • CPU-only deployment
  • Apple Silicon

vLLM is a fast, easy-to-use library for LLM inference and serving. It pioneered PagedAttention for KV-cache management and combines it with continuous batching to keep GPUs saturated under many concurrent requests.
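
To see why PagedAttention helps, here is a toy, framework-agnostic sketch of the core idea (this is illustrative, not vLLM's internals): the KV cache is split into fixed-size blocks, and each sequence keeps a block table mapping logical positions to physical blocks, so memory grows with actual sequence length instead of being reserved for the maximum length up front.

```python
# Toy illustration of the PagedAttention idea (not vLLM's actual code).
# KV-cache memory is carved into fixed-size blocks; each sequence keeps
# a "block table" of indices into a shared physical pool.

BLOCK_SIZE = 16  # tokens per block (illustrative)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical block ids

    def alloc(self):
        # Any free block will do; blocks need not be contiguous in memory.
        return self.free.pop()

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the current one fills up,
        # so memory tracks the actual sequence length, not a reserved max.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):        # 40 tokens -> only 3 blocks of 16 allocated
    seq.append_token()
print(seq.block_table)     # growth is block-by-block, on demand
```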

Features

- **PagedAttention**: efficient KV-cache management
- **Continuous batching**: dynamic request scheduling
- **Quantization**: GPTQ, AWQ, FP8 support
- **Tensor parallelism**: multi-GPU inference
- **OpenAI-compatible API**: drop-in replacement
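
The quantization and tensor-parallelism entries above map to `LLM` constructor arguments. A minimal sketch follows; `quantization` and `tensor_parallel_size` are real parameters, but the checkpoint name and GPU count here are illustrative assumptions:

```python
from vllm import LLM, SamplingParams

# Sketch: combine quantization with tensor parallelism.
# "TheBloke/Llama-2-7B-AWQ" is an illustrative AWQ checkpoint name;
# tensor_parallel_size=2 assumes two visible GPUs.
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",        # also supported: "gptq", "fp8"
    tensor_parallel_size=2,    # shard the model across 2 GPUs
)
outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=16),
)
print(outputs[0].outputs[0].text)
```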

Code Examples

Basic vLLM usage

```python
from vllm import LLM, SamplingParams

# Load the model once; weights stay resident on the GPU.
llm = LLM(model="meta-llama/Llama-2-7b-hf")

# generate() takes a list of prompts and returns one output per prompt.
outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(temperature=0.8, max_tokens=100),
)
print(outputs[0].outputs[0].text)
```
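
Because requests are batched continuously, throughput comes from submitting many prompts in one call. A short sketch (the prompts are illustrative):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")

# Many prompts in one call: vLLM schedules them together and returns
# results in the same order as the input list.
prompts = [
    "Explain PagedAttention in one sentence.",
    "Write a haiku about GPUs.",
    "Translate 'hello' to French.",
]
outputs = llm.generate(prompts, SamplingParams(temperature=0.8, max_tokens=64))
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```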

Start OpenAI-compatible server

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --port 8000
```
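
Once the server is up, any OpenAI client can talk to it. A minimal sketch using the official `openai` Python package, assuming the server above is running on localhost:8000 (the `api_key` value is a placeholder, since vLLM does not require one by default):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Llama-2-7b-hf",  # must match the served model
    prompt="Hello, my name is",
    max_tokens=100,
    temperature=0.8,
)
print(completion.choices[0].text)
```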