# vLLM

**Speed:** 10-24x throughput vs. HuggingFace Transformers
## TL;DR
High-throughput LLM serving with PagedAttention. De-facto standard for production inference.
## Use when

- Production LLM serving
- High throughput needed
- NVIDIA/AMD GPUs
## Skip when

- CPU-only deployment
- Apple Silicon
vLLM is a fast and easy-to-use library for LLM inference and serving. It pioneered PagedAttention and helped popularize continuous batching for high-throughput inference.
## Features

- **PagedAttention**: Efficient KV-cache management
- **Continuous Batching**: Dynamic request handling
- **Quantization**: GPTQ, AWQ, FP8 support
- **Tensor Parallelism**: Multi-GPU inference
- **OpenAI-compatible API**: Drop-in replacement
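
A minimal sketch of how the quantization and tensor-parallelism options are passed to the `LLM` constructor; the checkpoint name and GPU count below are placeholders, so substitute your own AWQ/GPTQ model and hardware layout:

```python
from vllm import LLM, SamplingParams

# Load an AWQ-quantized checkpoint sharded across 2 GPUs.
# "TheBloke/Llama-2-7B-AWQ" is an example checkpoint; any
# AWQ-quantized model on the Hugging Face Hub works the same way.
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",        # also: "gptq", "fp8"
    tensor_parallel_size=2,    # shard weights across 2 GPUs
)

outputs = llm.generate(
    ["Explain PagedAttention in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```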
## Code Examples
**Basic vLLM usage**

```python
from vllm import LLM, SamplingParams

# Load the model (weights are downloaded from the Hugging Face Hub on first use)
llm = LLM(model="meta-llama/Llama-2-7b-hf")

# Generate a completion with sampling parameters
outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(temperature=0.8, max_tokens=100),
)
print(outputs[0].outputs[0].text)
```

**Start an OpenAI-compatible server**

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --port 8000
```
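
Once the server is running, any OpenAI-compatible client can talk to it. A minimal sketch using the official `openai` Python client; the base URL, placeholder API key, and prompt are assumptions, so adjust them to your deployment:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
# vLLM does not require a real API key; any non-empty string works here.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Llama-2-7b-hf",  # must match the --model the server was started with
    prompt="Hello, my name is",
    max_tokens=100,
    temperature=0.8,
)
print(completion.choices[0].text)
```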