# Continuous Batching
**Speed:** 10-20x throughput vs. static batching
## TL;DR
Dynamic batching that admits new requests and evicts finished ones mid-flight, rather than waiting for the whole batch to drain. Maximizes GPU utilization.
## Use when

- Production serving
- Variable-length requests
- High throughput needed

## Skip when

- Single-request inference
- Batch inference where all requests have the same length
Continuous batching (also called in-flight batching) dynamically adjusts the batch as requests complete, rather than waiting for all requests to finish.
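The core mechanism can be sketched in a few lines: a queue of waiting requests, a fixed number of batch slots, and a loop that refills free slots on every decode step. This is an illustrative simulation, not vLLM's actual scheduler; the function name and step model are assumptions for the sketch.

```python
# Minimal sketch of a continuous-batching scheduler (illustrative only, not
# vLLM's real implementation). Each request needs a different number of
# decode steps; a finished request frees its slot immediately for the next
# waiting request instead of blocking until the whole batch drains.
from collections import deque

def continuous_batch(request_lengths, max_batch_size):
    """Simulate decoding; return total steps and a per-step batch-size trace."""
    waiting = deque(request_lengths)   # requests not yet admitted
    running = []                       # remaining tokens for each active request
    trace = []
    while waiting or running:
        # Admit waiting requests into any free slots (the "continuous" part).
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        trace.append(len(running))
        # One decode step for every active request; drop the finished ones.
        running = [r - 1 for r in running if r - 1 > 0]
    return len(trace), trace

steps, trace = continuous_batch([3, 7, 2, 5], max_batch_size=2)
# With static batching the same workload takes max(3,7) + max(2,5) = 12 steps;
# here finished requests are replaced immediately, so it completes sooner.
```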
## Traditional vs. Continuous
**Static batching**: wait for every request in the batch to complete, then start the next batch. The GPU idles while faster requests wait on the slowest one.

**Continuous batching**: as each request finishes, its slot is immediately handed to a waiting request. The GPU stays saturated.
## Key Benefits

- **Utilization**: near-100% GPU usage
- **Latency**: no request waits for the rest of its batch to finish
- **Throughput**: 10-20x improvement possible
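The utilization claim is easy to see with back-of-envelope arithmetic. The numbers below are illustrative, not a benchmark, and padding waste is only one source of the gap; the larger reported gains also come from finer-grained scheduling and memory management.

```python
# Back-of-envelope: with static batching, every slot in a batch stays occupied
# until the batch's longest request finishes, so short requests burn slot-steps
# as padding. (Illustrative numbers, not a benchmark.)
def static_steps(request_lengths, batch_size):
    """Total decode steps when each batch runs until its slowest request."""
    total = 0
    for i in range(0, len(request_lengths), batch_size):
        total += max(request_lengths[i:i + batch_size])
    return total

lengths = [10, 200, 15, 180, 12, 190, 20, 160]  # highly variable, like real traffic
batch_size = 4

useful = sum(lengths)                                      # 787 steps of real work
occupied = static_steps(lengths, batch_size) * batch_size  # 1560 slot-steps held
# Static batching keeps the GPU only ~50% busy on this workload (787 / 1560);
# continuous batching refills each slot the moment its request finishes.
```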
## Code Examples

vLLM uses continuous batching by default:

```bash
# Start the vLLM OpenAI-compatible server (continuous batching is the default)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --port 8000
```
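Once the server is up, it exposes an OpenAI-compatible `/v1/completions` endpoint; it is concurrent requests from clients like this that the continuous batcher interleaves. A minimal client sketch using only the standard library (the helper names, prompt, and default `max_tokens` are illustrative; model name and port mirror the server command above):

```python
# Minimal client sketch for the OpenAI-compatible endpoint started above.
# Helper names are illustrative; URL and model mirror the server command.
import json
import urllib.request

def build_completion_request(prompt, max_tokens=64):
    """Build the JSON payload for POST /v1/completions."""
    return {
        "model": "meta-llama/Llama-2-7b-hf",
        "prompt": prompt,
        "max_tokens": max_tokens,
    }

def complete(prompt, base_url="http://localhost:8000"):
    """Send one completion request and return the generated text."""
    payload = json.dumps(build_completion_request(prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]
```

Firing many such requests from separate threads or tasks is what actually exercises continuous batching: the server slots them into the running batch as capacity frees up, rather than queueing them behind a full batch.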