Continuous Batching

Speed: 10-20x throughput vs static batching

TL;DR

Dynamic batching that removes finished requests and admits new ones mid-batch. Maximizes GPU utilization.

Use when

  • Production serving
  • Variable-length requests
  • High throughput needed

Skip when

  • Single-request inference
  • Batch inference with same lengths

Continuous batching (also called in-flight batching) dynamically adjusts the batch as requests complete, rather than waiting for all requests to finish.

Traditional vs Continuous

**Static Batching**: Wait for all requests to complete, then start the next batch. The GPU idles while waiting for the slowest request.

**Continuous Batching**: As each request finishes, immediately add a new one from the queue. The GPU stays saturated.
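The difference can be seen in a toy decode-step model (a minimal sketch, not vLLM's actual scheduler): each step, every occupied batch slot emits one token; under static batching a batch runs until its longest request finishes, while under continuous batching a freed slot is refilled immediately.

```python
def static_steps(lengths, batch_size):
    # Each batch runs until its slowest (longest) request finishes.
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_steps(lengths, batch_size):
    # Each decode step, every occupied slot emits one token;
    # a finished request's slot is refilled from the queue immediately.
    queue = list(lengths)
    slots = [queue.pop(0) for _ in range(min(batch_size, len(queue)))]
    steps = 0
    while slots:
        steps += 1
        nxt = []
        for remaining in slots:
            remaining -= 1
            if remaining > 0:
                nxt.append(remaining)
            elif queue:
                nxt.append(queue.pop(0))
        slots = nxt
    return steps

# Eight requests with variable output lengths (in tokens), batch size 4.
lengths = [10, 100, 20, 80, 15, 90, 25, 60]
print(static_steps(lengths, 4), continuous_steps(lengths, 4))  # → 190 110
```

With these example lengths, continuous batching finishes in 110 decode steps versus 190 for static batching, purely from not idling slots behind the slowest request; the gap widens as length variance grows.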

Key Benefits

- **Utilization**: Near-100% GPU usage
- **Latency**: No waiting for batch completion
- **Throughput**: 10-20x improvement possible

Code Examples

vLLM serves with continuous batching enabled by default:

```bash
# Start vLLM server (continuous batching is default)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --port 8000
```
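Once the server is up, any number of concurrent clients can hit its OpenAI-compatible endpoint and the scheduler interleaves their decode steps. A sketch using only the standard library (the URL, port, and model name match the launch command above; the actual network call is commented out since it needs a running server):

```python
import json
from urllib import request

def build_completion_request(prompt, base_url="http://localhost:8000"):
    # Build an HTTP request for the OpenAI-compatible /v1/completions route.
    payload = {
        "model": "meta-llama/Llama-2-7b-hf",
        "prompt": prompt,
        "max_tokens": 64,
    }
    return request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_completion_request("The capital of France is")
# with request.urlopen(req) as resp:   # requires the server above to be running
#     print(json.load(resp)["choices"][0]["text"])
```

Sending many such requests in parallel (threads, asyncio, or separate processes) is exactly the workload continuous batching is built for.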