# TensorRT-LLM
## TL;DR
NVIDIA's optimized LLM inference library. Maximum performance on NVIDIA hardware.
## Use when
- Maximum NVIDIA performance
- Production at scale
- Engineering resources available
## Skip when
- Quick prototyping
- Non-NVIDIA hardware
- Frequent model changes
TensorRT-LLM is NVIDIA's library for optimizing and deploying large language models. It compiles models into optimized TensorRT engines for maximum inference performance on NVIDIA GPUs.
## Features
- **Graph Optimization**: fused kernels and operator fusion
- **Quantization**: INT8, INT4, and FP8 with calibration
- **Multi-GPU**: tensor and pipeline parallelism
- **In-flight Batching**: continuous batching support
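As a sketch of how the quantization feature is typically exercised, the flow below assumes the `examples/quantization/quantize.py` script shipped in the TensorRT-LLM repository; the paths, model, and `fp8` format are illustrative, not prescriptive:

```shell
# Hypothetical FP8 quantization flow (paths are illustrative).
# quantize.py calibrates the model and writes a quantized checkpoint,
# which trtllm-build then compiles into an engine.
python examples/quantization/quantize.py \
    --model_dir ./llama-7b \
    --qformat fp8 \
    --output_dir ./trt_ckpt_fp8

trtllm-build \
    --checkpoint_dir ./trt_ckpt_fp8 \
    --output_dir ./trt_engine_fp8
```

Lower-precision formats trade a small amount of accuracy for reduced memory footprint and higher throughput, so calibration data should be representative of production traffic.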
## Trade-offs
- **Pro**: maximum performance on NVIDIA hardware
- **Con**: longer setup; NVIDIA-only
## Code Examples

**Build a TensorRT-LLM engine**

```bash
# Convert the checkpoint, then build an optimized engine
python convert_checkpoint.py \
    --model_dir ./llama-7b \
    --output_dir ./trt_ckpt

trtllm-build \
    --checkpoint_dir ./trt_ckpt \
    --output_dir ./trt_engine \
    --gemm_plugin float16
```
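Once the engine is built, it can be smoke-tested with the example runner from the TensorRT-LLM repository; the script path, tokenizer directory, and prompt below are assumptions based on the upstream examples, not part of this document:

```shell
# Illustrative smoke test of the built engine (script path assumed from
# the TensorRT-LLM repository's examples; adjust to your checkout).
# The tokenizer is loaded from the original model directory.
python examples/run.py \
    --engine_dir ./trt_engine \
    --tokenizer_dir ./llama-7b \
    --max_output_len 64 \
    --input_text "What is TensorRT-LLM?"
```

A quick single-prompt run like this verifies that the engine loads and generates before wiring it into a production server.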