# TensorRT-LLM
## TL;DR
NVIDIA's optimized LLM inference library. Maximum performance on NVIDIA hardware.
## Use when
- Maximum NVIDIA performance
- Production at scale
- Engineering resources available
## Skip when
- Quick prototyping
- Non-NVIDIA hardware
- Frequent model changes
TensorRT-LLM is NVIDIA's library for optimizing and deploying large language models. It compiles models into optimized TensorRT engines for maximum inference performance on NVIDIA GPUs.
## Features
- **Graph Optimization**: fused kernels and operator fusion
- **Quantization**: INT8, INT4, and FP8 with calibration
- **Multi-GPU**: tensor and pipeline parallelism
- **In-flight Batching**: continuous batching support
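As a sketch of how the quantization feature is typically exercised, the flow below assumes the `examples/quantization/quantize.py` script shipped in the TensorRT-LLM repository; the paths, model, and `fp8` format are illustrative, not prescriptive:

```shell
# Hypothetical FP8 quantization flow (paths are illustrative).
# quantize.py calibrates the model and writes a quantized checkpoint,
# which trtllm-build then compiles into an engine.
python examples/quantization/quantize.py \
    --model_dir ./llama-7b \
    --qformat fp8 \
    --output_dir ./trt_ckpt_fp8

trtllm-build \
    --checkpoint_dir ./trt_ckpt_fp8 \
    --output_dir ./trt_engine_fp8
```

Lower-precision formats trade a small amount of accuracy for reduced memory footprint and higher throughput, so calibration data should be representative of production traffic.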
## Trade-offs
- **Pro**: maximum performance on NVIDIA hardware
- **Con**: longer setup; NVIDIA-only
## Code Examples

**Build a TensorRT-LLM engine**

```bash
# Convert the checkpoint, then build an optimized engine
python convert_checkpoint.py \
    --model_dir ./llama-7b \
    --output_dir ./trt_ckpt

trtllm-build \
    --checkpoint_dir ./trt_ckpt \
    --output_dir ./trt_engine \
    --gemm_plugin float16
```
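Once the engine is built, it can be smoke-tested with the example runner from the TensorRT-LLM repository; the script path, tokenizer directory, and prompt below are assumptions based on the upstream examples, not part of this document:

```shell
# Illustrative smoke test of the built engine (script path assumed from
# the TensorRT-LLM repository's examples; adjust to your checkout).
# The tokenizer is loaded from the original model directory.
python examples/run.py \
    --engine_dir ./trt_engine \
    --tokenizer_dir ./llama-7b \
    --max_output_len 64 \
    --input_text "What is TensorRT-LLM?"
```

A quick single-prompt run like this verifies that the engine loads and generates before wiring it into a production server.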