TensorRT-LLM

TL;DR

NVIDIA's library for optimized LLM inference. Maximum performance on NVIDIA hardware.

Use when

  • Maximum NVIDIA performance
  • Production at scale
  • Have engineering resources

Skip when

  • Quick prototyping
  • Non-NVIDIA hardware
  • Frequent model changes

TensorRT-LLM is NVIDIA's library for optimizing and deploying LLMs. It compiles models to optimized TensorRT engines for maximum performance.

Features

- **Graph Optimization**: Fused kernels, operator fusion
- **Quantization**: INT8, INT4, FP8 with calibration
- **Multi-GPU**: Tensor and pipeline parallelism
- **In-flight Batching**: Continuous batching support
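Calibration-based quantization, as in the INT8 path above, means picking quantization scales from representative data before deployment. The symmetric per-tensor "max calibrator" below is a conceptual toy sketch, not TensorRT-LLM's actual implementation:

```python
import numpy as np

def calibrate_scale(calibration_batches, num_bits=8):
    """Pick a symmetric quantization scale from calibration data (max calibrator)."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for INT8
    amax = max(np.abs(b).max() for b in calibration_batches)
    return amax / qmax

def fake_quantize(x, scale, num_bits=8):
    """Round to the integer grid, then dequantize back to float."""
    qmax = 2 ** (num_bits - 1) - 1
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

# Random tensors stand in for representative activations
batches = [np.random.default_rng(i).standard_normal(1024) for i in range(4)]
scale = calibrate_scale(batches)
x = batches[0]
err = np.abs(fake_quantize(x, scale) - x).max()  # worst-case rounding error <= scale / 2
```

The calibration step is why quantized engines need a representative dataset at build time: a poorly chosen scale either clips outliers or wastes integer range.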

Trade-offs

- **Pro**: Maximum performance on NVIDIA
- **Con**: Longer setup, NVIDIA-only

Code Examples

Build TensorRT-LLM engine

```bash
# Convert the source checkpoint to TensorRT-LLM format
python convert_checkpoint.py \
    --model_dir ./llama-7b \
    --output_dir ./trt_ckpt

# Compile the checkpoint into an optimized TensorRT engine
trtllm-build \
    --checkpoint_dir ./trt_ckpt \
    --output_dir ./trt_engine \
    --gemm_plugin float16
```
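The in-flight (continuous) batching feature listed above can be sketched conceptually: rather than waiting for an entire batch to finish, completed sequences free their slots immediately and queued requests join mid-flight. The toy scheduler below is illustrative only, with made-up request lengths; it is not TensorRT-LLM's batch manager API:

```python
from collections import deque

def inflight_batching(request_lengths, max_batch_size):
    """Toy continuous-batching scheduler.

    Each request needs `length` decode steps; slots free up as soon as a
    sequence finishes, and waiting requests are admitted mid-flight.
    Returns the total number of decode steps taken.
    """
    waiting = deque(enumerate(request_lengths))
    active = {}            # request id -> remaining decode steps
    steps = 0
    while waiting or active:
        # Admit queued requests into any free slots (the "in-flight" part)
        while waiting and len(active) < max_batch_size:
            rid, length = waiting.popleft()
            active[rid] = length
        # One decode step advances every active sequence
        steps += 1
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]   # slot is reusable on the very next step
    return steps

steps = inflight_batching([5, 1, 1, 1], max_batch_size=2)
# -> 5 steps; static batching of [5, 1] then [1, 1] would take 6
```

Short requests no longer wait for the longest sequence in their batch, which is where the throughput gain over static batching comes from.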