GR00T supports multiple optimization techniques to improve inference speed, from PyTorch eager mode to torch.compile and TensorRT acceleration.
Performance overview
GR00T-N1.6-3B inference timing with 4 denoising steps:

| Device | Mode | Data Processing | Backbone | Action Head | E2E | Frequency |
|---|---|---|---|---|---|---|
| RTX 5090 | PyTorch Eager | 2 ms | 18 ms | 38 ms | 58 ms | 17.3 Hz |
| RTX 5090 | torch.compile | 2 ms | 18 ms | 16 ms | 37 ms | 27.3 Hz |
| RTX 5090 | TensorRT | 2 ms | 18 ms | 11 ms | 31 ms | 32.1 Hz |
| H100 | PyTorch Eager | 4 ms | 23 ms | 49 ms | 77 ms | 13.0 Hz |
| H100 | torch.compile | 4 ms | 23 ms | 11 ms | 38 ms | 26.3 Hz |
| H100 | TensorRT | 4 ms | 22 ms | 10 ms | 36 ms | 27.9 Hz |
| RTX 4090 | PyTorch Eager | 2 ms | 25 ms | 55 ms | 82 ms | 12.2 Hz |
| RTX 4090 | torch.compile | 2 ms | 25 ms | 17 ms | 44 ms | 22.8 Hz |
| RTX 4090 | TensorRT | 2 ms | 24 ms | 16 ms | 43 ms | 23.3 Hz |
| Thor | PyTorch Eager | 5 ms | 38 ms | 74 ms | 117 ms | 8.6 Hz |
| Thor | torch.compile | 5 ms | 39 ms | 61 ms | 105 ms | 9.5 Hz |
| Thor | TensorRT | 5 ms | 38 ms | 49 ms | 92 ms | 10.9 Hz |
| Orin | PyTorch Eager | 6 ms | 93 ms | 202 ms | 300 ms | 3.3 Hz |
| Orin | torch.compile | 6 ms | 93 ms | 101 ms | 199 ms | 5.0 Hz |
| Orin | TensorRT | 6 ms | 95 ms | 72 ms | 173 ms | 5.8 Hz |

The backbone (Vision Encoder + Language Model) timing is essentially the same across all modes (within 1 to 2 ms). Only the Action Head (DiT) is optimized with torch.compile or TensorRT, which is why the Action Head column shows significant speedups while the Backbone column stays nearly constant.
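The relationship between the timing columns can be sketched in a few lines of Python. This is only an illustration, assuming that E2E latency is roughly the sum of the three stages and that frequency is the reciprocal of E2E latency. The helper names `end_to_end_ms` and `frequency_hz` are hypothetical, and the inputs are the rounded RTX 5090 values from the table, so the results can differ slightly from the published figures, which were presumably computed from unrounded measurements.

```python
# Component latencies in milliseconds, copied from the RTX 5090 rows of the table.
TIMINGS_MS = {
    "PyTorch Eager": {"data": 2, "backbone": 18, "action_head": 38},
    "torch.compile": {"data": 2, "backbone": 18, "action_head": 16},
    "TensorRT": {"data": 2, "backbone": 18, "action_head": 11},
}


def end_to_end_ms(stages):
    """Sum the per-stage latencies to approximate the E2E latency."""
    return sum(stages.values())


def frequency_hz(latency_ms):
    """Convert a per-inference latency in milliseconds to a rate in Hz."""
    return 1000.0 / latency_ms


for mode, stages in TIMINGS_MS.items():
    e2e = end_to_end_ms(stages)
    print(f"{mode}: {e2e} ms -> {frequency_hz(e2e):.1f} Hz")
```

The same two helpers can be applied to timings measured on other hardware to check whether a given mode meets a target control rate.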
Speedup comparison
Speedup vs PyTorch Eager mode:

| Device | Mode | E2E Speedup | Action Head Speedup |
|---|---|---|---|
| RTX 5090 | PyTorch Eager | 1.00x | 1.00x |
| RTX 5090 | torch.compile | 1.58x | 2.32x |
| RTX 5090 | TensorRT | 1.86x | 3.59x |
| H100 | PyTorch Eager | 1.00x | 1.00x |
| H100 | torch.compile | 2.02x | 4.60x |
| H100 | TensorRT | 2.14x | 4.80x |
| RTX 4090 | PyTorch Eager | 1.00x | 1.00x |
| RTX 4090 | torch.compile | 1.87x | 3.26x |
| RTX 4090 | TensorRT | 1.92x | 3.48x |
| Thor | PyTorch Eager | 1.00x | 1.00x |
| Thor | torch.compile | 1.11x | 1.20x |
| Thor | TensorRT | 1.27x | 1.49x |
| Orin | PyTorch Eager | 1.00x | 1.00x |
| Orin | torch.compile | 1.50x | 2.00x |
| Orin | TensorRT | 1.73x | 2.80x |
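Because only the action head is accelerated, the E2E speedup is always smaller than the action head speedup. The sketch below illustrates this with an Amdahl-style estimate. It is a rough model, not the method used to produce the tables: it assumes that data processing and backbone times are unchanged, and the function names are hypothetical. Inputs are the rounded H100 eager-mode timings, so the estimate only approximates the published E2E speedup.

```python
def speedup(baseline_ms, optimized_ms):
    """Ratio of baseline latency to optimized latency."""
    return baseline_ms / optimized_ms


def e2e_speedup_estimate(data_ms, backbone_ms, action_head_ms, action_head_speedup):
    """Estimate the E2E speedup when only the action head gets faster."""
    baseline = data_ms + backbone_ms + action_head_ms
    optimized = data_ms + backbone_ms + action_head_ms / action_head_speedup
    return baseline / optimized


# H100 eager-mode timings (ms) and the TensorRT action head speedup from the tables.
estimate = e2e_speedup_estimate(4, 23, 49, 4.80)

# Upper bound on E2E speedup if the action head took no time at all.
ceiling = speedup(4 + 23 + 49, 4 + 23)

print(f"Estimated E2E speedup: {estimate:.2f}x (ceiling {ceiling:.2f}x)")
```

The ceiling shows why further action head optimization has diminishing returns: once the action head is fast, the unoptimized backbone dominates the E2E latency.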
PyTorch mode (default)
Run inference without optimization by passing `--inference-mode pytorch` (the default) to `standalone_inference_script.py`.
torch.compile optimization
PyTorch’s built-in compiler, torch.compile, optimizes the action head (DiT) for faster inference.

The first inference call is slower because it triggers compilation. Subsequent calls benefit from the optimized kernels.
Performance characteristics
- RTX 5090: 1.58x faster E2E, 2.32x faster action head
- H100: 2.02x faster E2E, 4.60x faster action head
- RTX 4090: 1.87x faster E2E, 3.26x faster action head
- Orin: 1.50x faster E2E, 2.00x faster action head
TensorRT optimization
TensorRT provides the fastest inference by optimizing and compiling the action head to GPU-specific kernels. See the TensorRT guide for detailed setup.
Quick setup

```bash
# 1. Export the action head (DiT) to ONNX
python scripts/deployment/export_onnx_n1d6.py \
    --model-path nvidia/GR00T-N1.6-3B \
    --dataset-path demo_data/gr1.PickNPlace \
    --embodiment-tag GR1 \
    --output-dir ./groot_n1d6_onnx

# 2. Build a BF16 TensorRT engine from the exported ONNX model
python scripts/deployment/build_tensorrt_engine.py \
    --onnx ./groot_n1d6_onnx/dit_model.onnx \
    --engine ./groot_n1d6_onnx/dit_model_bf16.trt \
    --precision bf16
```

Performance characteristics
- RTX 5090: 1.86x faster E2E, 3.59x faster action head (31 ms E2E, 32.1 Hz)
- H100: 2.14x faster E2E, 4.80x faster action head (36 ms E2E, 27.9 Hz)
- RTX 4090: 1.92x faster E2E, 3.48x faster action head (43 ms E2E, 23.3 Hz)
- Orin: 1.73x faster E2E, 2.80x faster action head (173 ms E2E, 5.8 Hz)
Benchmarking your hardware
Run the benchmark script to measure performance on your hardware.
Benchmark arguments

| Argument | Default | Description |
|---|---|---|
| `--model-path` | `nvidia/GR00T-N1.6-3B` | Model checkpoint path |
| `--dataset-path` | `demo_data/gr1.PickNPlace` | Dataset path |
| `--embodiment-tag` | `GR1` | Embodiment tag |
| `--trt-engine-path` | (optional) | TensorRT engine path |
| `--num-iterations` | `20` | Number of benchmark iterations |
| `--warmup` | `5` | Warmup iterations |
| `--skip-compile` | `false` | Skip torch.compile benchmark |
| `--seed` | `42` | Random seed |
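
To show how the `--num-iterations`, `--warmup`, and `--seed` arguments typically interact, here is a minimal timing loop in plain Python. It is a sketch of the general technique, not the benchmark script's actual implementation. The `benchmark` function and the `fake_inference` workload are hypothetical stand-ins, and the defaults simply mirror the table above.

```python
import random
import statistics
import time


def benchmark(fn, num_iterations=20, warmup=5, seed=42):
    """Time `fn` after discarding warmup calls; return per-call latencies in ms."""
    random.seed(seed)  # fix the seed so randomized workloads are repeatable
    for _ in range(warmup):
        fn()  # warmup calls absorb one-time costs such as compilation
    latencies_ms = []
    for _ in range(num_iterations):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def fake_inference():
    """Hypothetical stand-in for a policy inference call."""
    sum(i * i for i in range(10_000))


latencies = benchmark(fake_inference)
mean_ms = statistics.mean(latencies)
print(f"mean {mean_ms:.3f} ms over {len(latencies)} iterations ({1000.0 / mean_ms:.1f} Hz)")
```

When timing GPU work with a loop like this, the device must be synchronized before each timer read (for example with `torch.cuda.synchronize()`), otherwise the measurement only captures kernel launch time.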
Output example
Architecture
GR00T’s inference pipeline consists of three main components: data processing, the backbone (Vision Encoder + Language Model), and the action head (DiT).
Optimization selection guide

| Use Case | Recommended Mode | Rationale |
|---|---|---|
| Development/debugging | PyTorch Eager | Easy debugging, no compilation overhead |
| Production (simple setup) | torch.compile | Good speedup, minimal setup |
| Production (maximum performance) | TensorRT | Best performance, requires engine build |
| Edge devices (Jetson) | TensorRT | Optimized for embedded GPUs |
| Rapid prototyping | PyTorch Eager | Fast iteration |
Command-line arguments
standalone_inference_script.py
| Argument | Default | Description |
|---|---|---|
| `--model-path` | (required) | Model checkpoint path |
| `--dataset-path` | (required) | LeRobot dataset path |
| `--embodiment-tag` | `GR1` | Embodiment tag |
| `--traj-ids` | `[0]` | Trajectory IDs to evaluate |
| `--steps` | `200` | Max steps per trajectory |
| `--action-horizon` | `16` | Action horizon |
| `--inference-mode` | `pytorch` | Inference mode: `pytorch` or `tensorrt` |
| `--trt-engine-path` | `./groot_n1d6_onnx/dit_model_bf16.trt` | TensorRT engine path |
| `--denoising-steps` | `4` | Denoising steps |
| `--skip-timing-steps` | `1` | Steps to skip for timing (warmup) |
| `--seed` | `42` | Random seed |
| `--video-backend` | `torchcodec` | Video backend |
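
As an illustration of how these arguments combine, the following sketch assembles a command line for either inference mode. It rests on two assumptions: that the script lives under `scripts/deployment/` next to the export and engine-build scripts, and that flags are passed as `--flag value` pairs as in the TensorRT quick setup commands. The `build_command` helper is hypothetical and not part of GR00T.

```python
import shlex

# Assumed location of the script, inferred from the other scripts/deployment/ commands.
SCRIPT = "scripts/deployment/standalone_inference_script.py"


def build_command(model_path, dataset_path, inference_mode="pytorch",
                  trt_engine_path=None, denoising_steps=4):
    """Assemble an argument list using only flags documented in the table."""
    if inference_mode not in ("pytorch", "tensorrt"):
        raise ValueError(f"unsupported inference mode: {inference_mode}")
    cmd = [
        "python", SCRIPT,
        "--model-path", model_path,
        "--dataset-path", dataset_path,
        "--inference-mode", inference_mode,
        "--denoising-steps", str(denoising_steps),
    ]
    # The engine path only matters when running with TensorRT.
    if inference_mode == "tensorrt" and trt_engine_path is not None:
        cmd += ["--trt-engine-path", trt_engine_path]
    return cmd


cmd = build_command(
    "nvidia/GR00T-N1.6-3B",
    "demo_data/gr1.PickNPlace",
    inference_mode="tensorrt",
    trt_engine_path="./groot_n1d6_onnx/dit_model_bf16.trt",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root, or the printed string can be pasted into a shell.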
Troubleshooting
Compilation errors with torch.compile
Out of memory errors
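Reduce the batch size or the action horizon, for example by passing a smaller `--action-horizon` value.
Slow first inference
This is expected with torch.compile and TensorRT. Add warmup iterations so that one-time costs are excluded from timing, for example with `--warmup` in the benchmark script or `--skip-timing-steps` in `standalone_inference_script.py`.
Advanced topics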
Analyzing inference timing
Use the provided Jupyter notebook for detailed analysis. It covers:

- Component-wise timing breakdown
- Visualization of speedups across devices
- Comparison of different optimization modes