This guide demonstrates how to add support for your own robot embodiment in GR00T. The modality configuration defines how your robot’s data should be loaded, processed, and interpreted by the model.
Overview
Each embodiment requires a Python configuration file that specifies:
- Which observations to use (video cameras, proprioceptive states)
- How to sample data temporally (current frame, historical frames, future action horizons)
- How actions should be interpreted and transformed
- Which language annotations to use
Configuration structure
A modality configuration is a Python dictionary containing four top-level keys: "video", "state", "action", and "language". Each key maps to a ModalityConfig object.
Here’s the SO-100 example from examples/SO100/so100_config.py:
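As a rough sketch of the shape such a file takes, the block below builds the four-key dictionary from the SO-100 keys and the action horizon mentioned in this guide. The dataclasses and enums are local stand-ins written from the field descriptions here so that the snippet runs without GR00T installed; the real classes ship with the GR00T package and may differ in detail. The choice of RELATIVE for the arm and ABSOLUTE for the gripper is an illustrative assumption, not a copy of the actual file.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


# Stand-ins for illustration only; the real classes ship with GR00T.
class ActionRepresentation(Enum):
    RELATIVE = "relative"
    ABSOLUTE = "absolute"


class ActionType(Enum):
    EEF = "eef"
    NON_EEF = "non_eef"


class ActionFormat(Enum):
    DEFAULT = "default"
    XYZ_ROT6D = "xyz_rot6d"
    XYZ_ROTVEC = "xyz_rotvec"


@dataclass
class ActionConfig:
    rep: ActionRepresentation
    type: ActionType
    format: ActionFormat
    state_key: Optional[str] = None


@dataclass
class ModalityConfig:
    delta_indices: list
    modality_keys: list
    sin_cos_embedding_keys: Optional[list] = None
    mean_std_embedding_keys: Optional[list] = None
    action_configs: Optional[list] = None


so100_config = {
    # Two camera views, current frame only.
    "video": ModalityConfig(delta_indices=[0], modality_keys=["front", "wrist"]),
    # Proprioceptive state, current timestep only.
    "state": ModalityConfig(delta_indices=[0], modality_keys=["single_arm", "gripper"]),
    # 16-step action horizon; one ActionConfig per action key.
    "action": ModalityConfig(
        delta_indices=list(range(0, 16)),
        modality_keys=["single_arm", "gripper"],
        action_configs=[
            # Assumed for illustration: arm joints as deltas, gripper as a target.
            ActionConfig(ActionRepresentation.RELATIVE, ActionType.NON_EEF, ActionFormat.DEFAULT),
            ActionConfig(ActionRepresentation.ABSOLUTE, ActionType.NON_EEF, ActionFormat.DEFAULT),
        ],
    ),
    "language": ModalityConfig(
        delta_indices=[0],
        modality_keys=["annotation.human.action.task_description"],
    ),
}
```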
Understanding ModalityConfig
Each ModalityConfig specifies two required fields and several optional ones.
Required fields
delta_indices (list[int])
Defines which temporal offsets to sample relative to the current timestep. This enables:
- Historical context: Use negative indices (e.g., [-2, -1, 0]) to include past observations
- Current observation: Use [0] for the current timestep
- Future actions: Use positive indices (e.g., list(range(0, 16))) for action prediction horizons
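To make the offsets concrete, here is a minimal sketch of how delta_indices could be turned into absolute frame indices. The helper name sample_timesteps is hypothetical, and clamping out-of-range indices to the episode boundaries is an assumption made for illustration; GR00T's own loader may pad differently.

```python
def sample_timesteps(t, delta_indices, episode_length):
    """Return absolute frame indices for timestep t, clamped to the episode.

    Clamping (repeating the first or last frame) is an assumed padding
    strategy, not necessarily what GR00T does.
    """
    last = episode_length - 1
    return [min(max(t + d, 0), last) for d in delta_indices]


# Two past frames plus the current one, near the start of an episode.
history = sample_timesteps(t=1, delta_indices=[-2, -1, 0], episode_length=100)

# Current observation only.
current = sample_timesteps(t=10, delta_indices=[0], episode_length=100)

# A 16-step action horizon that runs past the end of the episode.
horizon = sample_timesteps(t=90, delta_indices=list(range(0, 16)), episode_length=100)
```

The returned list always has the same length as delta_indices, so the model sees a fixed-size window regardless of where the timestep falls in the episode.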
modality_keys (list[str])
Specifies which keys to load from your dataset. These keys must match the keys defined in your meta/modality.json file.
For the SO-100 example:
- Video keys: Must match keys in meta/modality.json under "video" (e.g., "front", "wrist")
- State keys: Must match keys in meta/modality.json under "state" (e.g., "single_arm", "gripper")
- Action keys: Must match keys in meta/modality.json under "action" (e.g., "single_arm", "gripper")
- Language keys: Must match keys in meta/modality.json under "annotation" (e.g., "annotation.human.action.task_description")
Optional fields
sin_cos_embedding_keys (list[str] | None)
Specifies which state keys should use sine/cosine encoding. Best for dimensions that are in radians (e.g., joint angles). If not specified, min-max normalization is used.
Sine/cosine embedding doubles the number of dimensions and is only recommended for proprioceptive states.
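The difference between the two encodings can be sketched in a few lines. Both helper names are hypothetical; the ordering of the output (all sines followed by all cosines) and the [-1, 1] target range for min-max normalization are assumptions chosen for illustration.

```python
import math


def sin_cos_encode(angles):
    """Encode angles in radians as sines followed by cosines (assumed ordering)."""
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]


def min_max_normalize(values, lows, highs):
    """Map each value from [low, high] to [-1, 1] (assumed target range)."""
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v, lo, hi in zip(values, lows, highs)]


joints = [0.0, math.pi / 2]

# Two input dimensions become four output dimensions.
encoded = sin_cos_encode(joints)

# Min-max normalization keeps the dimensionality unchanged.
normalized = min_max_normalize([0.5], lows=[0.0], highs=[1.0])
```

The sine/cosine form needs no dataset statistics and keeps angles that wrap around close to each other, at the cost of doubling the state width.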
mean_std_embedding_keys (list[str] | None)
Specifies which keys should use mean/standard deviation normalization instead of min-max normalization.
action_configs (list[ActionConfig] | None)
Required for the "action" modality. Defines how each action modality should be interpreted and transformed. The list must have the same length as modality_keys.
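The length requirement is easy to check before training starts. The sketch below uses a hypothetical helper, pair_action_configs, and assumes that configs correspond to keys by position; plain strings stand in for ActionConfig objects to keep the snippet self-contained.

```python
def pair_action_configs(modality_keys, action_configs):
    """Pair each action key with its config, assuming positional correspondence."""
    if action_configs is None:
        raise ValueError("the action modality requires action_configs")
    if len(action_configs) != len(modality_keys):
        raise ValueError(
            f"expected {len(modality_keys)} action configs, got {len(action_configs)}"
        )
    return dict(zip(modality_keys, action_configs))


# Strings stand in for ActionConfig objects here.
paired = pair_action_configs(["single_arm", "gripper"], ["arm config", "gripper config"])
```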
Configuring each modality
Video modality
Defines which camera views to use.
State modality
Defines proprioceptive observations (joint positions, gripper states, etc.).
Action modality
Defines the action space and prediction horizon.
Language modality
Defines which language annotations to use.
Understanding ActionConfig
Each ActionConfig has three required fields and one optional field.
rep (ActionRepresentation)
Defines how actions should be interpreted:
- RELATIVE: Actions are deltas from the current state (introduced in the UMI paper)
- ABSOLUTE: Actions are target positions
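The relationship between the two representations is simple arithmetic for joint-space values. The sketch below assumes every step in the action horizon is expressed relative to the same reference state (the state at the current timestep), not to the previous step; the function names are hypothetical.

```python
def to_relative(absolute_actions, reference_state):
    """Express each target in the horizon as a delta from the reference state."""
    return [[a - s for a, s in zip(step, reference_state)] for step in absolute_actions]


def to_absolute(relative_actions, reference_state):
    """Recover targets by adding the reference state back to each delta."""
    return [[d + s for d, s in zip(step, reference_state)] for step in relative_actions]


current_state = [0.5, 1.0]              # two joint positions at the current timestep
targets = [[0.75, 1.0], [1.0, 0.5]]     # two future target positions

deltas = to_relative(targets, current_state)
recovered = to_absolute(deltas, current_state)
```

Element-wise subtraction like this only makes sense for quantities that add linearly, such as joint angles; end-effector rotations need a proper rotation composition instead.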
type (ActionType)
Specifies the control space:
- EEF: End-effector/Cartesian space control (expecting a 9-dimensional vector: x, y, z positions + rotation 6D)
- NON_EEF: Joint space control and other non-EEF control spaces (joint angles, positions, gripper positions, etc.)
format (ActionFormat)
Defines the action representation format:
- DEFAULT: Standard format (e.g., joint angles, gripper positions)
- XYZ_ROT6D: 3D position + 6D rotation representation for end-effector control
- XYZ_ROTVEC: 3D position + rotation vector for end-effector control
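For readers unfamiliar with the 6D rotation representation, the following sketch splits a 9-dimensional XYZ_ROT6D vector and rebuilds a rotation matrix with Gram-Schmidt orthogonalization. It assumes the six rotation values are the first two columns of the rotation matrix, which is one common convention; GR00T's exact layout may differ, and the function names are hypothetical.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def split_xyz_rot6d(action):
    """Split a 9-dimensional vector into position (3) and rotation 6D (6)."""
    if len(action) != 9:
        raise ValueError(f"expected 9 values, got {len(action)}")
    return action[:3], action[3:]


def rot6d_to_matrix(rot6d):
    """Rebuild a 3x3 rotation matrix, assuming rot6d holds its first two columns."""
    a1, a2 = rot6d[:3], rot6d[3:]
    b1 = _normalize(a1)
    # Remove the component of a2 along b1, then normalize.
    dot = sum(x * y for x, y in zip(b1, a2))
    b2 = _normalize([y - dot * x for x, y in zip(b1, a2)])
    # The third column is fixed by orthogonality.
    b3 = _cross(b1, b2)
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


position, rot6d = split_xyz_rot6d([0.1, 0.2, 0.3, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0])
rotation = rot6d_to_matrix(rot6d)
```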
state_key (str | None)
Optional. Specifies the corresponding reference state key for computing relative actions when rep=RELATIVE. If not provided, the system will use the action key as the reference state key.
Example with state_key:
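A minimal sketch of the fallback rule, using a local stand-in for ActionConfig with plain strings in place of the GR00T enums. The key names eef_pose_target and eef_pose_measured and the helper reference_state_key are hypothetical and exist only to show an action key whose reference state is stored under a different name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionConfig:
    """Stand-in with string fields; the real class uses GR00T's enums."""
    rep: str
    type: str
    format: str
    state_key: Optional[str] = None


def reference_state_key(action_key, config):
    """Use state_key when given, otherwise fall back to the action key."""
    return config.state_key if config.state_key is not None else action_key


# Hypothetical key names: the action and its reference state are named differently.
with_key = ActionConfig("RELATIVE", "EEF", "XYZ_ROT6D", state_key="eef_pose_measured")
without_key = ActionConfig("RELATIVE", "NON_EEF", "DEFAULT")

explicit = reference_state_key("eef_pose_target", with_key)
fallback = reference_state_key("single_arm", without_key)
```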
Complete example: SO-100
The complete SO-100 configuration is in examples/SO100/so100_config.py.
Key relationships with meta/modality.json
The modality configuration’s modality_keys must reference keys that exist in your dataset’s meta/modality.json.
Example meta/modality.json:
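As an illustration of what such a file might contain for the SO-100 keys used in this guide, the sketch below embeds a small JSON document and checks configuration keys against it. The layout (start/end slices for state and action, an original_key entry for video), the index values, and the missing_keys helper are all assumptions made for this example and are not copied from a real dataset; language keys are left out for brevity.

```python
import json

# Assumed layout and values, for illustration only.
MODALITY_JSON = """
{
  "state": {
    "single_arm": {"start": 0, "end": 5},
    "gripper": {"start": 5, "end": 6}
  },
  "action": {
    "single_arm": {"start": 0, "end": 5},
    "gripper": {"start": 5, "end": 6}
  },
  "video": {
    "front": {"original_key": "observation.images.front"},
    "wrist": {"original_key": "observation.images.wrist"}
  }
}
"""

modality_meta = json.loads(MODALITY_JSON)


def missing_keys(meta, section, modality_keys):
    """Return the configured keys that are absent from one section of the file."""
    available = meta.get(section, {})
    return [key for key in modality_keys if key not in available]


state_missing = missing_keys(modality_meta, "state", ["single_arm", "gripper"])
video_missing = missing_keys(modality_meta, "video", ["front", "overhead"])
```

Running a check like this before training surfaces typos in modality_keys early, before any data is loaded.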
During data loading, the pipeline will:
- Look up the entries in meta/modality.json that correspond to modality_keys
- Extract the correct slices from the concatenated state/action arrays
- Apply the specified transformations (normalization, action representation conversion)
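The lookup and slicing steps can be sketched as follows, assuming each entry records a start index and an exclusive end index into the concatenated array. The helper split_by_key and the index values are hypothetical; normalization and action conversion would then be applied to each slice.

```python
# Assumed slice layout for a 6-dimensional concatenated state vector.
STATE_META = {
    "single_arm": {"start": 0, "end": 5},
    "gripper": {"start": 5, "end": 6},
}


def split_by_key(vector, section_meta, modality_keys):
    """Slice a concatenated vector into one sub-vector per configured key."""
    parts = {}
    for key in modality_keys:
        entry = section_meta[key]  # raises KeyError if the key is not in the file
        parts[key] = vector[entry["start"]:entry["end"]]
    return parts


state = [0.1, 0.2, 0.3, 0.4, 0.5, 0.9]
parts = split_by_key(state, STATE_META, ["single_arm", "gripper"])
```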
Registering your configuration
After defining your configuration, register it so it’s available to the training and inference pipelines. Then pass the path to your configuration file with the --modality-config-path argument when running the fine-tuning script.