# Instructions for Preparing Human Hand V-L-A Data

This folder provides the essential documentation and scripts for the human hand V-L-A data used in this project.
**Please note that the metadata we provide may continue to receive updates. Based on manual inspection, the current version achieves roughly 90% annotation accuracy, and we plan to further improve metadata quality in future releases.**

The contents of this folder are as follows:

## πŸ“‘ Table of Contents
- [1. Prerequisites](#1-prerequisites)
- [2. Data Download](#2-data-download)
- [3. Video Preprocessing](#3-video-preprocessing)
- [4. Metadata Structure](#4-metadata-structure)
- [5. Data Visualization](#5-data-visualization)

---
## 1. Prerequisites
Our data preprocessing and visualization rely on several dependencies that need to be prepared in advance. If you have already completed the installation steps in **1.2 Visualization Requirements** of the [`readme.md`](../readme.md), you can skip this section.

### Python Libraries
[PyTorch3D](https://github.com/facebookresearch/pytorch3d?tab=readme-ov-file) is required for visualization. You can install it according to the official guide, or simply run the command below:
```bash
pip install --no-build-isolation git+https://github.com/facebookresearch/pytorch3d.git@stable#egg=pytorch3d  
```
[FFmpeg](https://github.com/FFmpeg/FFmpeg) is also required for video processing:
```bash
sudo apt install ffmpeg
pip install ffmpeg-python
```

Other Python dependencies can be installed using the following command:
```bash
pip install projectaria_tools smplx
pip install --no-build-isolation git+https://github.com/mattloper/chumpy#egg=chumpy
```
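
After installation, a quick way to confirm the environment is ready is to import the key packages (a minimal sanity check; the version printing is only illustrative):
```python
# Quick sanity check that the main dependencies import correctly.
import torch
import pytorch3d
import ffmpeg              # provided by the ffmpeg-python package
import projectaria_tools
import smplx

print("torch:", torch.__version__)
print("pytorch3d:", pytorch3d.__version__)
```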
### MANO Hand Model

Our reconstructed hand labels are based on the MANO hand model. **We only require the right hand model.** The model parameters can be downloaded from the [official website](https://mano.is.tue.mpg.de/index.html) and organized in the following structure:
```
weights/
└── mano/
    β”œβ”€β”€ MANO_RIGHT.pkl
    └── mano_mean_params.npz
```
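
Once the files are in place, the right-hand MANO model can be loaded through `smplx`, for example (a minimal sketch; the options shown here are illustrative and may differ from the ones used by the scripts in this repository):
```python
import torch
import smplx

# Load weights/mano/MANO_RIGHT.pkl; smplx appends the 'mano' subfolder
# to the given directory automatically.
mano = smplx.create(
    "weights",
    model_type="mano",
    is_rhand=True,
    use_pca=False,        # full 45-dim axis-angle hand pose instead of PCA
    flat_hand_mean=True,  # illustrative choice, not necessarily what our scripts use
)

# Forward pass with neutral shape and zero pose.
out = mano(
    betas=torch.zeros(1, 10),
    global_orient=torch.zeros(1, 3),
    hand_pose=torch.zeros(1, 45),   # 15 joints x 3 axis-angle parameters
    transl=torch.zeros(1, 3),
)
print(out.vertices.shape)           # (1, 778, 3) MANO hand vertices
```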

---

## 2. Data Download

### Meta Information

We provide the metadata for the human V-L-A episodes we constructed, which can be downloaded from [this link](https://huggingface.co/datasets/VITRA-VLA/VITRA-1M). Each metadata entry contains the segmentation information of the corresponding V-L-A episode, language descriptions, as well as reconstructed camera parameters and 3D hand information. The detailed structure of the metadata can be found at [Metadata Structure](#4-metadata-structure). The total size of all metadata is approximately 100 GB.

After extracting the files, the downloaded metadata will have the following structure:
```
Metadata/
β”œβ”€β”€ {dataset_name1}/
β”‚   β”œβ”€β”€ episode_frame_index.npz
β”‚   └── episodic_annotations/
β”‚       β”œβ”€β”€ {dataset_name1}_{video_name1}_ep_{000000}.npy
β”‚       β”œβ”€β”€ {dataset_name1}_{video_name1}_ep_{000001}.npy
β”‚       β”œβ”€β”€ {dataset_name1}_{video_name1}_ep_{000002}.npy
β”‚       β”œβ”€β”€ {dataset_name1}_{video_name2}_ep_{000000}.npy
β”‚       β”œβ”€β”€ {dataset_name1}_{video_name2}_ep_{000001}.npy
β”‚       └── ...
β”œβ”€β”€ {dataset_name2}/
β”‚   └── ...
```
Here, ``{dataset_name}`` indicates which dataset the episode belongs to, ``{video_name}`` is the name of the original raw video, and ``ep_{000000}`` is the index of the episode within that video.
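
Since only the raw videos that are actually referenced by the metadata are needed, the set of required ``{video_name}``s for a dataset can be recovered from the annotation filenames, for example (a sketch that assumes the filenames follow the convention above exactly; paths are placeholders):
```python
from pathlib import Path

def referenced_videos(metadata_root: str, dataset_name: str) -> set:
    """Collect the raw video names referenced by one dataset's episode annotations."""
    ann_dir = Path(metadata_root) / dataset_name / "episodic_annotations"
    videos = set()
    for path in ann_dir.glob("*.npy"):
        stem = path.stem                            # '{dataset_name}_{video_name}_ep_{index}'
        stem = stem[len(dataset_name) + 1:]         # strip the leading '{dataset_name}_'
        video_name, _, _ = stem.rpartition("_ep_")  # strip the trailing '_ep_{index}'
        videos.add(video_name)
    return videos

# Example (placeholder dataset name):
print(sorted(referenced_videos("Metadata", "ego4d")))
```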

### Videos

Our project currently uses videos collected from four sources: [Ego4D](https://ego4d-data.org/#), [Epic-Kitchen](https://epic-kitchens.github.io/2025), [EgoExo4D](https://ego-exo4d-data.org/#intro), and [Something-Something V2](https://www.qualcomm.com/developer/software/something-something-v-2-dataset). Due to license restrictions, we cannot provide our processed video data directly. To access the data, please apply for and download the original videos from the official dataset websites. Note that we only need the _raw video_ files for this project.

The structure of the downloaded raw data for each dataset is as follows:
- **Ego4D**:  
```
Ego4D_root/
└── v2/
    └── full_scale/
        β”œβ”€β”€ {video_name1}.mp4
        β”œβ”€β”€ {video_name2}.mp4
        β”œβ”€β”€ {video_name3}.mp4
        └── ...
```
- **Epic-Kitchen**:  
```
Epic-Kitchen_root/
β”œβ”€β”€ P01/
β”‚   └── videos/
β”‚       β”œβ”€β”€ {video_name1}.MP4
β”‚       β”œβ”€β”€ {video_name2}.MP4
β”‚       └── ...
β”œβ”€β”€ P02/
β”‚   └── videos/
β”‚       β”œβ”€β”€ {video_name3}.MP4
β”‚       β”œβ”€β”€ {video_name4}.MP4
β”‚       └── ...
└── ...
```
- **EgoExo4D**:  
```
EgoExo4D_root/
└── takes/
    β”œβ”€β”€ {video_name1}/
    β”‚   └── frame_aligned_videos/
    β”‚       β”œβ”€β”€ {cam_name1}.mp4
    β”‚       β”œβ”€β”€ {cam_name2}.mp4
    β”‚       └── ...
    β”œβ”€β”€ {video_name2}/
    β”‚   └── frame_aligned_videos/
    β”‚       β”œβ”€β”€ {cam_name1}.mp4
    β”‚       β”œβ”€β”€ {cam_name2}.mp4
    β”‚       └── ...
    └── ...
```
- **Somethingsomething-v2**:  
```
Somethingsomething-v2_root/
β”œβ”€β”€ {video_name1}.webm
β”œβ”€β”€ {video_name2}.webm
β”œβ”€β”€ {video_name3}.webm
└── ...
```
---

## 3. Video Preprocessing

A large portion of the raw videos in Ego4D and EgoExo4D exhibits fisheye distortion. To standardize the processing, we corrected the fisheye distortion and converted the videos to a pinhole camera model. Our metadata is based on the resulting undistorted videos. To enable reproduction of our data, we provide scripts to perform this undistortion on the original videos.

### Camera Intrinsics

We provide our estimated intrinsics for raw videos in Ego4D (computed using [DroidCalib](https://github.com/boschresearch/DroidCalib) as described in our paper) and the ground-truth Project Aria intrinsics for EgoExo4D (from the [official repository](https://github.com/EGO4D/ego-exo4d-egopose/tree/main/handpose/data_preparation)). These files can be downloaded via [this link](https://huggingface.co/datasets/VITRA-VLA/VITRA-1M/tree/main/intrinsics) and organized as follows:
```
camera_intrinsics_root/
β”œβ”€β”€ ego4d/
β”‚   β”œβ”€β”€ {video_name1}.npy
β”‚   β”œβ”€β”€ {video_name2}.npy
β”‚   └── ...
└── egoexo4d/
    β”œβ”€β”€ {video_name3}.json
    β”œβ”€β”€ {video_name4}.json
    └── ...
```
### Video Undistortion
Given the raw videos organized according to the structure described in [Data Download](#2-data-download) and the provided camera intrinsics, the fisheye-distorted videos can be undistorted using the following script:
```bash
cd data/preprocessing

# for Ego4D videos
usage: undistort_video.py [-h] --video_root VIDEO_ROOT --intrinsics_root INTRINSICS_ROOT --save_root SAVE_ROOT [--video_start VIDEO_START] [--video_end VIDEO_END] [--batch_size BATCH_SIZE] [--crf CRF]

options:
  -h, --help                            show this help message and exit
  --video_root VIDEO_ROOT               Folder containing input videos
  --intrinsics_root INTRINSICS_ROOT     Folder containing intrinsics info
  --save_root SAVE_ROOT                 Folder for saving output videos
  --video_start VIDEO_START             Start video index (inclusive)
  --video_end VIDEO_END                 End video index (exclusive)
  --batch_size BATCH_SIZE               Number of frames to be processed per batch (TS chunk)
  --crf CRF                             CRF for ffmpeg encoding quality
```

An example command is:
```bash
# for Ego4D videos
python undistort_video.py --video_root Ego4D_root/v2/full_scale --intrinsics_root camera_intrinsics_root/ego4d --save_root Ego4D_undistorted_root --video_start 0 --video_end 10
```
which processes the first 10 Ego4D videos sequentially and saves the undistorted outputs to ``Ego4D_undistorted_root``.

Similarly, for EgoExo4D videos, you can run a command like:
```bash
# for EgoExo4D videos
python undistort_video_egoexo4d.py --video_root EgoExo4D_root --intrinsics_root camera_intrinsics_root/egoexo4d --save_root EgoExo4D_undistorted_root --video_start 0 --video_end 10
```

Each video is processed in segments according to the specified batch size, and the segments are then concatenated. Note that processing the entire dataset is time-consuming and requires substantial storage (around 10 TB). The script provided here is only a basic reference example. **We recommend parallelizing and optimizing it before running it on a compute cluster.**
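
As a rough starting point, here is a hedged sketch of launching several disjoint index ranges in parallel on a single machine (paths and chunk sizes are placeholders; on a cluster, submitting one range per job is preferable):
```python
import subprocess

CHUNK = 100      # number of videos per worker (placeholder)
N_WORKERS = 4    # number of concurrent workers (placeholder)

procs = []
for i in range(N_WORKERS):
    cmd = [
        "python", "undistort_video.py",
        "--video_root", "Ego4D_root/v2/full_scale",
        "--intrinsics_root", "camera_intrinsics_root/ego4d",
        "--save_root", "Ego4D_undistorted_root",
        "--video_start", str(i * CHUNK),
        "--video_end", str((i + 1) * CHUNK),
    ]
    procs.append(subprocess.Popen(cmd))

# Wait for all chunks to finish.
for p in procs:
    p.wait()
```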

**The undistortion step is only applied to Ego4D and EgoExo4D videos. Epic-Kitchen and Somethingsomething-v2 do not require undistortion and can be used directly as downloaded from the official sources.**

---

## 4. Metadata Structure
Our metadata for each V-L-A episode can be loaded via:
```python
import numpy as np

# Load the metadata dictionary of one episode
# (replace the placeholder with an actual file path)
episode_info = np.load('{dataset_name1}_{video_name1}_ep_{000000}.npy', allow_pickle=True).item()
```
The detailed structure of the ``episode_info`` is as follows:
```
episode_info (dict)                                 # Metadata for a single V-L-A episode
β”œβ”€β”€ 'video_clip_id_segment': list[int]              # Deprecated
β”œβ”€β”€ 'extrinsics': np.ndarray                        # (Tx4x4) World2Cam extrinsic matrix
β”œβ”€β”€ 'intrinsics': np.ndarray                        # (3x3) Camera intrinsic matrix
β”œβ”€β”€ 'video_decode_frame': list[int]                 # Frame indices in the original raw video (starting from 0)
β”œβ”€β”€ 'video_name': str                               # Original raw video name
β”œβ”€β”€ 'avg_speed': float                              # Average wrist movement per frame (in meters)
β”œβ”€β”€ 'total_rotvec_degree': float                    # Total camera rotation over the episode (in degrees)
β”œβ”€β”€ 'total_transl_dist': float                      # Total camera translation distance over the episode (in meters)
β”œβ”€β”€ 'anno_type': str                                # Annotation type, specifying the primary hand action considered when segmenting the episode
β”œβ”€β”€ 'text': (dict)                                  # Textual descriptions for the episode
β”‚     β”œβ”€β”€ 'left': List[(str, (int, int))]           # Each entry contains (description, (start_frame_in_episode, end_frame_in_episode))
β”‚     └── 'right': List[(str, (int, int))]          # Same structure for the right hand
β”œβ”€β”€ 'text_rephrase': (dict)                         # Rephrased textual descriptions from GPT-4
β”‚     β”œβ”€β”€ 'left': List[(List[str], (int, int))]     # Each entry contains (list of rephrased descriptions, (start_frame_in_episode, end_frame_in_episode))
β”‚     └── 'right': List[(List[str], (int, int))]    # Same as above for the right hand
β”œβ”€β”€ 'left' (dict)                                   # Left hand 3D pose info
β”‚   β”œβ”€β”€ 'beta': np.ndarray                          # (10) MANO hand shape parameters (based on the MANO_RIGHT model)
β”‚   β”œβ”€β”€ 'global_orient_camspace': np.ndarray        # (Tx3x3) Hand wrist rotations from MANO's canonical space to camera space
β”‚   β”œβ”€β”€ 'global_orient_worldspace': np.ndarray      # (Tx3x3) Hand wrist rotations from MANO's canonical space to world space
β”‚   β”œβ”€β”€ 'hand_pose': np.ndarray                     # (Tx15x3x3) Local hand joints rotations (based on the MANO_RIGHT model)
β”‚   β”œβ”€β”€ 'transl_camspace': np.ndarray               # (Tx3) Hand wrist translation in camera space
β”‚   β”œβ”€β”€ 'transl_worldspace': np.ndarray             # (Tx3) Hand wrist translation in world space
β”‚   β”œβ”€β”€ 'kept_frames': list[int]                    # (T) 0–1 mask of valid left-hand reconstruction frames
β”‚   β”œβ”€β”€ 'joints_camspace': np.ndarray               # (Tx21x3) 3D hand joint positions in camera space
β”‚   β”œβ”€β”€ 'joints_worldspace': np.ndarray             # (Tx21x3) 3D joint positions in world space
β”‚   β”œβ”€β”€ 'wrist': np.ndarray                         # Deprecated
β”‚   β”œβ”€β”€ 'max_translation_movement': float           # Deprecated
β”‚   β”œβ”€β”€ 'max_wrist_rotation_movement': float        # Deprecated
β”‚   └── 'max_finger_joint_angle_movement': float    # Deprecated
└── 'right' (dict)                                  # Right hand 3D pose info (same structure as 'left')
    β”œβ”€β”€ 'beta': np.ndarray
    β”œβ”€β”€ 'global_orient_camspace': np.ndarray
    β”œβ”€β”€ 'global_orient_worldspace': np.ndarray
    β”œβ”€β”€ 'hand_pose': np.ndarray
    β”œβ”€β”€ 'transl_camspace': np.ndarray
    β”œβ”€β”€ 'transl_worldspace': np.ndarray
    β”œβ”€β”€ 'kept_frames': list[int]
    β”œβ”€β”€ 'joints_camspace': np.ndarray
    β”œβ”€β”€ 'joints_worldspace': np.ndarray
    β”œβ”€β”€ 'wrist': np.ndarray
    β”œβ”€β”€ 'max_translation_movement': float
    β”œβ”€β”€ 'max_wrist_rotation_movement': float
    └── 'max_finger_joint_angle_movement': float
```
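
For example, the world-space and camera-space quantities are tied together by the per-frame ``extrinsics``. Below is a minimal sketch of projecting the right-hand 3D joints of one valid frame into the image (the file path is a placeholder):
```python
import numpy as np

episode_info = np.load("{dataset_name}_{video_name}_ep_000000.npy",
                       allow_pickle=True).item()

E = episode_info["extrinsics"]                            # (T, 4, 4) World2Cam
K = episode_info["intrinsics"]                            # (3, 3)
joints_w = episode_info["right"]["joints_worldspace"]     # (T, 21, 3)
kept = np.asarray(episode_info["right"]["kept_frames"])   # (T,) 0/1 validity mask

t = int(np.flatnonzero(kept)[0])   # first frame with a valid right-hand reconstruction

# World space -> camera space with the World2Cam extrinsic matrix;
# the result should closely match episode_info['right']['joints_camspace'][t].
joints_h = np.concatenate([joints_w[t], np.ones((21, 1))], axis=1)   # (21, 4)
joints_cam = (E[t] @ joints_h.T).T[:, :3]                            # (21, 3)

# Pinhole projection with the camera intrinsics.
uv = (K @ joints_cam.T).T
uv = uv[:, :2] / uv[:, 2:3]        # (21, 2) pixel coordinates
print(uv)
```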
To better understand how to use the episode metadata, we provide a visualization script, as described in the next section.

---

## 5. Data Visualization
Our metadata for each episode can be visualized with the following command, which generates a video in the same format as shown on [our webpage](https://microsoft.github.io/VITRA/).  
We recommend following the undistortion procedure described above, placing all undistorted videos in a single ``video_root`` folder, storing the corresponding metadata in a ``label_root`` folder, and then running the visualization script.

```bash
usage: data/demo_visualization_epi.py [-h] --video_root VIDEO_ROOT --label_root LABEL_ROOT --save_path SAVE_PATH --mano_model_path MANO_MODEL_PATH [--render_gradual_traj]

options:
  -h, --help                            show this help message and exit
  --video_root VIDEO_ROOT               Root directory containing the video files
  --label_root LABEL_ROOT               Root directory containing the episode label (.npy) files
  --save_path SAVE_PATH                 Directory to save the output visualization videos
  --mano_model_path MANO_MODEL_PATH     Path to the MANO model files
  --render_gradual_traj                 Set flag to render a gradual trajectory (full mode)
```
We provide an example command for running the script, together with a sample episode for visualization:
```bash
python data/demo_visualization_epi.py --video_root data/examples/videos --label_root data/examples/annotations --save_path data/examples/visualize --mano_model_path MANO_MODEL_PATH --render_gradual_traj
```
Note that using ``--render_gradual_traj`` renders the hand trajectory from the current frame to the end of the episode for every frame, which can be slow. To speed up visualization, you may omit this option.


For a more detailed understanding of the metadata, please see ``visualization/visualize_core.py``.