# Instructions for Preparing Human Hand V-L-A Data
This folder provides essential documentation and scripts for the human hand VLA data used in this project.
**Please note that the metadata we provide may continue to receive updates in the future. Based on manual inspection, the current version achieves roughly 90% annotation accuracy, and we plan to further improve the metadata quality in future updates.**
The contents of this folder are as follows:
## Table of Contents
- [1. Prerequisites](#1-prerequisites)
- [2. Data Download](#2-data-download)
- [3. Video Preprocessing](#3-video-preprocessing)
- [4. Metadata Structure](#4-metadata-structure)
- [5. Data Visualization](#5-data-visualization)
---
## 1. Prerequisites
Our data preprocessing and visualization rely on several dependencies that need to be prepared in advance. If you have already completed the installation steps in **1.2 Visualization Requirements** of the [`readme.md`](../readme.md), you can skip this section.
### Python Libraries
[PyTorch3D](https://github.com/facebookresearch/pytorch3d?tab=readme-ov-file) is required for visualization. You can install it according to the official guide, or simply run the command below:
```bash
pip install --no-build-isolation git+https://github.com/facebookresearch/pytorch3d.git@stable#egg=pytorch3d
```
[FFmpeg](https://github.com/FFmpeg/FFmpeg) is also required for video processing:
```bash
sudo apt install ffmpeg
pip install ffmpeg-python
```
Other Python dependencies can be installed using the following command:
```bash
pip install projectaria_tools smplx
pip install --no-build-isolation git+https://github.com/mattloper/chumpy#egg=chumpy
```
### MANO Hand Model
Our reconstructed hand labels are based on the MANO hand model. **We only require the right hand model.** The model parameters can be downloaded from the [official website](https://mano.is.tue.mpg.de/index.html) and organized in the following structure:
```
weights/
└── mano/
    ├── MANO_RIGHT.pkl
    └── mano_mean_params.npz
```
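After placing the files, you can optionally sanity-check the model with ``smplx``. This is a minimal sketch (not part of the release) and assumes the ``weights/`` folder above sits in your working directory; the exact output shapes depend on the ``smplx`` version:
```python
import torch
import smplx

# Build the right-hand MANO layer from weights/mano/MANO_RIGHT.pkl
mano = smplx.create('weights', model_type='mano', is_rhand=True, use_pca=False)

# Forward pass with a neutral shape and zero pose
out = mano(betas=torch.zeros(1, 10),
           global_orient=torch.zeros(1, 3),
           hand_pose=torch.zeros(1, 45),
           transl=torch.zeros(1, 3))
print(out.vertices.shape, out.joints.shape)  # typically (1, 778, 3) vertices and (1, 16, 3) joints
```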
---
## 2. Data Download
### Meta Information
We provide the metadata for the human V-L-A episodes we constructed, which can be downloaded from [this link](https://huggingface.co/datasets/VITRA-VLA/VITRA-1M). Each metadata entry contains the segmentation information of the corresponding V-L-A episode, language descriptions, as well as reconstructed camera parameters and 3D hand information. The detailed structure of the metadata can be found at [Metadata Structure](#4-metadata-structure). The total size of all metadata is approximately 100 GB.
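For reference, one way to fetch the metadata (~100 GB) is with the ``huggingface_hub`` Python client; this is only a sketch, and you can equally use ``git lfs`` or the web interface. Any archives in the repository still need to be extracted afterwards, as described below:
```python
from huggingface_hub import snapshot_download

# Download the metadata repository into ./Metadata (local_dir is an arbitrary choice)
snapshot_download(repo_id='VITRA-VLA/VITRA-1M',
                  repo_type='dataset',
                  local_dir='Metadata')
```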
After extracting the files, the downloaded metadata will have the following structure:
```
Metadata/
├── {dataset_name1}/
│   ├── episode_frame_index.npz
│   └── episodic_annotations/
│       ├── {dataset_name1}_{video_name1}_ep_{000000}.npy
│       ├── {dataset_name1}_{video_name1}_ep_{000001}.npy
│       ├── {dataset_name1}_{video_name1}_ep_{000002}.npy
│       ├── {dataset_name1}_{video_name2}_ep_{000000}.npy
│       ├── {dataset_name1}_{video_name2}_ep_{000001}.npy
│       └── ...
└── {dataset_name2}/
    └── ...
```
Here, `{dataset_name}` indicates which dataset the episode belongs to, `{video_name}` corresponds to the name of the original raw video, and `ep_{000000}` is the episode's index.
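For example, a minimal sketch (with a hypothetical dataset folder name) that lists the episode files of one dataset:
```python
from pathlib import Path

# 'ego4d' is a hypothetical folder name; use the actual {dataset_name} folders on disk
episodes = sorted(Path('Metadata/ego4d/episodic_annotations').glob('*_ep_*.npy'))
print(f'{len(episodes)} episodes found')
```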
### Videos
Our project currently uses videos collected from four sources: [Ego4D](https://ego4d-data.org/#), [Epic-Kitchen](https://epic-kitchens.github.io/2025), [EgoExo4D](https://ego-exo4d-data.org/#intro), and [Something-Something V2](https://www.qualcomm.com/developer/software/something-something-v-2-dataset). Due to license restrictions, we cannot provide our processed video data directly. To access the data, please apply for and download the original videos from the official dataset websites. Note that we only need the _raw video_ files for this project.
The structure of the downloaded raw data for each dataset is as follows:
- **Ego4D**:
```
Ego4D_root/
└── v2/
    └── full_scale/
        ├── {video_name1}.mp4
        ├── {video_name2}.mp4
        ├── {video_name3}.mp4
        └── ...
```
- **Epic-Kitchen**:
```
Epic-Kitchen_root/
├── P01/
│   └── videos/
│       ├── {video_name1}.MP4
│       ├── {video_name2}.MP4
│       └── ...
├── P02/
│   └── videos/
│       ├── {video_name3}.MP4
│       ├── {video_name4}.MP4
│       └── ...
└── ...
```
- **EgoExo4D**:
```
EgoExo4D_root/
└── takes/
    ├── {video_name1}/
    │   └── frame_aligned_videos/
    │       ├── {cam_name1}.mp4
    │       ├── {cam_name2}.mp4
    │       └── ...
    ├── {video_name2}/
    │   └── frame_aligned_videos/
    │       ├── {cam_name1}.mp4
    │       ├── {cam_name2}.mp4
    │       └── ...
    └── ...
```
- **Somethingsomething-v2**:
```
Somethingsomething-v2_root/
├── {video_name1}.webm
├── {video_name2}.webm
├── {video_name3}.webm
└── ...
```
---
## 3. Video Preprocessing
A large portion of the raw videos in Ego4D and EgoExo4D have fisheye distortion. To standardize the processing, we corrected the fisheye distortion and converted the videos to a pinhole camera model. Our metadata is based on the resulting undistorted videos. To enable reproduction of our data, we provide scripts to perform this undistortion on the original videos.
### Camera Intrinsics
We provide our estimated intrinsics for raw videos in Ego4D (computed using [DroidCalib](https://github.com/boschresearch/DroidCalib) as described in our paper) and the ground-truth Project Aria intrinsics for EgoExo4D (from the [official repository](https://github.com/EGO4D/ego-exo4d-egopose/tree/main/handpose/data_preparation)). These files can be downloaded via [this link](https://huggingface.co/datasets/VITRA-VLA/VITRA-1M/tree/main/intrinsics) and organized as follows:
```
camera_intrinsics_root/
├── ego4d/
│   ├── {video_name1}.npy
│   ├── {video_name2}.npy
│   └── ...
└── egoexo4d/
    ├── {video_name3}.json
    ├── {video_name4}.json
    └── ...
```
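A quick way to inspect these files before running the undistortion is sketched below; the exact field layout inside each file is not documented here, so treat the printed contents as the source of truth:
```python
import json
import numpy as np

# Ego4D intrinsics are stored as .npy files (replace the placeholder with a real video name)
ego4d_intrinsics = np.load('camera_intrinsics_root/ego4d/{video_name1}.npy', allow_pickle=True)
print(type(ego4d_intrinsics), getattr(ego4d_intrinsics, 'shape', None))

# EgoExo4D (Project Aria) intrinsics are stored as .json files
with open('camera_intrinsics_root/egoexo4d/{video_name3}.json') as f:
    egoexo4d_intrinsics = json.load(f)
print(egoexo4d_intrinsics)
```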
### Video Undistortion
Given the raw videos organized according to the structure described in [Data Download](#2-data-download) and the provided camera intrinsics, the fisheye-distorted videos can be undistorted using the following script:
```bash
cd data/preprocessing
# for Ego4D videos
usage: undistort_video.py [-h] --video_root VIDEO_ROOT --intrinsics_root INTRINSICS_ROOT --save_root SAVE_ROOT [--video_start VIDEO_START] [--video_end VIDEO_END] [--batch_size BATCH_SIZE] [--crf CRF]
options:
-h, --help show this help message and exit
--video_root VIDEO_ROOT Folder containing input videos
--intrinsics_root INTRINSICS_ROOT Folder containing intrinsics info
--save_root SAVE_ROOT Folder for saving output videos
--video_start VIDEO_START Start video index (inclusive)
--video_end VIDEO_END End video index (exclusive)
--batch_size BATCH_SIZE Number of frames to be processed per batch (TS chunk)
--crf CRF CRF for ffmpeg encoding quality
```
An example command is:
```bash
# for Ego4D videos
python undistort_video.py --video_root Ego4D_root/v2/full_scale --intrinsics_root camera_intrinsics_root/ego4d --save_root Ego4D_undistorted_root --video_start 0 --video_end 10
```
which processes the first 10 Ego4D videos sequentially and saves the undistorted outputs to ``Ego4D_undistorted_root``.
Similarly, for EgoExo4D videos, you can run a command like:
```bash
# for EgoExo4D videos
python undistort_video_egoexo4d.py --video_root EgoExo4D_root --intrinsics_root camera_intrinsics_root/egoexo4d --save_root EgoExo4D_undistorted_root --video_start 0 --video_end 10
```
Each video is processed in segments according to the specified batch size and then concatenated afterward. Notably, processing the entire dataset is time-consuming and requires substantial storage (around 10 TB). The script provided here is only a basic reference example. **We recommend parallelizing and optimizing it before running it on a compute cluster.**
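As a starting point, the sketch below (not part of the release) shards the video index range across several worker processes on a single machine by launching multiple copies of the script; adjust the paths, chunk size, and worker count to your setup:
```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_chunk(start, end):
    # One undistortion process per index range
    subprocess.run([
        'python', 'undistort_video.py',
        '--video_root', 'Ego4D_root/v2/full_scale',
        '--intrinsics_root', 'camera_intrinsics_root/ego4d',
        '--save_root', 'Ego4D_undistorted_root',
        '--video_start', str(start),
        '--video_end', str(end),
    ], check=True)

# Example: process the first 100 videos in chunks of 10 with 4 concurrent workers
chunks = [(i, i + 10) for i in range(0, 100, 10)]
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(run_chunk, s, e) for s, e in chunks]
    for f in futures:
        f.result()  # propagate errors from the worker processes
```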
**The undistortion step is only applied to Ego4D and EgoExo4D videos. Epic-Kitchen and Somethingsomething-v2 do not require undistortion and can be used directly as downloaded from the official sources.**
---
## 4. Metadata Structure
Our metadata for each V-L-A episode can be loaded via:
```python
import numpy as np
# Load the metadata dictionary for one episode (replace the placeholders with an actual file name)
episode_info = np.load('{dataset_name}_{video_name}_ep_000000.npy', allow_pickle=True).item()
```
The detailed structure of ``episode_info`` is as follows:
```
episode_info (dict)                                # Metadata for a single V-L-A episode
├── 'video_clip_id_segment': list[int]             # Deprecated
├── 'extrinsics': np.ndarray                       # (Tx4x4) World2Cam extrinsic matrices
├── 'intrinsics': np.ndarray                       # (3x3) Camera intrinsic matrix
├── 'video_decode_frame': list[int]                # Frame indices in the original raw video (starting from 0)
├── 'video_name': str                              # Original raw video name
├── 'avg_speed': float                             # Average wrist movement per frame (in meters)
├── 'total_rotvec_degree': float                   # Total camera rotation over the episode (in degrees)
├── 'total_transl_dist': float                     # Total camera translation distance over the episode (in meters)
├── 'anno_type': str                               # Annotation type, specifying the primary hand action considered when segmenting the episode
├── 'text': dict                                   # Textual descriptions for the episode
│   ├── 'left': List[(str, (int, int))]            # Each entry contains (description, (start_frame_in_episode, end_frame_in_episode))
│   └── 'right': List[(str, (int, int))]           # Same structure for the right hand
├── 'text_rephrase': dict                          # Rephrased textual descriptions from GPT-4
│   ├── 'left': List[(List[str], (int, int))]      # Each entry contains (list of rephrased descriptions, (start_frame_in_episode, end_frame_in_episode))
│   └── 'right': List[(List[str], (int, int))]     # Same as above for the right hand
├── 'left': dict                                   # Left hand 3D pose info
│   ├── 'beta': np.ndarray                         # (10) MANO hand shape parameters (based on the MANO_RIGHT model)
│   ├── 'global_orient_camspace': np.ndarray       # (Tx3x3) Hand wrist rotations from MANO's canonical space to camera space
│   ├── 'global_orient_worldspace': np.ndarray     # (Tx3x3) Hand wrist rotations from MANO's canonical space to world space
│   ├── 'hand_pose': np.ndarray                    # (Tx15x3x3) Local hand joint rotations (based on the MANO_RIGHT model)
│   ├── 'transl_camspace': np.ndarray              # (Tx3) Hand wrist translation in camera space
│   ├── 'transl_worldspace': np.ndarray            # (Tx3) Hand wrist translation in world space
│   ├── 'kept_frames': list[int]                   # (T) 0/1 mask of valid left-hand reconstruction frames
│   ├── 'joints_camspace': np.ndarray              # (Tx21x3) 3D hand joint positions in camera space
│   ├── 'joints_worldspace': np.ndarray            # (Tx21x3) 3D hand joint positions in world space
│   ├── 'wrist': np.ndarray                        # Deprecated
│   ├── 'max_translation_movement': float          # Deprecated
│   ├── 'max_wrist_rotation_movement': float       # Deprecated
│   └── 'max_finger_joint_angle_movement': float   # Deprecated
└── 'right': dict                                  # Right hand 3D pose info (same structure as 'left')
    ├── 'beta': np.ndarray
    ├── 'global_orient_camspace': np.ndarray
    ├── 'global_orient_worldspace': np.ndarray
    ├── 'hand_pose': np.ndarray
    ├── 'transl_camspace': np.ndarray
    ├── 'transl_worldspace': np.ndarray
    ├── 'kept_frames': list[int]
    ├── 'joints_camspace': np.ndarray
    ├── 'joints_worldspace': np.ndarray
    ├── 'wrist': np.ndarray
    ├── 'max_translation_movement': float
    ├── 'max_wrist_rotation_movement': float
    └── 'max_finger_joint_angle_movement': float
```
To better understand how to use the episode metadata, we provide a visualization script, as described in the next section.
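As a quick illustration before turning to the full visualization code, the sketch below (assuming the field layout above and the undistorted pinhole videos) projects the right-hand world-space joints of one frame into pixel coordinates:
```python
import numpy as np

episode_info = np.load('path/to/episode.npy', allow_pickle=True).item()  # any episode .npy file

t = 0                                            # frame index within the episode
right = episode_info['right']
if right['kept_frames'][t]:                      # only use frames with a valid right-hand reconstruction
    joints_world = right['joints_worldspace'][t]                         # (21, 3)
    world2cam = episode_info['extrinsics'][t]                            # (4, 4) World2Cam matrix
    joints_h = np.concatenate([joints_world, np.ones((21, 1))], axis=1)  # homogeneous coordinates
    joints_cam = (world2cam @ joints_h.T).T[:, :3]                       # should match right['joints_camspace'][t]
    K = episode_info['intrinsics']                                       # (3, 3)
    uv = (K @ joints_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                                          # pixel coordinates in the undistorted frame
```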
---
## 5. Data Visualization
Our metadata for each episode can be visualized with the following command, which will generate a video in the same format as shown on [our webpage](https://microsoft.github.io/VITRA/).
We recommend following the undistortion procedure described above: place all undistorted videos in a single ``video_root`` folder, store the corresponding metadata in a ``label_root`` folder, and then run the visualization script.
```bash
usage: data/demo_visualization_epi.py [-h] --video_root VIDEO_ROOT --label_root LABEL_ROOT --save_path SAVE_PATH --mano_model_path MANO_MODEL_PATH [--render_gradual_traj]
options:
-h, --help show this help message and exit
--video_root VIDEO_ROOT Root directory containing the video files
--label_root LABEL_ROOT Root directory containing the episode label (.npy) files
--save_path SAVE_PATH Directory to save the output visualization videos
--mano_model_path MANO_MODEL_PATH Path to the MANO model files
--render_gradual_traj Set flag to render a gradual trajectory (full mode)
```
We provide an example command for running the script, as well as a sample for visualization (replace ``MANO_MODEL_PATH`` with the path to your MANO model files from [Prerequisites](#1-prerequisites)):
```bash
python data/demo_visualization_epi.py --video_root data/examples/videos --label_root data/examples/annotations --save_path data/examples/visualize --mano_model_path MANO_MODEL_PATH --render_gradual_traj
```
Note that using ``--render_gradual_traj`` renders the hand trajectory from the current frame to the end of the episode for every frame, which can be slow. To speed up visualization, you may omit this option.
For a more detailed understanding of the metadata, please see ``visualization/visualize_core.py``.