---
title: Diving Into The Fusion Of Monocular Priors For Generalized Stereo Matching
emoji: π»
colorFrom: red
colorTo: indigo
sdk: gradio
sdk_version: 5.38.0
app_file: app.py
pinned: false
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# [ICCV25] Diving into the Fusion of Monocular Priors for Generalized Stereo Matching
Detailed images can be found at [Google Drive](https://drive.google.com/file/d/1u2u_-AgxkdtnkQENEf1d2JjtutwrtCPb/view?usp=sharing)
<!-- > ⚠️ **Warning**: It is highly recommended to view this markdown in a preview format! -->
<!-- > ⚠️ **Warning**: We strongly recommend researchers retrain the model on GPUs other than A40 for better results. -->
## Requirements
```Shell
conda env create -f envs/environment_GStereo.yaml
conda activate raftstereo
```
## Required Data
```Shell
└── datasets
    ├── sceneflow
    │   ├── driving
    │   │   ├── disparity
    │   │   ├── frames_cleanpass
    │   │   └── frames_finalpass
    │   ├── flying3d
    │   │   ├── disparity
    │   │   ├── frames_cleanpass
    │   │   └── frames_finalpass
    │   └── monkaa
    │       ├── disparity
    │       ├── frames_cleanpass
    │       └── frames_finalpass
    ├── Kitti15
    │   ├── testing
    │   │   ├── image_2
    │   │   └── image_3
    │   └── training
    │       ├── disp_noc_0
    │       ├── disp_noc_1
    │       ├── disp_occ_0
    │       ├── disp_occ_1
    │       ├── flow_noc
    │       ├── flow_occ
    │       ├── image_2
    │       ├── image_3
    │       └── obj_map
    ├── Kitti12
    │   ├── testing
    │   │   ├── calib
    │   │   ├── colored_0
    │   │   ├── colored_1
    │   │   ├── disp_noc
    │   │   ├── disp_occ
    │   │   ├── flow_noc
    │   │   ├── flow_occ
    │   │   ├── image_0
    │   │   └── image_1
    │   └── training
    │       ├── calib
    │       ├── colored_0
    │       └── colored_1
    ├── Middlebury
    │   └── MiddEval3
    │       ├── testF
    │       ├── testH
    │       ├── testQ
    │       ├── trainingF
    │       ├── trainingH
    │       └── trainingQ
    ├── ETH3D
    │   ├── two_view_testing
    │   └── two_view_training
    │       ├── delivery_area_1l
    │       ├── delivery_area_1s
    │       └── delivery_area_2l
    └── Booster
        ├── test
        │   ├── balanced
        │   └── unbalanced
        └── train
            ├── balanced
            └── unbalanced
```
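Before launching training, it can be worth checking that the layout above is actually in place. The following is a minimal sketch (not part of the repo; the `./datasets` root is an assumption) that reports any missing top-level dataset folders:

```python
import os

# Expected top-level folders, taken from the directory tree above.
EXPECTED = ["sceneflow", "Kitti15", "Kitti12", "Middlebury", "ETH3D", "Booster"]

def missing_datasets(root="./datasets"):
    """Return the expected dataset folders not found under `root`."""
    return [d for d in EXPECTED if not os.path.isdir(os.path.join(root, d))]

print(missing_datasets())  # an empty list means everything is in place
```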
## Code
All code is provided here, including DepthAnything v2.
Since we modified `dpt.py` to expose intermediate features in addition to the depth output, please use the modified version.
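To illustrate the kind of change involved (names and structure here are illustrative, not the actual DepthAnything v2 API): a stock forward pass returns only the final depth, while the modified one also collects each stage's output:

```python
# Toy model whose "layers" are plain functions; the pattern of collecting
# intermediate outputs alongside the final result is what matters here.
class TinyDPT:
    def __init__(self):
        self.layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

    def forward(self, x):
        intermediates = []
        for layer in self.layers:
            x = layer(x)
            intermediates.append(x)  # keep every stage's output
        depth = x  # the last stage's result plays the role of the depth map
        return intermediates, depth

feats, depth = TinyDPT().forward(5)
# feats holds one entry per stage; depth equals the final entry
```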
- ### Training
The training scripts are provided in [script/train_stereo_raftstereo.sh](script/train_stereo_raftstereo.sh) and [script/train_stereo_raftstereo_depthany.sh](script/train_stereo_raftstereo_depthany.sh).
Please set the following variables in the scripts before training.
| variable | meaning |
|---------------|----------------------|
| `NCCL_P2P_DISABLE` | We set `NCCL_P2P_DISABLE=1` because distributed training failed on our `A40` GPUs. |
| `CUDA_VISIBLE_DEVICES` | available GPU ids, e.g., `CUDA_VISIBLE_DEVICES=0,1,2,3` |
| `DATASET_ROOT` | the training dataset path, e.g., `./datasets/sceneflow` |
| `LOG_ROOT` | path to save log file |
| `TB_ROOT` | path to save tensorboard data |
| `CKPOINT_ROOT` | path to save checkpoint |
To reproduce our results, please download `depth_anything_v2_vitl.pth` from DepthAnything v2 before training, and set `--depthany_model_dir` in the shell script to the directory where `depth_anything_v2_vitl.pth` is saved. We do not provide the download link here, as it may conflict with the CVPR guidelines.
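Putting the variables above together, a script preamble might look like the following (all paths are illustrative examples, not the repo's defaults):

```shell
export NCCL_P2P_DISABLE=1            # work around distributed-training failures seen on A40 GPUs
export CUDA_VISIBLE_DEVICES=0,1,2,3  # GPUs to use
DATASET_ROOT=./datasets/sceneflow    # training dataset path
LOG_ROOT=./runs/log                  # where log files are saved
TB_ROOT=./runs/tboard                # where tensorboard data is saved
CKPOINT_ROOT=./runs/ckpoint          # where checkpoints are saved
DEPTHANY_DIR=./weights               # directory containing depth_anything_v2_vitl.pth
```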
We also explain the code for the ablation study; each experiment is mostly controlled by the `--model_name` flag used in the training script.
| `--model_name` | meaning |
|-----------------|-------------------------|
| `RaftStereo` | Original RaftStereo model |
| `RaftStereoDisp` | The output of GRU is a single channel for disparity instead of two channels for optical flow, `Baseline` in Table 3 of the main text. |
| `RAFTStereoMast3r` | The pre-trained MASt3R is used as the backbone, and its features are used for cost volume construction, `RaftStereo + backbone Mast3r` in supplemental text. |
| `RaftStereoNoCTX` | RaftStereo model without context network, `Baseline w/o mono feature` in Table 3 of the main text. |
| `RAFTStereoDepthAny` | RaftStereo model with our monocular encoder, `Baseline + ME` in Table 3 of the main text. |
| `RAFTStereoDepthFusion` | RaftStereo model with our monocular encoder, `Baseline + ME + IDF` in Table 3 of the main text. |
| `RAFTStereoDepthBeta` | RaftStereo model with our monocular encoder and iterative local fusion, `Baseline + ME + ILF` in Table 3 of the main text. |
| `RAFTStereoDepthBetaNoLBP` | RaftStereo model with our monocular encoder and iterative local fusion without LBPEncoder, `L(6)` and `L(7)` in Table 4 of the main text. |
| `RAFTStereoDepthMatch` | RaftStereo model with DepthAnything v2 as feature extractor for cost volume construction, `RaftStereo + backbone DepthAnything` in the supplemental text. |
| `RAFTStereoDepthPostFusion` | RaftStereo model with our monocular encoder, iterative local fusion and post fusion, `Baseline + ME + PF` in Table 3 of the main text. |
| `RAFTStereoDepthBetaRefine` | RaftStereo model with our monocular encoder, iterative local fusion, and global fusion, `Baseline + ME + ILF + GF` in Table 3 of the main text. |
| variable | meaning |
|--------------------------|-------------------------|
| `--lbp_neighbor_offsets` | controls the `LBP Kernel` used in Table 4 of the main text. |
| `--modulation_ratio` | controls the amplitude parameter `r` used in Table 4 of the main text. |
| `--conf_from_fea` | selects `Cost` or `Hybrid` for `Confidence` in Table 4 of the main text. |
| `--refine_pool` | learns registration parameters via pooling, as described in the supplemental text. |
Training is launched as follows:
```Shell
bash ./script/train_stereo_raftstereo_depthany.sh EXP_NAME
```
`EXP_NAME` specifies the experiment name. We use this name to save each log file, tensorboard data, and checkpoint for different experiments. The corresponding file structure is as follows
```Shell
└── runs
    ├── ckpoint
    │   ├── RaftStereoDepthAny
    │   ├── RaftStereoMast3r
    │   └── RaftStereoNoCTX
    ├── log
    │   ├── RaftStereoDepthAny
    │   ├── RaftStereoMast3r
    │   └── RaftStereoNoCTX
    └── tboard
        ├── RaftStereoDepthAny
        ├── RaftStereoMast3r
        └── RaftStereoNoCTX
```
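The naming convention above is mechanical: each `EXP_NAME` becomes a subfolder of `ckpoint`, `log`, and `tboard`. A hypothetical helper (not part of the repo) capturing this mapping:

```python
import os.path

def exp_paths(exp_name, root="./runs"):
    """Map an experiment name to its checkpoint/log/tensorboard directories."""
    return {kind: os.path.join(root, kind, exp_name)
            for kind in ("ckpoint", "log", "tboard")}

paths = exp_paths("RaftStereoDepthAny")
# paths["log"] is "./runs/log/RaftStereoDepthAny"
```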
> ⚠️ **Warning**: **Please follow the training process described in our main text.** We first train the model without the global fusion module. Then, we train the monocular registration part of the global fusion module, initializing from the well-trained first-stage model and keeping the other modules frozen. Finally, we train the entire global fusion module, initializing from the second-stage model and again keeping the other modules frozen.
- ### Evaluation
The evaluation script is presented in [script/evaluate_stereo_raftstereo.sh](script/evaluate_stereo_raftstereo.sh).
We use `--test_exp_name` to specify the evaluation experiment name.
The results of each experiment are stored in `LOG_ROOT/eval.xlsx`. We also merge the results of all experiments into `LOG_ROOT/merged_eval.xlsx` via `python3 merge_sheet.py`.
The evaluation metrics are the same across methods.
The `mean ± std` is computed via [tools/get_statistics.py](tools/get_statistics.py).
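As a minimal sketch of the `mean ± std` computation (the metric values below are made up; [tools/get_statistics.py](tools/get_statistics.py) is the authoritative implementation):

```python
import statistics

def mean_std(values):
    """Format per-run metric values as `mean ± std`."""
    mean = statistics.mean(values)
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return f"{mean:.2f} ± {std:.2f}"

print(mean_std([1.24, 1.30, 1.27]))  # prints "1.27 ± 0.03"
```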
- ### Visualization
We visualize the error map via [script/gen_sample_stereo_raftstereo.sh](script/gen_sample_stereo_raftstereo.sh) and intermediate results via [script/vis_inter_stereo_raftstereo.sh](script/vis_inter_stereo_raftstereo.sh).
We provide an easy-to-use visualization toolbox to fully understand each module.
- ### Demo
The model weights, pre-trained on SceneFlow, can be downloaded from [Google Drive](https://drive.google.com/file/d/1T1o7soh3p4C_tHzmUd0ZCtnQbVczPmXz/view?usp=sharing).
The demo used to infer disparity maps from custom image pairs is presented in `infer_stereo_raftstereo.py`. For specific usage, please refer to `script/infer_stereo_raftstereo.sh`.
## More Results
Results after training with our custom synthetic dataset [Trans Dataset](https://github.com/BFZD233/TranScene), which is built for multi-label transparent scenes:
<table>
<thead>
<tr>
<th rowspan="3">Method</th>
<th colspan="21">Booster</th>
</tr>
<tr>
<th colspan="7">ALL</th>
<th colspan="7">Trans</th>
<th colspan="7">No_Trans</th>
</tr>
<tr>
<th>EPE</th>
<th>RMSE</th>
<th>2px</th>
<th>3px</th>
<th>5px</th>
<th>6px</th>
<th>8px</th>
<th>EPE</th>
<th>RMSE</th>
<th>2px</th>
<th>3px</th>
<th>5px</th>
<th>6px</th>
<th>8px</th>
<th>EPE</th>
<th>RMSE</th>
<th>2px</th>
<th>3px</th>
<th>5px</th>
<th>6px</th>
<th>8px</th>
</tr>
</thead>
<tbody>
<tr>
<td>Ours</td>
<td>2.26</td>
<td>5.60</td>
<td>11.02</td>
<td>8.59</td>
<td>6.60</td>
<td>6.00</td>
<td>5.35</td>
<td>7.93</td>
<td>11.03</td>
<td>59.83</td>
<td>50.36</td>
<td>38.44</td>
<td>33.87</td>
<td>27.56</td>
<td>1.52</td>
<td>3.93</td>
<td>6.98</td>
<td>4.97</td>
<td>3.64</td>
<td>3.27</td>
<td>2.89</td>
</tr>
<tr>
<td>Ours+Trans</td>
<td>1.24</td>
<td>4.19</td>
<td>7.91</td>
<td>5.97</td>
<td>4.52</td>
<td>4.08</td>
<td>3.44</td>
<td>5.67</td>
<td>8.42</td>
<td>46.78</td>
<td>38.55</td>
<td>28.65</td>
<td>25.41</td>
<td>21.30</td>
<td>0.75</td>
<td>3.07</td>
<td>4.77</td>
<td>3.23</td>
<td>2.29</td>
<td>2.01</td>
<td>1.59</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th rowspan="3">Method</th>
<th colspan="28">Booster</th>
</tr>
<tr>
<th colspan="7">Class 0</th>
<th colspan="7">Class 1</th>
<th colspan="7">Class 2</th>
<th colspan="7">Class 3</th>
</tr>
<tr>
<th>EPE</th>
<th>RMSE</th>
<th>2px</th>
<th>3px</th>
<th>5px</th>
<th>6px</th>
<th>8px</th>
<th>EPE</th>
<th>RMSE</th>
<th>2px</th>
<th>3px</th>
<th>5px</th>
<th>6px</th>
<th>8px</th>
<th>EPE</th>
<th>RMSE</th>
<th>2px</th>
<th>3px</th>
<th>5px</th>
<th>6px</th>
<th>8px</th>
<th>EPE</th>
<th>RMSE</th>
<th>2px</th>
<th>3px</th>
<th>5px</th>
<th>6px</th>
<th>8px</th>
</tr>
</thead>
<tbody>
<tr>
<td>Ours</td>
<td>0.79</td>
<td>3.02</td>
<td>5.90</td>
<td>4.57</td>
<td>3.17</td>
<td>2.58</td>
<td>1.45</td>
<td>1.53</td>
<td>4.70</td>
<td>12.67</td>
<td>7.80</td>
<td>4.88</td>
<td>3.96</td>
<td>3.14</td>
<td>5.32</td>
<td>6.39</td>
<td>23.34</td>
<td>17.62</td>
<td>13.50</td>
<td>12.80</td>
<td>12.15</td>
<td>7.93</td>
<td>11.03</td>
<td>59.83</td>
<td>50.36</td>
<td>38.44</td>
<td>33.87</td>
<td>27.56</td>
</tr>
<tr>
<td>Ours+Trans</td>
<td>0.75</td>
<td>2.99</td>
<td>5.15</td>
<td>4.08</td>
<td>3.00</td>
<td>2.59</td>
<td>1.73</td>
<td>1.40</td>
<td>4.74</td>
<td>9.17</td>
<td>5.63</td>
<td>3.80</td>
<td>3.37</td>
<td>2.86</td>
<td>1.62</td>
<td>2.26</td>
<td>13.51</td>
<td>10.23</td>
<td>7.40</td>
<td>6.50</td>
<td>4.93</td>
<td>5.67</td>
<td>8.42</td>
<td>46.78</td>
<td>38.55</td>
<td>28.65</td>
<td>25.41</td>
<td>21.30</td>
</tr>
</tbody>
</table>