---
title: Diving Into The Fusion Of Monocular Priors For Generalized Stereo Matching
emoji: π»
colorFrom: red
colorTo: indigo
sdk: gradio
sdk_version: 5.38.0
app_file: app.py
pinned: false
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# [ICCV25] Diving into the Fusion of Monocular Priors for Generalized Stereo Matching
Detailed images can be found at Google Drive.
## Requirements

```bash
conda env create -f envs/environment_GStereo.yaml
conda activate raftstereo
```
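As a quick sanity check that the environment resolved correctly and PyTorch sees your GPUs (assuming the environment file pins a CUDA build of PyTorch), you can run:

```bash
conda activate raftstereo
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```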
## Required Data

```
└── datasets
    ├── sceneflow
    │   ├── driving
    │   │   ├── disparity
    │   │   ├── frames_cleanpass
    │   │   └── frames_finalpass
    │   ├── flying3d
    │   │   ├── disparity
    │   │   ├── frames_cleanpass
    │   │   └── frames_finalpass
    │   └── monkaa
    │       ├── disparity
    │       ├── frames_cleanpass
    │       └── frames_finalpass
    ├── Kitti15
    │   ├── testing
    │   │   ├── image_2
    │   │   └── image_3
    │   └── training
    │       ├── disp_noc_0
    │       ├── disp_noc_1
    │       ├── disp_occ_0
    │       ├── disp_occ_1
    │       ├── flow_noc
    │       ├── flow_occ
    │       ├── image_2
    │       ├── image_3
    │       └── obj_map
    ├── Kitti12
    │   ├── testing
    │   │   ├── calib
    │   │   ├── colored_0
    │   │   ├── colored_1
    │   │   ├── disp_noc
    │   │   ├── disp_occ
    │   │   ├── flow_noc
    │   │   ├── flow_occ
    │   │   ├── image_0
    │   │   └── image_1
    │   └── training
    │       ├── calib
    │       ├── colored_0
    │       └── colored_1
    ├── Middlebury
    │   └── MiddEval3
    │       ├── testF
    │       ├── testH
    │       ├── testQ
    │       ├── trainingF
    │       ├── trainingH
    │       └── trainingQ
    ├── ETH3D
    │   ├── two_view_testing
    │   └── two_view_training
    │       ├── delivery_area_1l
    │       ├── delivery_area_1s
    │       └── delivery_area_2l
    └── Booster
        ├── test
        │   ├── balanced
        │   └── unbalanced
        └── train
            ├── balanced
            └── unbalanced
```
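If your raw downloads live outside the repository, one way to assemble the layout above is with symlinks; the `/data/...` source paths below are placeholders for wherever your own copies are stored.

```bash
# Link externally stored datasets into ./datasets (source paths are placeholders).
mkdir -p datasets
ln -s /data/sceneflow  datasets/sceneflow
ln -s /data/Kitti15    datasets/Kitti15
ln -s /data/Kitti12    datasets/Kitti12
ln -s /data/Middlebury datasets/Middlebury
ln -s /data/ETH3D      datasets/ETH3D
ln -s /data/Booster    datasets/Booster
```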
## Code

All code is provided here, including DepthAnything v2. Since we modified `dpt.py` to expose intermediate features and the depth output, please use the modified version.
## Training
The training scripts are provided in `script/train_stereo_raftstereo.sh` and `script/train_stereo_raftstereo_depthany.sh`. Please set the following variables in the scripts before training.
| Variable | Meaning |
|---|---|
| `NCCL_P2P_DISABLE` | We set `NCCL_P2P_DISABLE=1` because distributed training failed on our A40 GPUs. |
| `CUDA_VISIBLE_DEVICES` | Available GPU ids, e.g., `CUDA_VISIBLE_DEVICES=0,1,2,3`. |
| `DATASET_ROOT` | Path to the training dataset, e.g., `./datasets/sceneflow`. |
| `LOG_ROOT` | Path where log files are saved. |
| `TB_ROOT` | Path where TensorBoard data is saved. |
| `CKPOINT_ROOT` | Path where checkpoints are saved. |

To reproduce our results, please download `depth_anything_v2_vitl.pth` from DepthAnything v2 before training, and point `--depthany_model_dir` in the shell script to the directory where `depth_anything_v2_vitl.pth` is saved. We do not provide the link here, as it might conflict with the CVPR guidelines.

We also explain the code used for the ablation study; each experiment is mostly controlled by the `--model_name` passed in the training shell.

| `--model_name` | Meaning |
|---|---|
| `RaftStereo` | Original RaftStereo model. |
| `RaftStereoDisp` | The GRU outputs a single disparity channel instead of two optical-flow channels; *Baseline* in Table 3 of the main text. |
| `RAFTStereoMast3r` | The pre-trained MASt3R is used as the backbone, and its features are used for cost-volume construction; *RaftStereo + backbone Mast3r* in the supplemental text. |
| `RaftStereoNoCTX` | RaftStereo model without the context network; *Baseline w/o mono feature* in Table 3 of the main text. |
| `RAFTStereoDepthAny` | RaftStereo model with our monocular encoder; *Baseline + ME* in Table 3 of the main text. |
| `RAFTStereoDepthFusion` | RaftStereo model with our monocular encoder; *Baseline + ME + IDF* in Table 3 of the main text. |
| `RAFTStereoDepthBeta` | RaftStereo model with our monocular encoder and iterative local fusion; *Baseline + ME + ILF* in Table 3 of the main text. |
| `RAFTStereoDepthBetaNoLBP` | RaftStereo model with our monocular encoder and iterative local fusion without the LBPEncoder; *L(6)* and *L(7)* in Table 4 of the main text. |
| `RAFTStereoDepthMatch` | RaftStereo model with DepthAnything v2 as the feature extractor for cost-volume construction; *RaftStereo + backbone DepthAnything* in the supplemental text. |
| `RAFTStereoDepthPostFusion` | RaftStereo model with our monocular encoder, iterative local fusion, and post fusion; *Baseline + ME + PF* in Table 3 of the main text. |
| `RAFTStereoDepthBetaRefine` | RaftStereo model with our monocular encoder, iterative local fusion, and global fusion; *Baseline + ME + ILF + GF* in Table 3 of the main text. |

Additional flags control the remaining ablations:

| Variable | Meaning |
|---|---|
| `--lbp_neighbor_offsets` | Controls the *LBP Kernel* used in Table 4 of the main text. |
| `--modulation_ratio` | Controls the amplitude parameter *r* used in Table 4 of the main text. |
| `--conf_from_fea` | `Cost` or `Hybrid` for the *Confidence* used in Table 4 of the main text. |
| `--refine_pool` | Learns the registration parameters via pooling, as in the supplemental text. |

Training is launched with:
```bash
bash ./script/train_stereo_raftstereo_depthany.sh EXP_NAME
```

`EXP_NAME` specifies the experiment name. We use this name to save the log file, TensorBoard data, and checkpoint of each experiment. The corresponding file structure is as follows:

```
└── runs
    ├── ckpoint
    │   ├── RaftStereoDepthAny
    │   ├── RaftStereoMast3r
    │   └── RaftStereoNoCTX
    ├── log
    │   ├── RaftStereoDepthAny
    │   ├── RaftStereoMast3r
    │   └── RaftStereoNoCTX
    └── tboard
        ├── RaftStereoDepthAny
        ├── RaftStereoMast3r
        └── RaftStereoNoCTX
```

> ⚠️ **Warning**: Please follow the training process described in the main text. We first train the model without the global fusion module. Then we train the monocular registration of the global fusion module, keeping the other modules frozen and initializing from the well-trained first-stage model. Finally, we train the entire global fusion module, keeping the other modules frozen and initializing from the well-trained second-stage model.
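As a concrete picture of that three-stage schedule, the sketch below runs three experiments in sequence. The experiment names are illustrative, and how each stage restores the previous checkpoint and freezes modules is configured inside the training script, so treat the editing steps described in the comments as assumptions about your local setup.

```bash
# Sketch of the three-stage schedule (experiment names are illustrative).
# Stage 1: train the model without the global fusion module.
bash ./script/train_stereo_raftstereo_depthany.sh stage1_no_global_fusion

# Stage 2: train the monocular registration of the global fusion module;
# edit the script beforehand so it restores the stage-1 checkpoint from
# CKPOINT_ROOT and freezes the other modules.
bash ./script/train_stereo_raftstereo_depthany.sh stage2_registration

# Stage 3: train the entire global fusion module, restoring the stage-2
# checkpoint and keeping the other modules frozen.
bash ./script/train_stereo_raftstereo_depthany.sh stage3_global_fusion
```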
## Evaluation
The evaluation script is provided in `script/evaluate_stereo_raftstereo.sh`. We use `--test_exp_name` to specify the evaluation experiment name. The results of each experiment are stored in `LOG_ROOT/eval.xlsx`. We also merge the results of all experiments into `LOG_ROOT/merged_eval.xlsx` via `python3 merge_sheet.py`. The evaluation metrics are the same across methods. The mean ± std is computed via `tools/get_statistics.py`.
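A minimal end-to-end evaluation pass might look like the sketch below; whether `--test_exp_name` is set inside the shell script or passed on the command line depends on your local copy, so treat the exact invocation as an assumption.

```bash
# Evaluate one experiment, then aggregate the per-experiment sheets.
bash ./script/evaluate_stereo_raftstereo.sh   # writes LOG_ROOT/eval.xlsx
python3 merge_sheet.py                        # writes LOG_ROOT/merged_eval.xlsx
python3 tools/get_statistics.py               # computes mean ± std
```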
## Visualization

We visualize the error maps via `script/gen_sample_stereo_raftstereo.sh` and the intermediate results via `script/vis_inter_stereo_raftstereo.sh`. We provide an easy-to-use visualization toolbox to help fully understand each module.
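Both toolboxes are driven by their shell scripts; assuming the variables inside each script are already set, a typical session is simply:

```bash
bash ./script/gen_sample_stereo_raftstereo.sh   # error maps
bash ./script/vis_inter_stereo_raftstereo.sh    # intermediate results of each module
```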
## Demo
The model weights, pre-trained on SceneFlow, can be downloaded from Google Drive. The demo for inferring disparity maps from custom image pairs is provided in `infer_stereo_raftstereo.py`. For specific usage, please refer to `script/infer_stereo_raftstereo.sh`.
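For orientation, a run might look like the sketch below; the flag names are hypothetical stand-ins, so check `script/infer_stereo_raftstereo.sh` for the exact arguments `infer_stereo_raftstereo.py` expects.

```bash
# Hypothetical invocation; see script/infer_stereo_raftstereo.sh for real flags.
python3 infer_stereo_raftstereo.py \
    --restore_ckpt ./ckpoints/sceneflow_pretrained.pth \
    --left ./demo/left.png \
    --right ./demo/right.png \
    --output_dir ./demo/output
```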
## More Results
The following results were obtained after using our custom synthetic Trans Dataset, which is built for multi-label transparent scenes.
**Booster**

| Method | Region | EPE | RMSE | 2px | 3px | 5px | 6px | 8px |
|---|---|---|---|---|---|---|---|---|
| Ours | ALL | 2.26 | 5.60 | 11.02 | 8.59 | 6.60 | 6.00 | 5.35 |
| Ours | Trans | 7.93 | 11.03 | 59.83 | 50.36 | 38.44 | 33.87 | 27.56 |
| Ours | No_Trans | 1.52 | 3.93 | 6.98 | 4.97 | 3.64 | 3.27 | 2.89 |
| Ours+Trans | ALL | 1.24 | 4.19 | 7.91 | 5.97 | 4.52 | 4.08 | 3.44 |
| Ours+Trans | Trans | 5.67 | 8.42 | 46.78 | 38.55 | 28.65 | 25.41 | 21.30 |
| Ours+Trans | No_Trans | 0.75 | 3.07 | 4.77 | 3.23 | 2.29 | 2.01 | 1.59 |
**Booster, per class**

| Method | Class | EPE | RMSE | 2px | 3px | 5px | 6px | 8px |
|---|---|---|---|---|---|---|---|---|
| Ours | Class 0 | 0.79 | 3.02 | 5.90 | 4.57 | 3.17 | 2.58 | 1.45 |
| Ours | Class 1 | 1.53 | 4.70 | 12.67 | 7.80 | 4.88 | 3.96 | 3.14 |
| Ours | Class 2 | 5.32 | 6.39 | 23.34 | 17.62 | 13.50 | 12.80 | 12.15 |
| Ours | Class 3 | 7.93 | 11.03 | 59.83 | 50.36 | 38.44 | 33.87 | 27.56 |
| Ours+Trans | Class 0 | 0.75 | 2.99 | 5.15 | 4.08 | 3.00 | 2.59 | 1.73 |
| Ours+Trans | Class 1 | 1.40 | 4.74 | 9.17 | 5.63 | 3.80 | 3.37 | 2.86 |
| Ours+Trans | Class 2 | 1.62 | 2.26 | 13.51 | 10.23 | 7.40 | 6.50 | 4.93 |
| Ours+Trans | Class 3 | 5.67 | 8.42 | 46.78 | 38.55 | 28.65 | 25.41 | 21.30 |