metadata
title: Diving Into The Fusion Of Monocular Priors For Generalized Stereo Matching
emoji: 😻
colorFrom: red
colorTo: indigo
sdk: gradio
sdk_version: 5.38.0
app_file: app.py
pinned: false

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

[ICCV25] Diving into the Fusion of Monocular Priors for Generalized Stereo Matching

Detailed images can be found on Google Drive.

Requirements

conda env create -f envs/environment_GStereo.yaml
conda activate raftstereo

Required Data

├── datasets
    ├── sceneflow
        ├── driving
        │   ├── disparity
        │   ├── frames_cleanpass
        │   └── frames_finalpass
        ├── flying3d
        │   ├── disparity
        │   ├── frames_cleanpass
        │   └── frames_finalpass
        └── monkaa
            ├── disparity
            ├── frames_cleanpass
            └── frames_finalpass
    ├── Kitti15
        ├── testing
        │   ├── image_2
        │   └── image_3
        └── training
            ├── disp_noc_0
            ├── disp_noc_1
            ├── disp_occ_0
            ├── disp_occ_1
            ├── flow_noc
            ├── flow_occ
            ├── image_2
            ├── image_3
            └── obj_map
    ├── Kitti12
        ├── testing
        │   ├── calib
        │   ├── colored_0
        │   ├── colored_1
        │   ├── disp_noc
        │   ├── disp_occ
        │   ├── flow_noc
        │   ├── flow_occ
        │   ├── image_0
        │   └── image_1
        └── training
            ├── calib
            ├── colored_0
            └── colored_1
    ├── Middlebury
        └── MiddEval3
            ├── testF
            ├── testH
            ├── testQ
            ├── trainingF
            ├── trainingH
            └── trainingQ
    ├── ETH3D
        ├── two_view_testing
        └── two_view_training
            ├── delivery_area_1l
            ├── delivery_area_1s
            ├── delivery_area_2l
    ├── Booster
        ├── test
        │   ├── balanced
        │   └── unbalanced
        └── train
            ├── balanced
            └── unbalanced
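
Before training, it can help to confirm that the directory layout above is in place. The following is a minimal sketch and not part of the repository: the helper name `missing_dataset_dirs` and the `EXPECTED_DIRS` list are assumptions made for illustration, and the list covers only a few paths taken from the tree above.

```python
from pathlib import Path

# A few paths from the layout listed above (illustrative, not exhaustive).
EXPECTED_DIRS = [
    "sceneflow/driving/disparity",
    "sceneflow/flying3d/frames_finalpass",
    "sceneflow/monkaa/frames_cleanpass",
    "Kitti15/training/disp_occ_0",
    "Kitti12/training/colored_0",
    "Middlebury/MiddEval3/trainingH",
    "ETH3D/two_view_training",
    "Booster/train/balanced",
]


def missing_dataset_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    # "./datasets" matches the example DATASET_ROOT prefix used in this README.
    missing = missing_dataset_dirs("./datasets")
    if missing:
        print("Missing directories:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All checked dataset directories are present.")
```

Extend `EXPECTED_DIRS` with any further folders you rely on; the check only tests for directory existence, not for file contents.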

Code

All code is provided here, including DepthAnything v2. Because we modified dpt.py to return intermediate features and the depth output, please use the modified version.

  • Training

    The training scripts are script/train_stereo_raftstereo.sh and script/train_stereo_raftstereo_depthany.sh. Please set the following variables in the scripts before training.

    Variable              Meaning
    NCCL_P2P_DISABLE      Set to 1 to disable NCCL peer-to-peer communication; we use NCCL_P2P_DISABLE=1 because distributed training failed on our A40 GPUs.
    CUDA_VISIBLE_DEVICES  IDs of the available GPUs, e.g., CUDA_VISIBLE_DEVICES=0,1,2,3
    DATASET_ROOT          Path to the training dataset, e.g., ./datasets/sceneflow
    LOG_ROOT              Path where log files are saved
    TB_ROOT               Path where TensorBoard data is saved
    CKPOINT_ROOT          Path where checkpoints are saved

    To reproduce our results, download depth_anything_v2_vitl.pth from DepthAnything v2 before training, and set --depthany_model_dir in the training script to the directory where depth_anything_v2_vitl.pth is saved. We do not provide the link here, as doing so may conflict with the CVPR guidelines. The ablation studies use the same code: each experiment is selected mainly by the --model_name passed in the training script.

    --model_name               Meaning
    RaftStereo                 Original RaftStereo model.
    RaftStereoDisp             The GRU outputs a single channel for disparity instead of two channels for optical flow. "Baseline" in Table 3 of the main text.
    RAFTStereoMast3r           The pre-trained MASt3R is used as the backbone, and its features are used for cost volume construction. "RaftStereo + backbone Mast3r" in the supplemental text.
    RaftStereoNoCTX            RaftStereo model without the context network. "Baseline w/o mono feature" in Table 3 of the main text.
    RAFTStereoDepthAny         RaftStereo model with our monocular encoder. "Baseline + ME" in Table 3 of the main text.
    RAFTStereoDepthFusion      RaftStereo model with our monocular encoder. "Baseline + ME + IDF" in Table 3 of the main text.
    RAFTStereoDepthBeta        RaftStereo model with our monocular encoder and iterative local fusion. "Baseline + ME + ILF" in Table 3 of the main text.
    RAFTStereoDepthBetaNoLBP   RaftStereo model with our monocular encoder and iterative local fusion, without the LBPEncoder. "L(6)" and "L(7)" in Table 4 of the main text.
    RAFTStereoDepthMatch       RaftStereo model with DepthAnything v2 as the feature extractor for cost volume construction. "RaftStereo + backbone DepthAnything" in the supplemental text.
    RAFTStereoDepthPostFusion  RaftStereo model with our monocular encoder, iterative local fusion, and post fusion. "Baseline + ME + PF" in Table 3 of the main text.
    RAFTStereoDepthBetaRefine  RaftStereo model with our monocular encoder, iterative local fusion, and global fusion. "Baseline + ME + ILF + GF" in Table 3 of the main text.

    Option                     Meaning
    --lbp_neighbor_offsets     Controls the "LBP Kernel" setting used in Table 4 of the main text.
    --modulation_ratio         Controls the amplitude parameter r used in Table 4 of the main text.
    --conf_from_fea            Selects "Cost" or "Hybrid" for "Confidence" in Table 4 of the main text.
    --refine_pool              Learns the registration parameters via pooling, as described in the supplemental text.

    Training is launched as follows:

    bash ./script/train_stereo_raftstereo_depthany.sh EXP_NAME
    

    EXP_NAME specifies the experiment name. We use this name to save each log file, tensorboard data, and checkpoint for different experiments. The corresponding file structure is as follows

    ├── runs
        ├── ckpoint
        │   ├── RaftStereoDepthAny
        │   ├── RaftStereoMast3r
        │   └── RaftStereoNoCTX
        ├── log
        │   ├── RaftStereoDepthAny
        │   ├── RaftStereoMast3r
        │   └── RaftStereoNoCTX
        └── tboard
            ├── RaftStereoDepthAny
            ├── RaftStereoMast3r
            └── RaftStereoNoCTX
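
The layout above can be read as three roots, each holding one sub-directory per experiment name. A minimal Python sketch of that convention follows. The helper `experiment_dirs` and the `runs` default are illustrative assumptions; in the training scripts the roots come from LOG_ROOT, TB_ROOT, and CKPOINT_ROOT, and how the scripts join them with EXP_NAME may differ in detail.

```python
from pathlib import Path


def experiment_dirs(exp_name, runs_root="runs"):
    """Map an experiment name to its checkpoint, log, and TensorBoard directories."""
    root = Path(runs_root)
    return {
        "ckpoint": root / "ckpoint" / exp_name,
        "log": root / "log" / exp_name,
        "tboard": root / "tboard" / exp_name,
    }


if __name__ == "__main__":
    # Print where the outputs of one experiment would be expected.
    for kind, path in experiment_dirs("RaftStereoDepthAny").items():
        print(kind, path.as_posix())
```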
    

    ⚠️ Warning: Please follow the training process mentioned in our main text. We first train the model without the global fusion module. Then, we train the monocular registration of the global fusion module while keeping the other modules frozen with a well-trained model from the first stage. Finally, we train the entire global fusion module while keeping the other modules frozen with a well-trained model from the second stage.
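
The three stages in the warning above amount to a rule for which parameters are updated at each stage. The sketch below expresses that rule in plain Python. It is an illustration only: the parameter-name prefixes `global_fusion.` and `global_fusion.registration.` are hypothetical and are not taken from the repository, so they would need to be replaced by the real module names.

```python
# Hypothetical parameter-name prefixes; the real module names may differ.
GLOBAL_FUSION_PREFIX = "global_fusion."
REGISTRATION_PREFIX = "global_fusion.registration."


def is_trainable(param_name, stage):
    """Decide whether a parameter is updated in training stage 1, 2, or 3."""
    in_global_fusion = param_name.startswith(GLOBAL_FUSION_PREFIX)
    if stage == 1:
        # Stage 1: train the model without the global fusion module.
        return not in_global_fusion
    if stage == 2:
        # Stage 2: train only the monocular registration of global fusion.
        return param_name.startswith(REGISTRATION_PREFIX)
    if stage == 3:
        # Stage 3: train the entire global fusion module; the rest is frozen.
        return in_global_fusion
    raise ValueError("stage must be 1, 2, or 3")


if __name__ == "__main__":
    names = [
        "encoder.conv1.weight",
        "global_fusion.registration.fc.weight",
        "global_fusion.refine.conv.weight",
    ]
    for stage in (1, 2, 3):
        print(stage, [n for n in names if is_trainable(n, stage)])
```

With a PyTorch model, one way to apply such a rule is to loop over `model.named_parameters()` and call `param.requires_grad_(is_trainable(name, stage))`, after loading the checkpoint from the previous stage.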

  • Evaluation

    The evaluation script is script/evaluate_stereo_raftstereo.sh. Use --test_exp_name to specify the name of the experiment to evaluate. The results of each experiment are stored in LOG_ROOT/eval.xlsx. Running python3 merge_sheet.py merges the results of all experiments into LOG_ROOT/merged_eval.xlsx. The evaluation metrics are the same for all methods. The mean ± std is computed via tools/get_statistics.py.
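
As a rough illustration of the mean ± std summary mentioned above, the sketch below aggregates one metric over several runs using only the Python standard library. It is not the code in tools/get_statistics.py: the function name `mean_std` is hypothetical, the input values are made up, and whether the repository uses the sample or the population standard deviation is not stated here, so the sample version is an assumption.

```python
import statistics


def mean_std(values):
    """Return (mean, sample standard deviation) of a list of metric values."""
    mean = statistics.mean(values)
    # statistics.stdev is the sample (n - 1) estimate and needs two or more values.
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return mean, std


if __name__ == "__main__":
    # Made-up EPE values from three hypothetical runs, for illustration only.
    epe_runs = [1.0, 2.0, 3.0]
    mean, std = mean_std(epe_runs)
    print(f"{mean:.2f} ± {std:.2f}")
```

Swap in `statistics.pstdev` if the population standard deviation is the intended convention.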

  • Visualization

    We visualize the error map via script/gen_sample_stereo_raftstereo.sh and intermediate results via script/vis_inter_stereo_raftstereo.sh. We provide an easy-to-use visualization toolbox to fully understand each module.

  • Demo

    The model weights, pre-trained on SceneFlow, can be downloaded from Google Drive. The demo used to infer disparity maps from custom image pairs is presented in infer_stereo_raftstereo.py. For specific usage, please refer to script/infer_stereo_raftstereo.sh.

More Results

The tables below report results on Booster with and without our custom synthetic Trans Dataset, which is built for multi-label transparent scenes.

Booster, by region

Region    Method        EPE   RMSE    2px    3px    5px    6px    8px
ALL       Ours         2.26   5.60  11.02   8.59   6.60   6.00   5.35
ALL       Ours+Trans   1.24   4.19   7.91   5.97   4.52   4.08   3.44
Trans     Ours         7.93  11.03  59.83  50.36  38.44  33.87  27.56
Trans     Ours+Trans   5.67   8.42  46.78  38.55  28.65  25.41  21.30
No_Trans  Ours         1.52   3.93   6.98   4.97   3.64   3.27   2.89
No_Trans  Ours+Trans   0.75   3.07   4.77   3.23   2.29   2.01   1.59

Booster, by class

Class     Method        EPE   RMSE    2px    3px    5px    6px    8px
Class 0   Ours         0.79   3.02   5.90   4.57   3.17   2.58   1.45
Class 0   Ours+Trans   0.75   2.99   5.15   4.08   3.00   2.59   1.73
Class 1   Ours         1.53   4.70  12.67   7.80   4.88   3.96   3.14
Class 1   Ours+Trans   1.40   4.74   9.17   5.63   3.80   3.37   2.86
Class 2   Ours         5.32   6.39  23.34  17.62  13.50  12.80  12.15
Class 2   Ours+Trans   1.62   2.26  13.51  10.23   7.40   6.50   4.93
Class 3   Ours         7.93  11.03  59.83  50.36  38.44  33.87  27.56
Class 3   Ours+Trans   5.67   8.42  46.78  38.55  28.65  25.41  21.30
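
For readers unfamiliar with the column names, the sketch below computes EPE, RMSE, and the N-px error rates under their conventional stereo-matching definitions: mean absolute disparity error, root mean squared error, and the percentage of pixels whose error exceeds N pixels. This is an assumption about the definitions, not the repository's evaluation code, which may additionally apply validity or region masks. The function name `disparity_metrics` and the input values are hypothetical.

```python
import math


def disparity_metrics(pred, gt, thresholds=(2, 3, 5, 6, 8)):
    """Compute EPE, RMSE, and bad-pixel rates (%) over paired disparity values."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and of equal length")
    # Absolute disparity error per pixel.
    errors = [abs(p - g) for p, g in zip(pred, gt)]
    n = len(errors)
    metrics = {
        "EPE": sum(errors) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
    }
    for t in thresholds:
        # Percentage of pixels whose error is strictly greater than t pixels.
        metrics[f"{t}px"] = 100.0 * sum(e > t for e in errors) / n
    return metrics


if __name__ == "__main__":
    # Made-up disparities for four pixels, for illustration only.
    predicted = [10.0, 20.0, 30.0, 40.0]
    ground_truth = [10.0, 21.0, 34.0, 50.0]
    for name, value in disparity_metrics(predicted, ground_truth).items():
        print(f"{name}: {value:.2f}")
```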