title: Diving Into The Fusion Of Monocular Priors For Generalized Stereo Matching
emoji: 😻
colorFrom: red
colorTo: indigo
sdk: gradio
sdk_version: 5.38.0
app_file: app.py
pinned: false

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

[ICCV25] Diving into the Fusion of Monocular Priors for Generalized Stereo Matching

Detailed images can be found on Google Drive.

Requirements

conda env create -f envs/environment_GStereo.yaml
conda activate raftstereo

Required Data

├── datasets
    ├── sceneflow
        ├── driving                                               
        │   ├── disparity                                         
        │   ├── frames_cleanpass                                  
        │   └── frames_finalpass                                  
        ├── flying3d                                              
        │   ├── disparity                                         
        │   ├── frames_cleanpass                                  
        │   └── frames_finalpass                                  
        └── monkaa                                                
            ├── disparity                                         
            ├── frames_cleanpass                                                                                             
            └── frames_finalpass
    ├── Kitti15
        ├── testing
        │   ├── image_2
        │   └── image_3
        └── training
            ├── disp_noc_0
            ├── disp_noc_1
            ├── disp_occ_0
            ├── disp_occ_1
            ├── flow_noc
            ├── flow_occ
            ├── image_2
            ├── image_3
            └── obj_map
    ├── Kitti12
        ├── testing
        │   ├── calib
        │   ├── colored_0
        │   ├── colored_1
        │   ├── disp_noc
        │   ├── disp_occ
        │   ├── flow_noc
        │   ├── flow_occ
        │   ├── image_0
        │   └── image_1
        └── training
            ├── calib
            ├── colored_0
            └── colored_1
    ├── Middlebury
        └── MiddEval3  
            ├── testF
            ├── testH
            ├── testQ    
            ├── trainingF                               
            ├── trainingH                                         
            └── trainingQ
    ├── ETH3D
        ├── two_view_testing
        └── two_view_training
            ├── delivery_area_1l
            ├── delivery_area_1s
            ├── delivery_area_2l
    ├── Booster
        ├── test
        │   ├── balanced
        │   └── unbalanced
        └── train
            ├── balanced
            └── unbalanced
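
Before launching training or evaluation, it can help to confirm that the top-level folders of the layout above are in place. The following is a minimal sketch, not part of the repository: the helper name `missing_dataset_dirs` and the `EXPECTED_DIRS` table are hypothetical, and only the first two directory levels from the tree above are checked.

```python
from pathlib import Path

# First two levels of the layout listed above (assumed sufficient for a quick check).
EXPECTED_DIRS = {
    "sceneflow": ["driving", "flying3d", "monkaa"],
    "Kitti15": ["testing", "training"],
    "Kitti12": ["testing", "training"],
    "Middlebury": ["MiddEval3"],
    "ETH3D": ["two_view_testing", "two_view_training"],
    "Booster": ["test", "train"],
}


def missing_dataset_dirs(root):
    """Return the expected sub-directories (relative, POSIX-style) absent under `root`."""
    root = Path(root)
    missing = []
    for dataset, subdirs in EXPECTED_DIRS.items():
        for sub in subdirs:
            if not (root / dataset / sub).is_dir():
                missing.append((Path(dataset) / sub).as_posix())
    return missing


if __name__ == "__main__":
    # "./datasets" matches the root folder shown in the tree above.
    for rel in missing_dataset_dirs("./datasets"):
        print("missing:", rel)
```

An empty result means every listed folder exists; deeper folders such as `frames_cleanpass` are not verified by this sketch.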

Code

All code is provided here, including DepthAnything v2. Because we modified dpt.py to return intermediate features along with the depth output, please use the modified code rather than the upstream version.

Training

The training scripts are script/train_stereo_raftstereo.sh and script/train_stereo_raftstereo_depthany.sh. Please set the following variables in the scripts before training.

variable	meaning
`NCCL_P2P_DISABLE`	We set `NCCL_P2P_DISABLE=1` because distributed training failed on our `A40` GPUs.
`CUDA_VISIBLE_DEVICES`	available GPU IDs, e.g., `CUDA_VISIBLE_DEVICES=0,1,2,3`
`DATASET_ROOT`	the training dataset path, e.g., `./datasets/sceneflow`
`LOG_ROOT`	path to save log file
`TB_ROOT`	path to save tensorboard data
`CKPOINT_ROOT`	path to save checkpoint

To reproduce our results, download depth_anything_v2_vitl.pth from DepthAnything v2 before training, and set --depthany_model_dir in the shell script to the directory where depth_anything_v2_vitl.pth is saved. We do not provide the link here, as doing so may conflict with the CVPR guidelines. The code for the ablation study is also included; each experiment is mostly controlled by the --model_name used in the training shell script.

`--model_name`	meaning
`RaftStereo`	Original RaftStereo model
`RaftStereoDisp`	The output of GRU is a single channel for disparity instead of two channels for optical flow, `Baseline` in Table 3 of the main text.
`RAFTStereoMast3r`	The pre-trained MASt3R is used as the backbone, and its features are used for cost volume construction, `RaftStereo + backbone Mast3r` in supplemental text.
`RaftStereoNoCTX`	RaftStereo model without context network, `Baseline w/o mono feature` in Table 3 of the main text.
`RAFTStereoDepthAny`	RaftStereo model with our monocular encoder, `Baseline + ME` in Table 3 of the main text.
`RAFTStereoDepthFusion`	RaftStereo model with our monocular encoder, `Baseline + ME + IDF` in Table 3 of the main text.
`RAFTStereoDepthBeta`	RaftStereo model with our monocular encoder and iterative local fusion, `Baseline + ME + ILF` in Table 3 of the main text.
`RAFTStereoDepthBetaNoLBP`	RaftStereo model with our monocular encoder and iterative local fusion without LBPEncoder, `L(6)` and `L(7)` in Table 4 of the main text.
`RAFTStereoDepthMatch`	RaftStereo model with DepthAnything v2 as feature extractor for cost volume construction, `RaftStereo + backbone DepthAnything` in the supplemental text.
`RAFTStereoDepthPostFusion`	RaftStereo model with our monocular encoder, iterative local fusion and post fusion, `Baseline + ME + PF` in Table 3 of the main text.
`RAFTStereoDepthBetaRefine`	RaftStereo model with our monocular encoder, iterative local fusion, and global fusion, `Baseline + ME + ILF + GF` in Table 3 of the main text.

variable	meaning
`--lbp_neighbor_offsets`	controls the `LBP Kernel` used in Table 4 of the main text.
`--modulation_ratio`	controls the amplitude parameter `r` used in Table 4 of the main text.
`--conf_from_fea`	`Cost` or `Hybrid` for the `Confidence` used in Table 4 of the main text.
`--refine_pool`	learns the registration parameters via pooling, as described in the supplemental text.

Training is launched as follows:

bash ./script/train_stereo_raftstereo_depthany.sh EXP_NAME

EXP_NAME specifies the experiment name. We use this name to save the log file, tensorboard data, and checkpoints of each experiment separately. The corresponding file structure is as follows:

├── runs
    ├── ckpoint
    │   ├── RaftStereoDepthAny
    │   ├── RaftStereoMast3r
    │   └── RaftStereoNoCTX
    ├── log
    │   ├── RaftStereoDepthAny
    │   ├── RaftStereoMast3r
    │   └── RaftStereoNoCTX
    └── tboard
        ├── RaftStereoDepthAny
        ├── RaftStereoMast3r
        └── RaftStereoNoCTX
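
The layout above amounts to one sub-folder per experiment under each of the three roots. The following is a minimal sketch of that convention, not the code used by the training scripts: the function name `experiment_dirs` and the default root paths are assumptions taken from the tree above, and how the scripts actually join `LOG_ROOT`, `TB_ROOT`, and `CKPOINT_ROOT` with `EXP_NAME` may differ.

```python
from pathlib import Path


def experiment_dirs(exp_name, ckpoint_root="runs/ckpoint",
                    log_root="runs/log", tb_root="runs/tboard", create=False):
    """Map an experiment name to its checkpoint, log, and tensorboard folders.

    The default roots mirror the tree above; they stand in for CKPOINT_ROOT,
    LOG_ROOT, and TB_ROOT and are assumptions, not values read from the scripts.
    """
    dirs = {
        "ckpoint": Path(ckpoint_root) / exp_name,
        "log": Path(log_root) / exp_name,
        "tboard": Path(tb_root) / exp_name,
    }
    if create:
        for path in dirs.values():
            path.mkdir(parents=True, exist_ok=True)
    return dirs


if __name__ == "__main__":
    for kind, path in experiment_dirs("RaftStereoDepthAny").items():
        print(kind, path.as_posix())
```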

⚠️ Warning: Please follow the training process mentioned in our main text. We first train the model without the global fusion module. Then, we train the monocular registration of the global fusion module while keeping the other modules frozen with a well-trained model from the first stage. Finally, we train the entire global fusion module while keeping the other modules frozen with a well-trained model from the second stage.
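
The three-stage schedule in the warning above can be summarized as a rule for which parameters are trainable at each stage. The following is a minimal sketch under stated assumptions: the parameter-name prefixes `global_fusion.` and `global_fusion.registration.` are hypothetical and do not come from this repository, so they need to be replaced with the real module names.

```python
# Hypothetical parameter-name prefixes; the actual module names in the code may differ.
GLOBAL_FUSION_PREFIX = "global_fusion."
REGISTRATION_PREFIX = "global_fusion.registration."


def is_trainable(param_name, stage):
    """Decide whether a parameter is updated in the given training stage.

    Stage 1: everything except the global fusion module.
    Stage 2: only the monocular registration part of global fusion.
    Stage 3: the entire global fusion module; all other modules stay frozen.
    """
    in_global_fusion = param_name.startswith(GLOBAL_FUSION_PREFIX)
    in_registration = param_name.startswith(REGISTRATION_PREFIX)
    if stage == 1:
        return not in_global_fusion
    if stage == 2:
        return in_registration
    if stage == 3:
        return in_global_fusion
    raise ValueError(f"unknown stage: {stage}")


if __name__ == "__main__":
    names = ["fnet.conv1.weight", "global_fusion.registration.fc.weight",
             "global_fusion.refine.conv.weight"]
    for stage in (1, 2, 3):
        print(stage, [n for n in names if is_trainable(n, stage)])
```

With a PyTorch model, such a rule would typically be applied by iterating over `model.named_parameters()` and setting `param.requires_grad = is_trainable(name, stage)` after loading the checkpoint from the previous stage.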

Evaluation

The evaluation script is script/evaluate_stereo_raftstereo.sh. We use --test_exp_name to specify the name of the experiment to evaluate. The results of each experiment are stored in LOG_ROOT/eval.xlsx. We also merge the results of all experiments into LOG_ROOT/merged_eval.xlsx through python3 merge_sheet.py. The evaluation metrics remain the same for different methods. The mean ± std is computed via tools/get_statistics.py.

Visualization

We visualize the error map via script/gen_sample_stereo_raftstereo.sh and intermediate results via script/vis_inter_stereo_raftstereo.sh. We provide an easy-to-use visualization toolbox to fully understand each module.

Demo

The model weights, pre-trained on SceneFlow, can be downloaded from Google Drive. The demo used to infer disparity maps from custom image pairs is presented in infer_stereo_raftstereo.py. For specific usage, please refer to script/infer_stereo_raftstereo.sh.

More Results

The tables below show the results after using our custom synthetic dataset, Trans Dataset, which is built for multi-label transparent scenes.

Method	Booster
	ALL							Trans							No_Trans
	EPE	RMSE	2px	3px	5px	6px	8px	EPE	RMSE	2px	3px	5px	6px	8px	EPE	RMSE	2px	3px	5px	6px	8px
Ours	2.26	5.60	11.02	8.59	6.60	6.00	5.35	7.93	11.03	59.83	50.36	38.44	33.87	27.56	1.52	3.93	6.98	4.97	3.64	3.27	2.89
Ours+Trans	1.24	4.19	7.91	5.97	4.52	4.08	3.44	5.67	8.42	46.78	38.55	28.65	25.41	21.30	0.75	3.07	4.77	3.23	2.29	2.01	1.59

Method	Booster
	Class 0							Class 1							Class 2							Class 3
	EPE	RMSE	2px	3px	5px	6px	8px	EPE	RMSE	2px	3px	5px	6px	8px	EPE	RMSE	2px	3px	5px	6px	8px	EPE	RMSE	2px	3px	5px	6px	8px
Ours	0.79	3.02	5.90	4.57	3.17	2.58	1.45	1.53	4.70	12.67	7.80	4.88	3.96	3.14	5.32	6.39	23.34	17.62	13.50	12.80	12.15	7.93	11.03	59.83	50.36	38.44	33.87	27.56
Ours+Trans	0.75	2.99	5.15	4.08	3.00	2.59	1.73	1.40	4.74	9.17	5.63	3.80	3.37	2.86	1.62	2.26	13.51	10.23	7.40	6.50	4.93	5.67	8.42	46.78	38.55	28.65	25.41	21.30
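
For readers unfamiliar with the column names in the tables above, the following is a minimal sketch of how such disparity metrics are conventionally computed. It is not the repository's evaluation code: the function name `disparity_metrics` is hypothetical, and it assumes that EPE is the mean absolute disparity error, that RMSE is taken over all given pixels, and that `Npx` is the percentage of pixels whose error is strictly greater than N pixels. Valid-pixel masking and per-image averaging are omitted.

```python
import math


def disparity_metrics(pred, gt, thresholds=(2, 3, 5, 6, 8)):
    """Compute EPE, RMSE, and bad-pixel rates from flat lists of disparities.

    `pred` and `gt` are equal-length sequences of predicted and ground-truth
    disparities (in pixels) for the valid pixels only.
    """
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and of equal length")
    errors = [abs(p - g) for p, g in zip(pred, gt)]
    n = len(errors)
    metrics = {
        "EPE": sum(errors) / n,                            # mean absolute error
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
    }
    for t in thresholds:
        # Percentage of pixels with an error strictly above t pixels (assumed convention).
        metrics[f"{t}px"] = 100.0 * sum(e > t for e in errors) / n
    return metrics


if __name__ == "__main__":
    result = disparity_metrics([10.0, 12.5, 20.0, 31.0], [10.0, 10.0, 19.0, 25.0])
    for name, value in result.items():
        print(f"{name}: {value:.3f}")
```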
