Improve model card: add metadata, links, and detailed usage

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +168 -2
README.md CHANGED
@@ -1,7 +1,173 @@
  ---
  license: bsd-2-clause
+ pipeline_tag: image-to-image
+ library_name: diffusers
  ---

- ## References
-
- * [Model Paper](https://arxiv.org/abs/2502.17157)
+ <p align="center">
+ <img src="https://github.com/aim-uofa/Diception/raw/main/assets/logo.png" height=200>
+ </p>
+ <hr>
+ <div align="center">
+
+ # DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
+
+ <p align="center">
+ <a href="https://huggingface.co/papers/2502.17157"><b>πŸ“„ Paper Link</b></a> |
+ <a href="https://aim-uofa.github.io/Diception/"><b>πŸ“– Project Page</b></a> |
+ <a href="https://github.com/aim-uofa/Diception"><b>πŸ’» GitHub Code</b></a> |
+ <a href="https://huggingface.co/spaces/Canyu/Diception-Demo"><b>πŸ€— Huggingface Demo</b></a>
+ </p>
+
+ </div>
+
+ This repository contains DICEPTION, a robust generalist perception model that addresses multiple visual tasks with high efficiency. It leverages text-to-image diffusion models pre-trained on billions of images to achieve performance comparable to state-of-the-art single-task specialist models, at significantly lower computational cost and with far less data.
+
+ ## Abstract
+
+ This paper's primary objective is to develop a robust generalist perception model capable of addressing multiple tasks under constraints of computational resources and limited training data. We leverage text-to-image diffusion models pre-trained on billions of images and successfully introduce our DICEPTION, a visual generalist model. Exhaustive evaluations demonstrate that DICEPTION effectively tackles diverse perception tasks, even achieving performance comparable to SOTA single-task specialist models. Specifically, we achieve results on par with SAM-vit-h using only 0.06% of their data (e.g., 600K vs. 1B pixel-level annotated images). We designed comprehensive experiments on architectures and input paradigms, demonstrating that the key to successfully re-purposing a single diffusion model for multiple perception tasks lies in maximizing the preservation of the pre-trained model's prior knowledge. Consequently, DICEPTION can be trained with substantially lower computational costs than conventional models requiring training from scratch. Furthermore, adapting DICEPTION to novel tasks is highly efficient, necessitating fine-tuning on as few as 50 images and approximately 1% of its parameters. Finally, we demonstrate that a subtle application of classifier-free guidance can improve the model's performance on depth and normal estimation. We also show that pixel-aligned training, as is characteristic of perception tasks, significantly enhances the model's ability to preserve fine details. DICEPTION offers valuable insights and presents a promising direction for the development of advanced diffusion-based visual generalist models.
+
+ ## πŸ“° News
+
+ - 2025-09-21: πŸš€ Model and inference code released
+ - 2025-09-19: 🌟 Accepted as NeurIPS 2025 Spotlight
+ - 2025-02-25: πŸ“ Paper released
+
+ ## πŸ› οΈ Installation
+ ```bash
+ conda create -n diception python=3.10 -y
+
+ conda activate diception
+
+ pip install -r requirements.txt
+ ```
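+
+ To sanity-check the environment (assuming the pinned requirements include PyTorch and Diffusers), you can run:
+
+ ```bash
+ python -c "import torch, diffusers; print(torch.__version__, diffusers.__version__, torch.cuda.is_available())"
+ ```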
+
+ ## πŸ‘Ύ Inference
+
+ ### ⚑ Quick Start
+
+ #### 🧩 Model Setup
+
+ 1. **Download SD3 Base Model**:
+ Download the Stable Diffusion 3 medium model from:
+ https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers
+
+ 2. **Download Trained Weights**:
+ Download the DICEPTION weights from Hugging Face: https://huggingface.co/Canyu/DICEPTION (a CLI download sketch follows this list)
+
+ 3. **Update Paths**:
+ Set `--pretrained_model_path` to your local SD3 path, and set `--diception_path` to the local path of the downloaded `DICEPTION_v1.pth`.
+
+ 4. **Sample JSON for Batch Inference**:
+ We provide several JSON examples for batch inference in the `DATA/jsons/evaluate` directory.
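+
+ For steps 1 and 2, one convenient option (assuming `huggingface_hub` is installed, e.g. via the requirements) is the Hugging Face CLI; the local directories below are placeholders:
+
+ ```bash
+ # Base model (gated; accept the license on its model page first)
+ huggingface-cli download stabilityai/stable-diffusion-3-medium-diffusers --local-dir ./sd3-medium
+
+ # DICEPTION weights
+ huggingface-cli download Canyu/DICEPTION --local-dir ./diception
+ ```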
+
+ #### ▢️ Option 1: Simple Inference Script
+ For single-image inference:
+
+ ```bash
+ python inference.py \
+     --image path/to/your/image.jpg \
+     --prompt "[[image2depth]]" \
+     --pretrained_model_path PATH_TO_SD3 \
+     --diception_path PATH_TO_DICEPTION_v1.PTH \
+     --output_dir ./outputs \
+     --guidance_scale 2 \
+     --num_inference_steps 28
+ ```
+
+ **With coordinate points** (for interactive segmentation):
+
+ ```bash
+ python inference.py \
+     --image path/to/your/image.jpg \
+     --prompt "[[image2segmentation]]" \
+     --pretrained_model_path PATH_TO_SD3 \
+     --diception_path PATH_TO_DICEPTION_v1.PTH \
+     --output_dir ./outputs \
+     --guidance_scale 2 \
+     --num_inference_steps 28 \
+     --points "0.3,0.5;0.7,0.2"
+ ```
+
+ The `--points` parameter accepts coordinates in the format `"y1,x1;y2,x2;y3,x3"`, where:
+ - Coordinates are normalized to the [0,1] range
+ - Each point is given as (y, x), i.e. y = row / image_height and x = column / image_width
+ - Multiple points are separated by semicolons
+ - At most 5 points are supported (see the conversion example below)
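+
+ For example, a click at pixel row 150, column 300 in a 500Γ—600 (heightΓ—width) image maps to `0.3,0.5`; a quick one-liner for the conversion (illustrative, not part of the repo):
+
+ ```bash
+ python -c "print(f'{150/500},{300/600}')"  # -> 0.3,0.5
+ ```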
+
+ #### πŸ“¦ Option 2: Batch Inference
+ For batch processing with a JSON dataset:
+
+ ```bash
+ python batch_inference.py \
+     --pretrained_model_path PATH_TO_SD3 \
+     --diception_path PATH_TO_DICEPTION_v1.PTH \
+     --input_path example_batch.json \
+     --data_root_path ./ \
+     --save_path ./batch_results \
+     --batch_size 4 \
+     --guidance_scale 2 \
+     --num_inference_steps 28
+     # add --save_npy to also save raw depth/normal values as .npy files
+ ```
+
+ **JSON Format for Batch Inference**:
+ The input JSON file should contain a list of tasks in the following format:
+ ```json
+ [
+     {
+         "input": "path/to/image1.jpg",
+         "caption": "[[image2segmentation]]"
+     },
+     {
+         "input": "path/to/image2.jpg",
+         "caption": "[[image2depth]]"
+     },
+     {
+         "input": "path/to/image3.jpg",
+         "caption": "[[image2segmentation]]",
+         "target": {
+             "path": "path/to/sa1b.json"
+         }
+     }
+ ]
+ ```
+ For interactive segmentation, the optional `target.path` field can point to a ground-truth annotation JSON (e.g. SA-1B format); for convenience, a region is randomly selected from it to serve as the point prompt.
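+
+ As a minimal sketch, a small batch file can be written from the shell (image paths are placeholders, presumably resolved relative to `--data_root_path`):
+
+ ```bash
+ cat > example_batch.json <<'EOF'
+ [
+     {"input": "images/example1.jpg", "caption": "[[image2depth]]"},
+     {"input": "images/example2.jpg", "caption": "[[image2normal]]"}
+ ]
+ EOF
+ ```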
+
+ ### πŸ“‹ Supported Tasks
+
+ DICEPTION supports various visual perception tasks:
+ - **Depth Estimation**: `[[image2depth]]`
+ - **Surface Normal Estimation**: `[[image2normal]]`
+ - **Pose Estimation**: `[[image2pose]]`
+ - **Interactive Segmentation**: `[[image2segmentation]]`
+ - **Semantic Segmentation**: `[[image2semantic]]` followed by a COCO category, e.g. `[[image2semantic]] person`
+ - **Entity Segmentation**: `[[image2entity]]`
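+
+ Switching tasks only requires changing the `--prompt` string; for instance, segmenting all people with the single-image script from Option 1:
+
+ ```bash
+ python inference.py \
+     --image path/to/your/image.jpg \
+     --prompt "[[image2semantic]] person" \
+     --pretrained_model_path PATH_TO_SD3 \
+     --diception_path PATH_TO_DICEPTION_v1.PTH \
+     --output_dir ./outputs \
+     --guidance_scale 2 \
+     --num_inference_steps 28
+ ```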
+
+ ### πŸ’‘ Inference Tips
+
+ - **General settings**: For the best overall results, use `--num_inference_steps 28` and `--guidance_scale 2.0`.
+ - **1-step/few-step inference**: We found that flow-matching diffusion models naturally support few-step inference, especially for tasks like depth and surface normal estimation. DICEPTION can run with `--num_inference_steps 1` and `--guidance_scale 1.0` with barely any loss in quality, so consider this setting if you prioritize speed (see the example after this list). We provide a detailed analysis in our NeurIPS paper.
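+
+ For example, the fast 1-step setting with the single-image script:
+
+ ```bash
+ python inference.py \
+     --image path/to/your/image.jpg \
+     --prompt "[[image2depth]]" \
+     --pretrained_model_path PATH_TO_SD3 \
+     --diception_path PATH_TO_DICEPTION_v1.PTH \
+     --output_dir ./outputs \
+     --guidance_scale 1.0 \
+     --num_inference_steps 1
+ ```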
+
+ ### πŸ—ΊοΈ Plan
+ - [X] Release inference code and pretrained model v1
+ - [ ] Release training code
+ - [ ] Release few-shot finetuning code
+
+ ## 🎫 License
+
+ For academic use, this project is licensed under [the 2-clause BSD License](https://opensource.org/license/bsd-2-clause).
+ For commercial use, please contact [Chunhua Shen](mailto:chhshen@gmail.com).
+
+ ## πŸ–ŠοΈ Citation
+ ```
+ @article{zhao2025diception,
+     title={Diception: A generalist diffusion model for visual perceptual tasks},
+     author={Zhao, Canyu and Liu, Mingyu and Zheng, Huanyi and Zhu, Muzhi and Zhao, Zhiyue and Chen, Hao and He, Tong and Shen, Chunhua},
+     journal={arXiv preprint arXiv:2502.17157},
+     year={2025}
+ }
+ ```