Improve model card: add metadata, links, and detailed usage

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +168 -2
README.md CHANGED
@@ -1,7 +1,173 @@
  ---
  license: bsd-2-clause
+ pipeline_tag: image-to-image
+ library_name: diffusers
  ---

- ## References
-
- * [Model Paper](https://arxiv.org/abs/2502.17157)
+ <p align="center">
+ <img src="https://github.com/aim-uofa/Diception/raw/main/assets/logo.png" height=200>
+ </p>
+ <hr>
+ <div align="center">
+
+ # DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
+
+ <p align="center">
+ <a href="https://huggingface.co/papers/2502.17157"><b>πŸ“„ Paper Link</b></a> |
+ <a href="https://aim-uofa.github.io/Diception/"><b>πŸ“– Project Page</b></a> |
+ <a href="https://github.com/aim-uofa/Diception"><b>πŸ’» GitHub Code</b></a> |
+ <a href="https://huggingface.co/spaces/Canyu/Diception-Demo"><b>πŸ€— Huggingface Demo</b></a>
+ </p>
+
+ </div>
+
+ This repository contains DICEPTION, a robust generalist perception model that addresses multiple visual tasks with high efficiency. It leverages text-to-image diffusion models pre-trained on billions of images to achieve performance comparable to state-of-the-art single-task specialist models, at significantly lower computational cost and with far less data.
+
+ ## Abstract
+
+ This paper's primary objective is to develop a robust generalist perception model capable of addressing multiple tasks under constraints of computational resources and limited training data. We leverage text-to-image diffusion models pre-trained on billions of images and successfully introduce our DICEPTION, a visual generalist model. Exhaustive evaluations demonstrate that DICEPTION effectively tackles diverse perception tasks, even achieving performance comparable to SOTA single-task specialist models. Specifically, we achieve results on par with SAM-vit-h using only 0.06% of their data (e.g., 600K vs. 1B pixel-level annotated images). We designed comprehensive experiments on architectures and input paradigms, demonstrating that the key to successfully re-purposing a single diffusion model for multiple perception tasks lies in maximizing the preservation of the pre-trained model's prior knowledge. Consequently, DICEPTION can be trained with substantially lower computational costs than conventional models requiring training from scratch. Furthermore, adapting DICEPTION to novel tasks is highly efficient, necessitating fine-tuning on as few as 50 images and approximately 1% of its parameters. Finally, we demonstrate that a subtle application of classifier-free guidance can improve the model's performance on depth and normal estimation. We also show that pixel-aligned training, as is characteristic of perception tasks, significantly enhances the model's ability to preserve fine details. DICEPTION offers valuable insights and presents a promising direction for the development of advanced diffusion-based visual generalist models.
+
+ ## πŸ“° News
+
+ - 2025-09-21: πŸš€ Model and inference code released
+ - 2025-09-19: 🌟 Accepted as NeurIPS 2025 Spotlight
+ - 2025-02-25: πŸ“ Paper released
+
+ ## πŸ› οΈ Installation
+ ```bash
+ conda create -n diception python=3.10 -y
+
+ conda activate diception
+
+ pip install -r requirements.txt
+ ```
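+
+ To sanity-check the environment (assuming the pinned requirements include PyTorch and Diffusers), you can run:
+
+ ```bash
+ python -c "import torch, diffusers; print(torch.__version__, diffusers.__version__, torch.cuda.is_available())"
+ ```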
+
+ ## πŸ‘Ύ Inference
+
+ ### ⚑ Quick Start
+
+ #### 🧩 Model Setup
+
+ 1. **Download SD3 Base Model**:
+ Download the Stable Diffusion 3 medium model from:
+ https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers
+
+ 2. **Download Trained Weights**:
+ Download the DICEPTION weights from Hugging Face: https://huggingface.co/Canyu/DICEPTION (a CLI download sketch follows this list)
+
+ 3. **Update Paths**:
+ Set `--pretrained_model_path` to your local SD3 path, and set `--diception_path` to the local path of the downloaded `DICEPTION_v1.pth`.
+
+ 4. **Sample JSON for Batch Inference**:
+ We provide several JSON examples for batch inference in the `DATA/jsons/evaluate` directory.
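+
+ For steps 1 and 2, one convenient option (assuming `huggingface_hub` is installed, e.g. via the requirements) is the Hugging Face CLI; the local directories below are placeholders:
+
+ ```bash
+ # Base model (gated; accept the license on its model page first)
+ huggingface-cli download stabilityai/stable-diffusion-3-medium-diffusers --local-dir ./sd3-medium
+
+ # DICEPTION weights
+ huggingface-cli download Canyu/DICEPTION --local-dir ./diception
+ ```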
+
+ #### ▢️ Option 1: Simple Inference Script
+ For single-image inference:
+
+ ```bash
+ python inference.py \
+     --image path/to/your/image.jpg \
+     --prompt "[[image2depth]]" \
+     --pretrained_model_path PATH_TO_SD3 \
+     --diception_path PATH_TO_DICEPTION_v1.PTH \
+     --output_dir ./outputs \
+     --guidance_scale 2 \
+     --num_inference_steps 28
+ ```
+
+ **With coordinate points** (for interactive segmentation):
+
+ ```bash
+ python inference.py \
+     --image path/to/your/image.jpg \
+     --prompt "[[image2segmentation]]" \
+     --pretrained_model_path PATH_TO_SD3 \
+     --diception_path PATH_TO_DICEPTION_v1.PTH \
+     --output_dir ./outputs \
+     --guidance_scale 2 \
+     --num_inference_steps 28 \
+     --points "0.3,0.5;0.7,0.2"
+ ```
+
+ The `--points` parameter accepts coordinates in the format `"y1,x1;y2,x2;y3,x3"`, where:
+ - Coordinates are normalized to the [0,1] range
+ - Each point is given as (y, x), i.e. y = row / image_height and x = column / image_width
+ - Multiple points are separated by semicolons
+ - At most 5 points are supported (see the conversion example below)
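+
+ For example, a click at pixel row 150, column 300 in a 500Γ—600 (heightΓ—width) image maps to `0.3,0.5`; a quick one-liner for the conversion (illustrative, not part of the repo):
+
+ ```bash
+ python -c "print(f'{150/500},{300/600}')"  # -> 0.3,0.5
+ ```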
+
+ #### πŸ“¦ Option 2: Batch Inference
+ For batch processing with a JSON dataset:
+
+ ```bash
+ python batch_inference.py \
+     --pretrained_model_path PATH_TO_SD3 \
+     --diception_path PATH_TO_DICEPTION_v1.PTH \
+     --input_path example_batch.json \
+     --data_root_path ./ \
+     --save_path ./batch_results \
+     --batch_size 4 \
+     --guidance_scale 2 \
+     --num_inference_steps 28
+     # add --save_npy to also save raw depth/normal values as .npy files
+ ```
+
+ **JSON Format for Batch Inference**:
+ The input JSON file should contain a list of tasks in the following format:
+ ```json
+ [
+     {
+         "input": "path/to/image1.jpg",
+         "caption": "[[image2segmentation]]"
+     },
+     {
+         "input": "path/to/image2.jpg",
+         "caption": "[[image2depth]]"
+     },
+     {
+         "input": "path/to/image3.jpg",
+         "caption": "[[image2segmentation]]",
+         "target": {
+             "path": "path/to/sa1b.json"
+         }
+     }
+ ]
+ ```
+ For interactive segmentation, the optional `target.path` field can point to a ground-truth annotation JSON (e.g. SA-1B format); for convenience, a region is randomly selected from it to serve as the point prompt.
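+
+ As a minimal sketch, a small batch file can be written from the shell (image paths are placeholders, presumably resolved relative to `--data_root_path`):
+
+ ```bash
+ cat > example_batch.json <<'EOF'
+ [
+     {"input": "images/example1.jpg", "caption": "[[image2depth]]"},
+     {"input": "images/example2.jpg", "caption": "[[image2normal]]"}
+ ]
+ EOF
+ ```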
+
+ ### πŸ“‹ Supported Tasks
+
+ DICEPTION supports various visual perception tasks:
+ - **Depth Estimation**: `[[image2depth]]`
+ - **Surface Normal Estimation**: `[[image2normal]]`
+ - **Pose Estimation**: `[[image2pose]]`
+ - **Interactive Segmentation**: `[[image2segmentation]]`
+ - **Semantic Segmentation**: `[[image2semantic]]` followed by a COCO category, e.g. `[[image2semantic]] person`
+ - **Entity Segmentation**: `[[image2entity]]`
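+
+ Switching tasks only requires changing the `--prompt` string; for instance, segmenting all people with the single-image script from Option 1:
+
+ ```bash
+ python inference.py \
+     --image path/to/your/image.jpg \
+     --prompt "[[image2semantic]] person" \
+     --pretrained_model_path PATH_TO_SD3 \
+     --diception_path PATH_TO_DICEPTION_v1.PTH \
+     --output_dir ./outputs \
+     --guidance_scale 2 \
+     --num_inference_steps 28
+ ```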
+
+ ### πŸ’‘ Inference Tips
+
+ - **General settings**: For the best overall results, use `--num_inference_steps 28` and `--guidance_scale 2.0`.
+ - **1-step/few-step inference**: We found that flow-matching diffusion models naturally support few-step inference, especially for tasks like depth and surface normal estimation. DICEPTION can run with `--num_inference_steps 1` and `--guidance_scale 1.0` with barely any loss in quality, so consider this setting if you prioritize speed (see the example after this list). We provide a detailed analysis in our NeurIPS paper.
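+
+ For example, the fast 1-step setting with the single-image script:
+
+ ```bash
+ python inference.py \
+     --image path/to/your/image.jpg \
+     --prompt "[[image2depth]]" \
+     --pretrained_model_path PATH_TO_SD3 \
+     --diception_path PATH_TO_DICEPTION_v1.PTH \
+     --output_dir ./outputs \
+     --guidance_scale 1.0 \
+     --num_inference_steps 1
+ ```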
+
+ ### πŸ—ΊοΈ Plan
+ - [X] Release inference code and pretrained model v1
+ - [ ] Release training code
+ - [ ] Release few-shot finetuning code
+
+ ## 🎫 License
+
+ For academic use, this project is licensed under [the 2-clause BSD License](https://opensource.org/license/bsd-2-clause).
+ For commercial use, please contact [Chunhua Shen](mailto:chhshen@gmail.com).
+
+ ## πŸ–ŠοΈ Citation
+ ```
+ @article{zhao2025diception,
+     title={Diception: A generalist diffusion model for visual perceptual tasks},
+     author={Zhao, Canyu and Liu, Mingyu and Zheng, Huanyi and Zhu, Muzhi and Zhao, Zhiyue and Chen, Hao and He, Tong and Shen, Chunhua},
+     journal={arXiv preprint arXiv:2502.17157},
+     year={2025}
+ }
+ ```