DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
Paper Link | Project Page | GitHub Code | Huggingface Demo
This repository contains the DICEPTION model, a robust generalist perception model capable of addressing multiple visual tasks with high efficiency. It leverages text-to-image diffusion models pre-trained on billions of images to achieve performance comparable to state-of-the-art single-task specialist models, with significantly lower computational costs and data requirements.
Abstract
This paper's primary objective is to develop a robust generalist perception model capable of addressing multiple tasks under constraints of computational resources and limited training data. We leverage text-to-image diffusion models pre-trained on billions of images and successfully introduce our DICEPTION, a visual generalist model. Exhaustive evaluations demonstrate that DICEPTION effectively tackles diverse perception tasks, even achieving performance comparable to SOTA single-task specialist models. Specifically, we achieve results on par with SAM-vit-h using only 0.06% of their data (e.g., 600K vs. 1B pixel-level annotated images). We designed comprehensive experiments on architectures and input paradigms, demonstrating that the key to successfully re-purposing a single diffusion model for multiple perception tasks lies in maximizing the preservation of the pre-trained model's prior knowledge. Consequently, DICEPTION can be trained with substantially lower computational costs than conventional models requiring training from scratch. Furthermore, adapting DICEPTION to novel tasks is highly efficient, necessitating fine-tuning on as few as 50 images and approximately 1% of its parameters. Finally, we demonstrate that a subtle application of classifier-free guidance can improve the model's performance on depth and normal estimation. We also show that pixel-aligned training, as is characteristic of perception tasks, significantly enhances the model's ability to preserve fine details. DICEPTION offers valuable insights and presents a promising direction for the development of advanced diffusion-based visual generalist models.
News
- 2025-09-21: Model and inference code released
- 2025-09-19: Accepted as NeurIPS 2025 Spotlight
- 2025-02-25: Paper released
Installation
conda create -n diception python=3.10 -y
conda activate diception
pip install -r requirements.txt
Inference
Quick Start
Model Setup
Download SD3 Base Model: Download the Stable Diffusion 3 medium model from: https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers
Download Trained Weights: Please download the model from Hugging Face: https://huggingface.co/Canyu/DICEPTION
Update Paths: Set --pretrained_model_path to your local SD3 path, and set --diception_path to the local path of the downloaded DICEPTION_v1.pth.
Sample JSON for Batch Inference: We provide several JSON examples for batch inference in the DATA/jsons/evaluate directory.
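If you prefer to fetch both checkpoints from Python, the snippet below is a minimal sketch using huggingface_hub; the exact checkpoint filename on the Hub (assumed here to be DICEPTION_v1.pth, as referenced above) is an assumption, so adjust it to match what you actually download.

from huggingface_hub import snapshot_download, hf_hub_download

# Download the SD3-medium diffusers weights (requires accepting the license on the Hub).
sd3_path = snapshot_download("stabilityai/stable-diffusion-3-medium-diffusers")

# Download the DICEPTION checkpoint; the filename is assumed to be DICEPTION_v1.pth.
diception_path = hf_hub_download("Canyu/DICEPTION", filename="DICEPTION_v1.pth")

print(sd3_path)        # pass as --pretrained_model_path
print(diception_path)  # pass as --diception_path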
Option 1: Simple Inference Script
For single image inference:
python inference.py \
--image path/to/your/image.jpg \
--prompt "[[image2depth]]" \
--pretrained_model_path PATH_TO_SD3 \
--diception_path PATH_TO_DICEPTION_v1.PTH \
--output_dir ./outputs \
--guidance_scale 2 \
--num_inference_steps 28
With coordinate points (for interactive segmentation):
python inference.py \
--image path/to/your/image.jpg \
--prompt "[[image2segmentation]]" \
--pretrained_model_path PATH_TO_SD3 \
--diception_path PATH_TO_DICEPTION_v1.PTH \
--output_dir ./outputs \
--guidance_scale 2 \
--num_inference_steps 28 \
--points "0.3,0.5;0.7,0.2"
The --points parameter accepts coordinates in the format "y1,x1;y2,x2;y3,x3", where:
- Coordinates are normalized to the [0, 1] range
- Each point is given as (y, x), i.e. y = row / image_height and x = column / image_width
- Multiple points are separated by semicolons
- At most 5 points are supported (see the conversion sketch below)
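If your clicks come in pixel coordinates, the following is a minimal sketch (not part of the repository) of how one might build the --points string; the (y, x) ordering and the 5-point limit follow the description above.

def make_points_arg(pixel_points, image_height, image_width, max_points=5):
    """Convert (row, col) pixel coordinates into the normalized 'y,x;y,x' string."""
    if len(pixel_points) > max_points:
        raise ValueError(f"At most {max_points} points are supported")
    parts = []
    for row, col in pixel_points:
        y = row / image_height
        x = col / image_width
        parts.append(f"{y:.3f},{x:.3f}")
    return ";".join(parts)

# Example: two clicks on a 1000x2000 (H x W) image.
print(make_points_arg([(300, 1000), (700, 400)], image_height=1000, image_width=2000))
# -> "0.300,0.500;0.700,0.200"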
Option 2: Batch Inference
For batch processing with a JSON dataset:
python batch_inference.py \
--pretrained_model_path PATH_TO_SD3 \
--diception_path PATH_TO_DICEPTION_v1.PTH \
--input_path example_batch.json \
--data_root_path ./ \
--save_path ./batch_results \
--batch_size 4 \
--guidance_scale 2 \
--num_inference_steps 28
# add --save_npy to also save raw depth and normal values as .npy files
JSON Format for Batch Inference: The input JSON file should contain a list of tasks in the following format:
[
  {
    "input": "path/to/image1.jpg",
    "caption": "[[image2segmentation]]"
  },
  {
    "input": "path/to/image2.jpg",
    "caption": "[[image2depth]]"
  },
  {
    "input": "path/to/image3.jpg",
    "caption": "[[image2segmentation]]",
    "target": {
      "path": "path/to/sa1b.json"
    }
  }
]
For convenience, when a "target" entry points to a ground-truth JSON (e.g., an SA-1B annotation file), a region is randomly selected from it to serve as the point prompt.
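A batch file like the one above can also be generated programmatically; the sketch below is only an illustration (the image paths and task assignments are placeholders, not files shipped with the repo).

import json

entries = [
    {"input": "path/to/image1.jpg", "caption": "[[image2segmentation]]"},
    {"input": "path/to/image2.jpg", "caption": "[[image2depth]]"},
    # Optional ground-truth annotations from which a point prompt is sampled.
    {"input": "path/to/image3.jpg", "caption": "[[image2segmentation]]",
     "target": {"path": "path/to/sa1b.json"}},
]

with open("example_batch.json", "w") as f:
    json.dump(entries, f, indent=2)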
Supported Tasks
DICEPTION supports various vision perception tasks:
- Depth Estimation: [[image2depth]]
- Surface Normal Estimation: [[image2normal]]
- Pose Estimation: [[image2pose]]
- Interactive Segmentation: [[image2segmentation]]
- Semantic Segmentation: [[image2semantic]] followed by a COCO category, e.g. [[image2semantic]] person
- Entity Segmentation: [[image2entity]]
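The task prompts above are plain strings passed via --prompt. A small, purely hypothetical helper like the one below can be handy for assembling them, e.g. appending the COCO category for semantic segmentation.

TASK_PROMPTS = {
    "depth": "[[image2depth]]",
    "normal": "[[image2normal]]",
    "pose": "[[image2pose]]",
    "interactive_seg": "[[image2segmentation]]",
    "semantic_seg": "[[image2semantic]]",
    "entity_seg": "[[image2entity]]",
}

def build_prompt(task, category=None):
    """Return the --prompt string for a task; semantic segmentation takes a COCO category."""
    prompt = TASK_PROMPTS[task]
    if task == "semantic_seg" and category is not None:
        prompt = f"{prompt} {category}"
    return prompt

print(build_prompt("semantic_seg", "person"))  # -> "[[image2semantic]] person"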
Inference Tips
- General settings: For best overall results, use --num_inference_steps 28 and --guidance_scale 2.0.
- 1-step/few-step inference: We found that flow-matching diffusion models naturally support few-step inference, especially for tasks like depth and surface normal estimation. DICEPTION can run with --num_inference_steps 1 and --guidance_scale 1.0 with barely any quality loss. If you prioritize speed, consider this setting (see the example after this list). We provide a detailed analysis in our NeurIPS paper.
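For reference, a speed-oriented single-step depth run could be launched as below; this is just the CLI from Option 1 wrapped in Python for consistency with the other sketches, with the same placeholder paths.

import subprocess

# Fast single-step depth estimation (speed-oriented setting from the tip above).
subprocess.run([
    "python", "inference.py",
    "--image", "path/to/your/image.jpg",
    "--prompt", "[[image2depth]]",
    "--pretrained_model_path", "PATH_TO_SD3",
    "--diception_path", "PATH_TO_DICEPTION_v1.PTH",
    "--output_dir", "./outputs",
    "--guidance_scale", "1.0",
    "--num_inference_steps", "1",
], check=True)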
Plan
- Release inference code and pretrained model v1
- Release training code
- Release few-shot finetuning code
License
For academic use, this project is licensed under the 2-clause BSD License. For commercial use, please contact Chunhua Shen.
Citation
@article{zhao2025diception,
title={Diception: A generalist diffusion model for visual perceptual tasks},
author={Zhao, Canyu and Liu, Mingyu and Zheng, Huanyi and Zhu, Muzhi and Zhao, Zhiyue and Chen, Hao and He, Tong and Shen, Chunhua},
journal={arXiv preprint arXiv:2502.17157},
year={2025}
}