DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks

πŸ“„ Paper Link | πŸ“– Project Page | πŸ’» GitHub Code | πŸ€— Huggingface Demo

This repository contains the DICEPTION model, a robust generalist perception model capable of addressing multiple visual tasks with high efficiency. It leverages text-to-image diffusion models pre-trained on billions of images to achieve performance comparable to state-of-the-art single-task specialist models, with significantly lower computational costs and data requirements.

Abstract

This paper's primary objective is to develop a robust generalist perception model capable of addressing multiple tasks under constraints of computational resources and limited training data. We leverage text-to-image diffusion models pre-trained on billions of images and introduce DICEPTION, a visual generalist model. Exhaustive evaluations demonstrate that DICEPTION effectively tackles diverse perception tasks, even achieving performance comparable to SOTA single-task specialist models. Specifically, we achieve results on par with SAM-vit-h using only 0.06% of its data (600K vs. 1B pixel-level annotated images). We designed comprehensive experiments on architectures and input paradigms, demonstrating that the key to successfully re-purposing a single diffusion model for multiple perception tasks lies in maximizing the preservation of the pre-trained model's prior knowledge. Consequently, DICEPTION can be trained at substantially lower computational cost than conventional models that require training from scratch. Furthermore, adapting DICEPTION to novel tasks is highly efficient, requiring fine-tuning on as few as 50 images and approximately 1% of its parameters. Finally, we demonstrate that a subtle application of classifier-free guidance can improve the model's performance on depth and normal estimation, and that pixel-aligned training, as is characteristic of perception tasks, significantly enhances the model's ability to preserve fine details. DICEPTION offers valuable insights and presents a promising direction for the development of advanced diffusion-based visual generalist models.

πŸ“° News

  • 2025-09-21: πŸš€ Model and inference code released
  • 2025-09-19: 🌟 Accepted as NeurIPS 2025 Spotlight
  • 2025-02-25: πŸ“ Paper released

πŸ› οΈ Installation

conda create -n diception python=3.10 -y
conda activate diception
pip install -r requirements.txt

πŸ‘Ύ Inference

⚑ Quick Start

🧩 Model Setup

  1. Download SD3 Base Model: Download the Stable Diffusion 3 medium model from: https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers

  2. Download Trained Weights: Please download the model from Hugging Face: https://huggingface.co/Canyu/DICEPTION

  3. Update Paths: Set --pretrained_model_path to your SD3 path, and set --diception_path to the local path of the downloaded DICEPTION_v1.pth.

  4. Sample JSON for Batch Inference: We provide several JSON examples for batch inference in the DATA/jsons/evaluate directory.
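If you prefer scripted downloads, here is a minimal sketch using huggingface_hub (an assumption for convenience; downloading through the web pages above works just as well):

from huggingface_hub import hf_hub_download, snapshot_download

# SD3 medium is gated: accept the license on its Hugging Face page and run
# `huggingface-cli login` before downloading.
sd3_path = snapshot_download("stabilityai/stable-diffusion-3-medium-diffusers")

# Single checkpoint file referenced by --diception_path.
diception_path = hf_hub_download("Canyu/DICEPTION", "DICEPTION_v1.pth")

print(sd3_path)
print(diception_path)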

▢️ Option 1: Simple Inference Script

For single image inference:

python inference.py \
    --image path/to/your/image.jpg \
    --prompt "[[image2depth]]" \
    --pretrained_model_path PATH_TO_SD3 \
    --diception_path PATH_TO_DICEPTION_v1.PTH \
    --output_dir ./outputs \
    --guidance_scale 2 \
    --num_inference_steps 28

With coordinate points (for interactive segmentation):

python inference.py \
    --image path/to/your/image.jpg \
    --prompt "[[image2segmentation]]" \
    --pretrained_model_path PATH_TO_SD3 \
    --diception_path PATH_TO_DICEPTION_v1.PTH \
    --output_dir ./outputs \
    --guidance_scale 2 \
    --num_inference_steps 28 \
    --points "0.3,0.5;0.7,0.2"

The --points parameter accepts coordinates in the format "y1,x1;y2,x2;y3,x3" (see the helper sketch below), where:

  • Coordinates are normalized to the [0,1] range
  • Each point is given as (y,x), with y = row / image_height and x = column / image_width
  • Multiple points are separated by semicolons
  • At most 5 points are supported
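A minimal Python sketch of such a helper (hypothetical, not part of this repo) that converts pixel coordinates into the normalized string:

def format_points(pixel_points, image_height, image_width):
    """Build the --points string from (row, col) pixel coordinates (max 5)."""
    assert len(pixel_points) <= 5, "at most 5 points are supported"
    return ";".join(
        f"{row / image_height:.3f},{col / image_width:.3f}"
        for row, col in pixel_points
    )

# Two clicks on a 480x640 image -> "0.300,0.500;0.700,0.200"
print(format_points([(144, 320), (336, 128)], 480, 640))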

πŸ“¦ Option 2: Batch Inference

For batch processing with a JSON dataset:

python batch_inference.py \
    --pretrained_model_path PATH_TO_SD3 \
    --diception_path PATH_TO_DICEPTION_v1.PTH \
    --input_path example_batch.json \
    --data_root_path ./ \
    --save_path ./batch_results \
    --batch_size 4 \
    --guidance_scale 2 \
    --num_inference_steps 28
    # add --save_npy to save raw depth/normal values as .npy files
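To inspect the raw depth/normal values written by --save_npy, load the arrays with NumPy; the exact output filenames under --save_path are an assumption here, so this sketch simply globs the results directory:

import glob
import numpy as np

# Print shape and value range of every .npy array the run produced.
for path in sorted(glob.glob("batch_results/**/*.npy", recursive=True)):
    arr = np.load(path)
    print(path, arr.shape, float(arr.min()), float(arr.max()))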

JSON Format for Batch Inference: The input JSON file should contain a list of tasks in the following format:

[
  {
    "input": "path/to/image1.jpg",
    "caption": "[[image2segmentation]]"
  },
  {
    "input": "path/to/image2.jpg",
    "caption": "[[image2depth]]"
  },
  {
    "input": "path/to/image3.jpg",
    "caption": "[[image2segmentation]]",
    "target": {
      "path": "path/to/sa1b.json"
    }
  }
]

The optional target.path field points to a ground-truth annotation JSON (e.g., from SA-1B); for convenience, a region is randomly selected from it to serve as the point prompt.
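To build such a file for a whole folder of images, a minimal sketch (the glob pattern and task prompt are placeholders to adapt):

import glob
import json

# One depth-estimation task per image in the folder.
tasks = [
    {"input": path, "caption": "[[image2depth]]"}
    for path in sorted(glob.glob("images/*.jpg"))
]

with open("example_batch.json", "w") as f:
    json.dump(tasks, f, indent=2)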

πŸ“‹ Supported Tasks

DICEPTION supports the following visual perception tasks (a small driver sketch follows the list):

  • Depth Estimation: [[image2depth]]
  • Surface Normal Estimation: [[image2normal]]
  • Pose Estimation: [[image2pose]]
  • Interactive Segmentation: [[image2segmentation]]
  • Semantic Segmentation: [[image2semantic]] followed by a COCO category name, e.g., [[image2semantic]] person
  • Entity Segmentation: [[image2entity]]

πŸ’‘ Inference Tips

  • General settings: For best overall results, use --num_inference_steps 28 and --guidance_scale 2.0.
  • 1-step/few-step inference: We found that flow-matching diffusion models naturally support few-step inference, especially for tasks like depth and surface normal estimation. DICEPTION can run with --num_inference_steps 1 and --guidance_scale 1.0 with barely any quality loss, so consider this setting if you prioritize speed. We provide a detailed analysis in our NeurIPS paper.

πŸ—ΊοΈ Plan

  • Release inference code and pretrained model v1
  • Release training code
  • Release few-shot finetuning code

🎫 License

For academic use, this project is licensed under the 2-clause BSD License. For commercial use, please contact Chunhua Shen.

πŸ–ŠοΈ Citation

@article{zhao2025diception,
  title={Diception: A generalist diffusion model for visual perceptual tasks},
  author={Zhao, Canyu and Liu, Mingyu and Zheng, Huanyi and Zhu, Muzhi and Zhao, Zhiyue and Chen, Hao and He, Tong and Shen, Chunhua},
  journal={arXiv preprint arXiv:2502.17157},
  year={2025}
}