---
license: cc-by-nc-sa-4.0
---
# ColorizeDiffusion: Adjustable Sketch Colorization with Reference Image and Text

<div align="center">

[![arXiv Paper](https://img.shields.io/badge/arXiv-2401.01456%20(base)-B31B1B?style=flat&logo=arXiv)](https://arxiv.org/abs/2401.01456)
[![WACV 2025](https://img.shields.io/badge/WACV%202025-v1-0CA4A5?style=flat&logo=Semantic%20Web)](https://openaccess.thecvf.com/content/WACV2025/html/Yan_ColorizeDiffusion_Improving_Reference-Based_Sketch_Colorization_with_Latent_Diffusion_Model_WACV_2025_paper.html)
[![arXiv v1.5 Paper](https://img.shields.io/badge/arXiv-2502.19937%20(v1.5)-B31B1B?style=flat&logo=arXiv)](https://arxiv.org/abs/2502.19937)
[![arXiv v2 Paper](https://img.shields.io/badge/arXiv-2504.06895%20(v2)-B31B1B?style=flat&logo=arXiv)](https://arxiv.org/abs/2504.06895)
[![Model Weights](https://img.shields.io/badge/Hugging%20Face-Model%20Weights-FF9D00?style=flat&logo=Hugging%20Face)](https://huggingface.co/tellurion/ColorizeDiffusion/tree/main)
[![License](https://img.shields.io/badge/License-CC--BY--NC--SA%204.0-4CAF50?style=flat&logo=Creative%20Commons)](https://github.com/tellurion-kanata/colorizeDiffusion/blob/master/LICENSE)

</div>

![img](assets/teaser.png)

(April 2025)
Official implementation of ColorizeDiffusion.  

ColorizeDiffusion is an SD-based colorization framework that achieves high-quality colorization results with arbitrary sketch and reference input pairs.

Foundational paper for this repository: [ColorizeDiffusion (e-print)](https://arxiv.org/abs/2401.01456).  
***Version 1*** - Base training, 512px. Released; checkpoint names start with **mult**.  
***Version 1.5*** - Resolves spatial entanglement, 512px. Released; checkpoint names start with **switch**.  
***Version 2*** - Enhances background and style transfer, 768px. Released; checkpoint names start with **v2**.  
***Version XL*** - Enhances embedding guidance for character colorization and geometry disentanglement, 1024px. Available soon.  


## Getting Started

-------------------------------------------------------------------------------------------
```shell
conda env create -f environment.yaml
conda activate hf
```

## User Interface

-------------------------------------------------------------------------------------------
We provide a fully featured UI. To launch it, run:
```shell
python -u app.py
```
The default server address is http://localhost:7860.

#### Important inference options
| Options               | Description                                                                                       |
|:----------------------|:--------------------------------------------------------------------------------------------------|
| BG enhance            | Low-level feature injection for v2 models.                                                        |
| FG enhance            | Not useful for the currently open-sourced models.                                                 |
| Reference strength    | Decrease it to increase semantic fidelity to the sketch input.                                    |
| Foreground strength   | Similar to reference strength, but applied only to the foreground region. Requires FG or BG enhance to be activated. |
| Preprocessor          | Sketch preprocessing. **Extract** is suggested if the sketch input is a complicated pencil drawing. |
| Line extractor        | Line extractor used when the preprocessor is **Extract**.                                         |
| Sketch guidance scale | Classifier-free guidance scale of the sketch image; 1 is suggested (see the illustration below).  |
| Attention injection   | Noised low-level feature injection, 2x inference time.                                            |
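
As a generic illustration of what the guidance scale does (standard classifier-free guidance, not necessarily this repository's exact implementation), a scale of 1 simply keeps the conditional prediction, while larger values amplify the conditioning signal:
```python
import torch

def cfg_combine(eps_uncond: torch.Tensor, eps_cond: torch.Tensor, scale: float) -> torch.Tensor:
    # Standard classifier-free guidance blend: scale = 1 keeps only the
    # conditional prediction, larger scales amplify the conditioning signal.
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Sanity check: with scale = 1 the result equals the conditional prediction.
eps_u, eps_c = torch.randn(2, 4, 8, 8), torch.randn(2, 4, 8, 8)
assert torch.allclose(cfg_combine(eps_u, eps_c, 1.0), eps_c)
```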


### 768-level Cross-content colorization results (from v2)
![img](assets/cross-1.png)
![img](assets/cross-2.png)
### 1536-level Character colorization results (from XL)
![img](assets/disentanglement2.png)
![img](assets/demon.png)

## Manipulation

-------------------------------------------------------------------------------------------
The colorization results can be manipulated using text prompts; see [ColorizeDiffusion (e-print)](https://arxiv.org/abs/2401.01456).

Manipulation is deactivated by default. To activate it, run:
```shell
python -u app.py -manipulate
```

For local manipulations, a visualization is provided to show the correlation between each prompt and tokens in the reference image.


The manipulation result and correlation visualization of the following settings:

    Target prompt: the girl's blonde hair
    Anchor prompt: the girl's brown hair
    Control prompt: the girl's brown hair
    Target scale: 8
    Enhanced: false
    Thresholds: 0.5, 0.55, 0.65, 0.95

![img](assets/preview1.png)
![img](assets/preview2.png)
As shown, the manipulation unavoidably changes some unrelated regions, since it is performed on the reference embeddings.

#### Manipulation options
| Options                   | Description                                                                                                                                                                                                       |
| :-----                    |:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Group index               | The index of the selected manipulation sequence's parameter group.                                                                                                                                                |
| Target prompt             | The prompt used to specify the desired visual attribute for the image after manipulation.                                                                                                                         |
| Anchor prompt             | The prompt used to specify the anchored visual attribute for the image before manipulation.                                                                                                                       |
| Control prompt            | Used for local manipulation (crossattn-based models). The prompt used to specify the target regions.                                                                                                              |
| Enhance                   | Specifies whether this manipulation should be enhanced (more likely to influence unrelated attributes).                                                                                                           |
| Target scale              | The scale used to progressively control the manipulation.                                                                                                                                                         |
| Thresholds                | Used for local manipulation (crossattn-based models). Four hyperparameters used to reduce the influence on irrelevant visual attributes, where 0.0 < threshold 0 < threshold 1 < threshold 2 < threshold 3 < 1.0; see the sketch after this table. |
| \<Threshold0              | Selects regions most related to the control prompt. Indicated by deep blue.                                                                                                                                       |
| Threshold0-Threshold1     | Selects regions related to the control prompt. Indicated by blue.                                                                                                                                                 |
| Threshold1-Threshold2     | Selects neighbouring but unrelated regions. Indicated by green.                                                                                                                                                   |
| Threshold2-Threshold3     | Selects unrelated regions. Indicated by orange.                                                                                                                                                                   |
| \>Threshold3              | Selects the most unrelated regions. Indicated by brown.                                                                                                                                                           |
| Add                       | Click to save the current manipulation in the sequence.                                                                                                                                                           |
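
To make the role of the four thresholds concrete, below is a minimal, hypothetical sketch of how per-token correlation scores could be bucketed into the five colored regions listed above. The function name and the way scores are obtained are assumptions for illustration only, not the repository's API:
```python
import numpy as np

# Hypothetical helper: buckets correlation scores (one per reference token)
# into the five regions shown in the visualization.
def bucket_correlations(scores: np.ndarray, thresholds=(0.5, 0.55, 0.65, 0.95)):
    t0, t1, t2, t3 = thresholds          # 0.0 < t0 < t1 < t2 < t3 < 1.0
    labels = np.empty(scores.shape, dtype=object)
    labels[scores < t0] = "most related (deep blue)"
    labels[(scores >= t0) & (scores < t1)] = "related (blue)"
    labels[(scores >= t1) & (scores < t2)] = "neighbouring but unrelated (green)"
    labels[(scores >= t2) & (scores < t3)] = "unrelated (orange)"
    labels[scores >= t3] = "most unrelated (brown)"
    return labels

print(bucket_correlations(np.array([0.3, 0.6, 0.97])))
```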


## Training
Our implementation is based on Accelerate and DeepSpeed.  
Before starting training, collect your data and organize the training dataset as follows:

```
[dataset_path]
β”œβ”€β”€ image_list.json    # Optional; used for image indexing
β”œβ”€β”€ color/             # Color images
β”‚   β”œβ”€β”€ 0001.zip        
|   |   β”œβ”€β”€ 10001.png
|   |   β”œβ”€β”€ 100001.jpg
β”‚   |   └── ...
β”‚   β”œβ”€β”€ 0002.zip
β”‚   └── ...
β”œβ”€β”€ sketch/            # Sketch images
β”‚   β”œβ”€β”€ 0001.zip
|   |   β”œβ”€β”€ 10001.png
|   |   β”œβ”€β”€ 100001.jpg
β”‚   |   └── ...
β”‚   β”œβ”€β”€ 0002.zip
β”‚   └── ...
└── mask/              # Mask images (required for fg-bg training)
    β”œβ”€β”€ 0001.zip
    |   β”œβ”€β”€ 10001.png
    |   β”œβ”€β”€ 100001.jpg
    |   └── ...
    β”œβ”€β”€ 0002.zip
    └── ...
```
For details of dataset organization, check `data/dataloader.py`.  
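As a rough illustration of what the layout above implies (zip shards whose member names match across `color/`, `sketch/`, and `mask/`), a hypothetical index builder might look like the sketch below; the `image_list.json` schema shown is an assumption, and the actual logic lives in `data/dataloader.py`:
```python
import json
import zipfile
from pathlib import Path

# Hypothetical sketch: enumerate every image inside the color/ zip shards and
# record (shard, member) pairs, assuming sketch/ and mask/ mirror the names.
def build_image_list(dataset_path: str, out_file: str = "image_list.json") -> None:
    root = Path(dataset_path)
    entries = []
    for shard in sorted((root / "color").glob("*.zip")):
        with zipfile.ZipFile(shard) as zf:
            for member in zf.namelist():
                if member.lower().endswith((".png", ".jpg", ".jpeg", ".webp")):
                    entries.append({"shard": shard.name, "file": member})
    (root / out_file).write_text(json.dumps(entries, indent=2))

# build_image_list("[dataset_path]")
```
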
Training command example:
```
accelerate launch --config_file [accelerate_config_file] \
    train.py \
    --name base \
    --dataroot [dataset_path] \
    --batch_size 64 \
    --num_threads 8 \
    -cfg configs/train/sd2.1/mult.yaml \
    -pt [pretrained_model_path]
```
Refer to `options.py` for training/inference/validation arguments.  
Note that `batch_size` here is the micro batch size per GPU. If you run the command above on 8 GPUs, the total batch size is 512.  
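As a quick sanity check on that arithmetic (assuming no gradient accumulation; if your Accelerate/DeepSpeed config accumulates gradients, multiply by that factor as well):
```python
def effective_batch_size(micro_batch: int, num_gpus: int, grad_accum: int = 1) -> int:
    # Global batch size = per-GPU micro batch * number of GPUs * accumulation steps.
    return micro_batch * num_gpus * grad_accum

assert effective_batch_size(64, 8) == 512  # matches the example above
```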


## Code reference
1. [Stable Diffusion v2](https://github.com/Stability-AI/stablediffusion)
2. [Stable Diffusion XL](https://github.com/Stability-AI/generative-models)
3. [SD-webui-ControlNet](https://github.com/Mikubill/sd-webui-controlnet)
4. [Stable-Diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui)
5. [K-diffusion](https://github.com/crowsonkb/k-diffusion)
6. [Deepspeed](https://github.com/microsoft/DeepSpeed)
7. [sketchKeras-PyTorch](https://github.com/higumax/sketchKeras-pytorch)

## Citation
```
@article{2024arXiv240101456Y,
       author = {{Yan}, Dingkun and {Yuan}, Liang and {Wu}, Erwin and {Nishioka}, Yuma and {Fujishiro}, Issei and {Saito}, Suguru},
        title = "{ColorizeDiffusion: Adjustable Sketch Colorization with Reference Image and Text}",
      journal = {arXiv e-prints},
         year = {2024},
          doi = {10.48550/arXiv.2401.01456},
}

@InProceedings{Yan_2025_WACV,
    author    = {Yan, Dingkun and Yuan, Liang and Wu, Erwin and Nishioka, Yuma and Fujishiro, Issei and Saito, Suguru},
    title     = {ColorizeDiffusion: Improving Reference-Based Sketch Colorization with Latent Diffusion Model},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    year      = {2025},
    pages     = {5092-5102}
}

@article{2025arXiv250219937Y,
    author = {{Yan}, Dingkun and {Wang}, Xinrui and {Li}, Zhuoru and {Saito}, Suguru and {Iwasawa}, Yusuke and {Matsuo}, Yutaka and {Guo}, Jiaxian},
    title = "{Image Referenced Sketch Colorization Based on Animation Creation Workflow}",
    journal = {arXiv e-prints},
    year = {2025},
    doi = {10.48550/arXiv.2502.19937},
}

@article{yan2025colorizediffusionv2enhancingreferencebased,
      title={ColorizeDiffusion v2: Enhancing Reference-based Sketch Colorization Through Separating Utilities}, 
      author={Dingkun Yan and Xinrui Wang and Yusuke Iwasawa and Yutaka Matsuo and Suguru Saito and Jiaxian Guo},
      year={2025},
      journal = {arXiv e-prints},
      doi = {10.48550/arXiv.2504.06895},
}
```