---
license: apache-2.0
library_name: transformers
pipeline_tag: text-to-image
---

## ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning

[arXiv](https://arxiv.org/abs/2506.03596) [Model](https://huggingface.co/maplebb/ControlThinker) [Paper page](https://huggingface.co/papers/2506.03596) [GitHub Repository](https://github.com/maplebb/controlthinker)

ControlThinker is a framework for controllable image generation that follows a "comprehend-then-generate" paradigm based on visual reasoning. It addresses the semantic gap between the input text prompt and the target image by using a Multimodal Large Language Model (MLLM) to extract latent semantics from the control image. These semantics enrich the text prompt, significantly enhancing the visual quality and semantic consistency of the generated images.

The model was presented in the paper [ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning](https://huggingface.co/papers/2506.03596).
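
The "comprehend-then-generate" flow described above can be pictured as two stages: a reasoning step that turns the control image into extra text, followed by generation from the enriched prompt. The sketch below is purely illustrative and is not the project's API: `ComprehendThenGenerate`, `Reasoner`, `Generator`, and the toy functions are hypothetical names, and passing the control image to both stages is an assumption.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-ins: neither type is part of the ControlThinker codebase.
Reasoner = Callable[[str, str], str]   # (prompt, control image path) -> inferred semantics
Generator = Callable[[str, str], str]  # (enriched prompt, control image path) -> result


@dataclass
class ComprehendThenGenerate:
    reasoner: Reasoner
    generator: Generator

    def enrich(self, prompt: str, control_image: str) -> str:
        """Stage 1: append the semantics inferred from the control image."""
        semantics = self.reasoner(prompt, control_image).strip()
        return f"{prompt} {semantics}" if semantics else prompt

    def run(self, prompt: str, control_image: str) -> str:
        """Stage 2: generate from the enriched prompt (assumed to reuse the control image)."""
        return self.generator(self.enrich(prompt, control_image), control_image)


# Toy stand-ins so the sketch runs without loading any model.
def toy_reasoner(prompt: str, control_image: str) -> str:
    return "The edges suggest a dog standing in shallow water."


def toy_generator(prompt: str, control_image: str) -> str:
    return f"image generated from: {prompt}"


pipeline = ComprehendThenGenerate(toy_reasoner, toy_generator)
result = pipeline.run("A dog near a waterfall.", "edges.png")
print(result)
```

In a real setup the two callables would wrap the MLLM and the image generator; here they only return strings so the control flow is easy to follow.
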

<p align="center"><img src="https://github.com/maplebb/controlthinker/raw/main/asset/image/teaser.png" width="95%"></p>

## Usage

The following example shows how to generate an image from a text prompt with ControlThinker.

```python
# `inference_solver` is assumed to be the module shipped in the GitHub repository
# linked above, so run this from a checkout of that repository.
from inference_solver import FlexARInferenceSolver
from PIL import Image

# ******************** Image Generation ********************
inference_solver = FlexARInferenceSolver(
    model_path="maplebb/ControlThinker",
    precision="bf16",
    target_size=768,
)

# The instruction line and the prompt itself are separated by a newline.
q1 = f"Generate an image of 768x768 according to the following prompt:\n" \
     f"Image of a dog playing water, and a waterfall is in the background."

# generated: tuple of (generated response, list of generated images)
generated = inference_solver.generate(
    images=[],
    qas=[[q1, None]],
    max_gen_len=8192,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
)

a1, new_image = generated[0], generated[1][0]

# You can save and display the generated image
new_image.save("generated_dog.png")
new_image.show()
```
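
The question string in the example above is a fixed instruction line followed by the prompt. A small helper can build it for other prompts. This is a minimal sketch: `build_generation_question` is a hypothetical name and not part of the repository, and the template is assumed to use a newline between the instruction and the prompt. Whether the model accepts sizes other than 768x768 is not stated here.

```python
def build_generation_question(prompt: str, width: int = 768, height: int = 768) -> str:
    """Build the instruction string for a text-to-image request.

    Hypothetical helper: it only mirrors the template used in the usage example.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    prompt = prompt.strip()
    if not prompt:
        raise ValueError("prompt must not be empty")
    return (
        f"Generate an image of {width}x{height} according to the following prompt:\n"
        f"{prompt}"
    )


q1 = build_generation_question(
    "Image of a dog playing water, and a waterfall is in the background."
)
print(q1)
```

The resulting `q1` can be passed as `qas=[[q1, None]]` in the same way as the hand-written string.
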

## License

ControlThinker is licensed under the Apache License 2.0.

## ✍️ Citation

```bibtex
@article{han2025controlthinker,
  title={ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning},
  author={Han, Feng and Jiao, Yang and Chen, Shaoxiang and Xu, Junhao and Chen, Jingjing and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2506.03596},
  year={2025}
}
```