	---
	license: apache-2.0
	library_name: transformers
	pipeline_tag: text-to-image
	---

	## ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning

	[![Paper (arXiv)](https://img.shields.io/badge/Paper-ControlThinker-d32f2f.svg?logo=arXiv)](https://arxiv.org/abs/2506.03596) [![Hugging Face Model](https://img.shields.io/badge/%F0%9F%A4%97%20HF%20-Model-yellow)](https://huggingface.co/maplebb/ControlThinker) [![Hugging Face Paper](https://img.shields.io/badge/Paper-HF-blue)](https://huggingface.co/papers/2506.03596) [GitHub Repository](https://github.com/maplebb/controlthinker)

	ControlThinker is a framework for controllable image generation that follows a "comprehend-then-generate" paradigm. To narrow the semantic gap between the input text prompt and the target image, it uses a Multimodal Large Language Model (MLLM) to reason over the control image and extract its latent semantics. These semantics are then used to enrich the prompt, which improves the visual quality and semantic consistency of the generated images.

	The model was presented in the paper [ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning](https://huggingface.co/papers/2506.03596).

	<p align="center"><img src="https://github.com/maplebb/controlthinker/raw/main/asset/image/teaser.png" width="95%"></p>

	## Usage

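	You can use ControlThinker for image generation. The sample below generates a 768x768 image from a text prompt. Note that `inference_solver` is a standalone module rather than part of `transformers`; it is presumably provided by the GitHub repository linked above, so run the sample from a checkout of that repository.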

	```python
	from inference_solver import FlexARInferenceSolver
	from PIL import Image

	# ****************** Image Generation ******************
	inference_solver = FlexARInferenceSolver(
	    model_path="maplebb/ControlThinker",
	    precision="bf16",
	    target_size=768,
	)

	# The 768x768 resolution requested in the prompt matches target_size above.
	q1 = (
	    "Generate an image of 768x768 according to the following prompt:\n"
	    "Image of a dog playing water, and a waterfall is in the background."
	)

	# generate() returns a tuple: (generated response, list of generated images)
	generated = inference_solver.generate(
	    images=[],
	    qas=[[q1, None]],
	    max_gen_len=8192,
	    temperature=1.0,
	    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
	)

	# a1 is the generated response; new_image is the first generated image
	a1, new_image = generated[0], generated[1][0]

	# You can save and display the generated image
	new_image.save("generated_dog.png")
	new_image.show()
	```
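
The prompt in the sample above follows a simple template: a resolution request, a newline, then the image description. A minimal sketch of a helper that builds prompts in that shape is shown below. `build_prompt` and its parameters are hypothetical names introduced for illustration and are not part of the ControlThinker code. This card does not state which resolutions the model supports, so the sketch assumes `size` should match the solver's `target_size`.

```python
def build_prompt(description: str, size: int = 768) -> str:
    """Format a text-to-image request in the same shape as the sample prompt.

    Hypothetical helper for illustration; not part of the ControlThinker code.
    """
    if size <= 0:
        raise ValueError("size must be a positive integer")
    # Resolution request on the first line, image description on the second.
    return (
        f"Generate an image of {size}x{size} according to the following prompt:\n"
        f"{description.strip()}"
    )


q1 = build_prompt("Image of a dog playing water, and a waterfall is in the background.")
print(q1)
```

The resulting string can be passed as the question in `qas=[[q1, None]]`, exactly as in the sample.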

	## License

	ControlThinker is licensed under the Apache License 2.0.

	## ✍️ Citation

	```bibtex
	@article{han2025controlthinker,
	  title={ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning},
	  author={Han, Feng and Jiao, Yang and Chen, Shaoxiang and Xu, Junhao and Chen, Jingjing and Jiang, Yu-Gang},
	  journal={arXiv preprint arXiv:2506.03596},
	  year={2025}
	}
	```