---
license: mit
base_model: Qwen/Qwen2.5-VL-3B-Instruct
datasets:
- Salesforce/grounding_dataset
tags:
- vision-language
- click-prediction
- gui-grounding
- diffusion-transformer
- flow-matching
pipeline_tag: image-to-text
---
# Qwen-Click-DiT
A vision-language model with a Diffusion Transformer (DiT) action head for GUI click prediction. For more details, see the [project write-up](https://husseinlezzaik.com/tess/qwen-click-dit/).
## Model Description
This model predicts click coordinates from a screenshot and a natural language instruction. It uses:
- **Qwen2.5-VL-3B** as a frozen vision-language backbone
- **DiT (Diffusion Transformer)** action head trained with flow matching (see the sketch below)
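The exact head implementation lives in the GitHub repo; the snippet below is only a minimal, self-contained sketch of how a flow-matching head can denoise a 2-D click coordinate with fixed-step Euler integration. `velocity_fn` and the other names here are illustrative assumptions, not the repo's actual API.
```python
import torch

def flow_matching_sample(velocity_fn, context, num_steps: int = 16) -> torch.Tensor:
    """Minimal flow-matching sampler for a 2-D click coordinate (illustrative only).

    velocity_fn(x_t, t, context) stands in for the DiT head: it predicts the
    velocity that moves the noisy coordinate x_t toward the target click.
    """
    x_t = torch.randn(1, 2)                  # start from Gaussian noise in (x, y) space
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = torch.full((1,), step * dt)      # current "time" in [0, 1)
        v = velocity_fn(x_t, t, context)     # predicted velocity field
        x_t = x_t + v * dt                   # Euler integration step
    return x_t.clamp(0.0, 1.0)               # normalized (x, y) screen position
```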
## Quick Start
### Installation
```bash
pip install torch transformers accelerate qwen-vl-utils pillow
git clone https://github.com/HusseinLezzaik/Qwen-Click-DiT.git
cd Qwen-Click-DiT
```
### Inference
```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoConfig
from qwen_vl_utils import process_vision_info
# Clone the repo first to get the model class
from src.model import Qwen2_5_VLForClickPrediction
# Load model
model_id = "TESS-Computer/qwen-click-dit"
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
config.dit_hidden_size = 512
config.dit_num_layers = 6
config.dit_num_heads = 8
config.dit_dropout = 0.1
config.num_inference_steps = 16
model = Qwen2_5_VLForClickPrediction.from_pretrained(
model_id, config=config, torch_dtype=torch.bfloat16
)
model = model.to("cuda").eval()
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
# Prepare input
image = Image.open("screenshot.png").convert("RGB")
prompt = "Click on the search button"
messages = [{
"role": "user",
"content": [
{"type": "image", "image": image, "min_pixels": 200704, "max_pixels": 401408},
{"type": "text", "text": prompt},
],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(text=text, images=image_inputs, videos=video_inputs, return_tensors="pt", **video_kwargs)
inputs = {k: v.to("cuda") if torch.is_tensor(v) else v for k, v in inputs.items()}
# Predict click coordinates
with torch.no_grad():
click_xy = model.predict(**inputs)
x, y = click_xy[0].cpu().tolist()
print(f"Normalized: ({x:.4f}, {y:.4f})")
print(f"Pixels: ({int(x * image.width)}, {int(y * image.height)})")
```
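The head returns coordinates normalized to [0, 1] over the input image, so they remain valid at any resolution. As a purely illustrative follow-up (not part of this repo; `pyautogui` is an extra dependency you would install yourself), the prediction can drive an actual click when the screenshot was captured from the screen you are controlling:
```python
import pyautogui  # extra dependency: pip install pyautogui

# `x` and `y` are the normalized coordinates produced above; this assumes
# screenshot.png was captured from the screen being clicked.
screen_w, screen_h = pyautogui.size()                  # physical screen resolution
pyautogui.click(int(x * screen_w), int(y * screen_h))  # move the cursor and click
```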
See [GitHub repo](https://github.com/HusseinLezzaik/Qwen-Click-DiT) for more examples.
## Training
- **Dataset**: [Salesforce/grounding_dataset](https://huggingface.co/datasets/Salesforce/grounding_dataset)
- **Samples**: 20,000
- **Epochs**: 3
- **Base Model**: [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
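The full training loop lives in the GitHub repo. For intuition, a flow-matching objective for the action head regresses the velocity along a straight path between noise and the ground-truth click; the sketch below is a simplified illustration with placeholder names, not the repo's actual code.
```python
import torch
import torch.nn.functional as F

def flow_matching_loss(dit_head, context, target_xy: torch.Tensor) -> torch.Tensor:
    """Illustrative flow-matching loss for (batch, 2) normalized click targets."""
    x0 = torch.randn_like(target_xy)                              # noise sample
    t = torch.rand(target_xy.shape[0], device=target_xy.device)   # random time in [0, 1]
    # Linear interpolation between noise and target defines the probability path.
    x_t = (1 - t[:, None]) * x0 + t[:, None] * target_xy
    v_target = target_xy - x0                 # constant velocity of the straight path
    v_pred = dit_head(x_t, t, context)        # stand-in for the DiT head's output
    return F.mse_loss(v_pred, v_target)
```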
## Architecture
| Component | Value |
|-----------|-------|
| DiT Hidden Size | 512 |
| DiT Layers | 6 |
| DiT Heads | 8 |
| Inference Steps | 16 |
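For a rough sense of scale, these values describe a small transformer. The snippet below is only an illustrative shape-check using a vanilla `nn.TransformerEncoder`; the actual DiT blocks (timestep conditioning, etc.) differ in detail.
```python
import torch.nn as nn

# Illustrative stand-in for the action head's size, not the repo's module.
dit_head_sketch = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(
        d_model=512,              # DiT hidden size
        nhead=8,                  # DiT attention heads
        dim_feedforward=4 * 512,  # assumed 4x feed-forward expansion
        dropout=0.1,              # matches dit_dropout in the Quick Start config
        batch_first=True,
    ),
    num_layers=6,                 # DiT layers
)
print(sum(p.numel() for p in dit_head_sketch.parameters()))  # ~19M parameters under these assumptions
```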
## Citation
```bibtex
@misc{lezzaik2026qwenclickdit,
title = {Qwen-Click-DiT: Vision-Language Model with Diffusion Transformer for GUI Click Prediction},
author = {Lezzaik, Hussein},
year = {2026},
howpublished = {\url{https://github.com/HusseinLezzaik/Qwen-Click-DiT}},
}
```
## License
MIT