---
license: mit
base_model: Qwen/Qwen2.5-VL-3B-Instruct
datasets:
- Salesforce/grounding_dataset
tags:
- vision-language
- click-prediction
- gui-grounding
- diffusion-transformer
- flow-matching
pipeline_tag: image-to-text
---

# Qwen-Click-DiT

A vision-language model with a Diffusion Transformer action head for GUI click prediction. For more details, read the write-up [here](https://husseinlezzaik.com/tess/qwen-click-dit/).

## Model Description

This model predicts click coordinates given a screenshot and a natural language instruction. It uses:

- **Qwen2.5-VL-3B** as a frozen vision-language backbone
- a **DiT (Diffusion Transformer)** action head that generates the click coordinate with flow matching (see the sketch below)
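
Conceptually, the DiT head treats the click as a 2-D sample to be generated: starting from Gaussian noise, it integrates a learned velocity field conditioned on the backbone's features. The snippet below is a minimal sketch of that idea only; the class `ActionDiTSketch`, the `sample_click` helper, and the conditioning dimension are illustrative assumptions, not the code shipped in `src/model.py`.

```python
import torch
import torch.nn as nn

class ActionDiTSketch(nn.Module):
    """Toy DiT-style action head (illustrative only): predicts a velocity over
    the 2-D click coordinate, conditioned on pooled VLM features and a time t."""

    def __init__(self, cond_dim=2048, hidden=512, num_layers=6, num_heads=8):
        super().__init__()
        self.in_proj = nn.Linear(2, hidden)      # noisy (x, y) -> hidden
        self.cond_proj = nn.Linear(cond_dim, hidden)
        self.time_proj = nn.Linear(1, hidden)
        layer = nn.TransformerEncoderLayer(
            hidden, num_heads, dim_feedforward=4 * hidden, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers)
        self.out_proj = nn.Linear(hidden, 2)     # hidden -> velocity over (x, y)

    def forward(self, xy_t, cond, t):
        # A tiny 3-token sequence: coordinate token, condition token, time token.
        tokens = torch.stack(
            [self.in_proj(xy_t), self.cond_proj(cond), self.time_proj(t)], dim=1
        )
        h = self.blocks(tokens)
        return self.out_proj(h[:, 0])            # read the velocity off the coordinate token

@torch.no_grad()
def sample_click(head, cond, num_steps=16):
    """Euler integration of the learned velocity field from noise to a click."""
    xy = torch.randn(cond.shape[0], 2, device=cond.device)            # x_0 ~ N(0, I)
    for i in range(num_steps):
        t = torch.full((cond.shape[0], 1), i / num_steps, device=cond.device)
        xy = xy + head(xy, cond, t) / num_steps                       # x <- x + v(x, t) * dt
    return xy.clamp(0, 1)                                             # normalized screen coordinates
```

With `num_steps=16` (matching `num_inference_steps` in the released config), a prediction costs 16 forward passes through the small head on top of a single backbone pass.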

## Quick Start

### Installation

```bash
pip install torch transformers accelerate qwen-vl-utils pillow
git clone https://github.com/HusseinLezzaik/Qwen-Click-DiT.git
cd Qwen-Click-DiT
```

### Inference

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoConfig
from qwen_vl_utils import process_vision_info

# Clone the repo first to get the model class
from src.model import Qwen2_5_VLForClickPrediction

# Load the base config and attach the DiT action-head hyperparameters
model_id = "TESS-Computer/qwen-click-dit"
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
config.dit_hidden_size = 512
config.dit_num_layers = 6
config.dit_num_heads = 8
config.dit_dropout = 0.1
config.num_inference_steps = 16

model = Qwen2_5_VLForClickPrediction.from_pretrained(
    model_id, config=config, torch_dtype=torch.bfloat16
)
model = model.to("cuda").eval()
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

# Prepare the input: one screenshot plus a natural language instruction
image = Image.open("screenshot.png").convert("RGB")
prompt = "Click on the search button"

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image, "min_pixels": 200704, "max_pixels": 401408},
        {"type": "text", "text": prompt},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(text=text, images=image_inputs, videos=video_inputs, return_tensors="pt", **video_kwargs)
inputs = {k: v.to("cuda") if torch.is_tensor(v) else v for k, v in inputs.items()}

# Predict click coordinates (normalized to [0, 1])
with torch.no_grad():
    click_xy = model.predict(**inputs)

x, y = click_xy[0].cpu().tolist()
print(f"Normalized: ({x:.4f}, {y:.4f})")
print(f"Pixels: ({int(x * image.width)}, {int(y * image.height)})")
```
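
To turn the prediction into an actual click on the local desktop, one option (not part of this repo) is `pyautogui`. The snippet below assumes the screenshot was a full-screen capture so the normalized coordinates can be mapped to the current screen resolution.

```python
# Optional follow-up (illustrative, not part of the released code): perform the
# predicted click locally with pyautogui, mapping normalized coordinates to the
# current screen resolution. Assumes screenshot.png was a full-screen capture.
import pyautogui

screen_w, screen_h = pyautogui.size()
pyautogui.click(int(x * screen_w), int(y * screen_h))
```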

See the [GitHub repo](https://github.com/HusseinLezzaik/Qwen-Click-DiT) for more examples.

## Training

- **Dataset**: [Salesforce/grounding_dataset](https://huggingface.co/datasets/Salesforce/grounding_dataset)
- **Samples**: 20,000
- **Epochs**: 3
- **Base model**: [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
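
The DiT head is optimized with a flow-matching objective. The function below is a schematic of that loss for intuition only, reusing the illustrative `head(xy_t, cond, t)` interface from the sketch above; it is not the training code in the repo.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(head, cond, target_xy):
    """Schematic flow-matching loss (illustrative): train the head to predict
    the constant velocity that carries noise to the ground-truth click along
    a straight path."""
    noise = torch.randn_like(target_xy)                                 # x_0 ~ N(0, I)
    t = torch.rand(target_xy.shape[0], 1, device=target_xy.device)      # t ~ U(0, 1)
    x_t = (1 - t) * noise + t * target_xy                               # point on the straight path at time t
    v_target = target_xy - noise                                        # straight-line velocity target
    return F.mse_loss(head(x_t, cond, t), v_target)
```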

## Architecture

| Component | Value |
|-----------|-------|
| DiT hidden size | 512 |
| DiT layers | 6 |
| DiT attention heads | 8 |
| Inference steps | 16 |

## Citation

```bibtex
@misc{lezzaik2026qwenclickdit,
  title        = {Qwen-Click-DiT: Vision-Language Model with Diffusion Transformer for GUI Click Prediction},
  author       = {Lezzaik, Hussein},
  year         = {2026},
  howpublished = {\url{https://github.com/HusseinLezzaik/Qwen-Click-DiT}},
}
```

## License

MIT