---
license: mit
base_model: Qwen/Qwen2.5-VL-3B-Instruct
datasets:
- Salesforce/grounding_dataset
tags:
- vision-language
- click-prediction
- gui-grounding
- diffusion-transformer
- flow-matching
pipeline_tag: image-to-text
---

# Qwen-Click-DiT

Vision-Language Model with Diffusion Transformer for GUI Click Prediction. For more details, read the [write-up](https://husseinlezzaik.com/tess/qwen-click-dit/).

## Model Description

This model predicts click coordinates given a screenshot and a natural language instruction. It uses:

- **Qwen2.5-VL-3B** as a frozen vision-language backbone
- **DiT (Diffusion Transformer)** action head trained with flow matching

An illustrative sketch of the flow-matching sampling loop is included in the appendix at the bottom of this card.

## Quick Start

### Installation

```bash
pip install torch transformers accelerate qwen-vl-utils pillow
git clone https://github.com/HusseinLezzaik/Qwen-Click-DiT.git
cd Qwen-Click-DiT
```

### Inference

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoConfig
from qwen_vl_utils import process_vision_info

# Clone the repo first to get the model class
from src.model import Qwen2_5_VLForClickPrediction

# Load model
model_id = "TESS-Computer/qwen-click-dit"
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

# DiT action-head hyperparameters (must match the Architecture table below)
config.dit_hidden_size = 512
config.dit_num_layers = 6
config.dit_num_heads = 8
config.dit_dropout = 0.1
config.num_inference_steps = 16

model = Qwen2_5_VLForClickPrediction.from_pretrained(
    model_id, config=config, torch_dtype=torch.bfloat16
)
model = model.to("cuda").eval()
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

# Prepare input
image = Image.open("screenshot.png").convert("RGB")
prompt = "Click on the search button"
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image, "min_pixels": 200704, "max_pixels": 401408},
        {"type": "text", "text": prompt},
    ],
}]

# Build the chat prompt and preprocess the image
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(text=text, images=image_inputs, videos=video_inputs, return_tensors="pt", **video_kwargs)
inputs = {k: v.to("cuda") if torch.is_tensor(v) else v for k, v in inputs.items()}

# Predict click coordinates (normalized to [0, 1])
with torch.no_grad():
    click_xy = model.predict(**inputs)

x, y = click_xy[0].cpu().tolist()
print(f"Normalized: ({x:.4f}, {y:.4f})")
print(f"Pixels: ({int(x * image.width)}, {int(y * image.height)})")
```

See the [GitHub repo](https://github.com/HusseinLezzaik/Qwen-Click-DiT) for more examples.

## Training

- **Dataset**: [Salesforce/grounding_dataset](https://huggingface.co/datasets/Salesforce/grounding_dataset)
- **Samples**: 20,000
- **Epochs**: 3
- **Base Model**: [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)

## Architecture

| Component | Value |
|-----------|-------|
| DiT Hidden Size | 512 |
| DiT Layers | 6 |
| DiT Heads | 8 |
| Inference Steps | 16 |

## Citation

```bibtex
@misc{lezzaik2026qwenclickdit,
  title        = {Qwen-Click-DiT: Vision-Language Model with Diffusion Transformer for GUI Click Prediction},
  author       = {Lezzaik, Hussein},
  year         = {2026},
  howpublished = {\url{https://github.com/HusseinLezzaik/Qwen-Click-DiT}},
}
```

## License

MIT
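
## Appendix: Flow-Matching Sampling (Illustrative Sketch)

The card above describes the action head only at a high level, so here is a minimal, self-contained sketch of what flow-matching sampling with a small velocity network looks like. This is **not** the code from `src/model.py`: `ToyVelocityNet`, `euler_sample`, the pooled `cond` vector, and the final sigmoid squashing are illustrative assumptions. Only the 2D click output and the 16-step sampling loop mirror the configuration documented above.

```python
# Illustrative sketch only: not the actual DiT head from src/model.py.
# The real head is a Diffusion Transformer conditioned on frozen Qwen2.5-VL features;
# here a tiny MLP stands in for it so the sampling loop is easy to follow.
import torch
import torch.nn as nn


class ToyVelocityNet(nn.Module):
    """Stand-in for the action head: predicts a velocity for a noisy 2D click."""

    def __init__(self, cond_dim: int = 512, hidden: int = 512):
        super().__init__()
        # Input: noisy click (2) + flow time (1) + pooled VLM condition (cond_dim).
        self.net = nn.Sequential(
            nn.Linear(2 + 1 + cond_dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, 2),  # velocity d(x, y)/dt
        )

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, t, cond], dim=-1))


@torch.no_grad()
def euler_sample(velocity_net, cond, num_steps: int = 16):
    """Integrate the learned flow from Gaussian noise (t=0) toward a click (t=1)."""
    batch = cond.shape[0]
    x = torch.randn(batch, 2)                  # start from noise
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = torch.full((batch, 1), step * dt)  # current flow time in [0, 1)
        x = x + dt * velocity_net(x, t, cond)  # one Euler step along the flow
    return x.sigmoid()                         # squash into [0, 1] x [0, 1] (assumption)


# Usage: 'cond' stands in for a pooled hidden state from the frozen VLM backbone.
net = ToyVelocityNet()
cond = torch.randn(1, 512)
click_xy = euler_sample(net, cond, num_steps=16)  # matches num_inference_steps = 16
print(click_xy)  # random point in [0, 1]^2 for this untrained sketch
```

In standard flow matching, such a velocity network is trained by regressing onto the velocity of a simple path (for example, the straight line between a noise sample and the ground-truth click), which is why inference reduces to the short ODE integration shown here.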