# Qwen3-VL-8B WebSight Fine-tuned
This model is a fine-tuned version of Qwen/Qwen3-VL-8B-Instruct on the WebSight dataset for GUI automation tasks.
## Model Description
- Base Model: Qwen/Qwen3-VL-8B-Instruct
- Fine-tuning Method: LoRA (merged)
- Dataset: wave-ui/websight-v2
- Task: Image-to-click location prediction
- Output Format: `pyautogui.click(x, y)` commands
## Usage
```python
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch

# Load model and processor
model = AutoModelForVision2Seq.from_pretrained(
    "Asanshay/qwen3-vl-8b-websight-merged",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "Asanshay/qwen3-vl-8b-websight-merged",
    trust_remote_code=True,
)

# Prepare input
image = Image.open("screenshot.png")
prompt = "click the login button"
inputs = processor(
    text=f"<image>\n{prompt}",
    images=image,
    return_tensors="pt",
).to(model.device)

# Generate
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50)

# Decode only the newly generated tokens so the prompt is not echoed back
result = processor.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(result)  # Output: pyautogui.click(x, y)
```
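The generated string can be parsed back into numeric coordinates before dispatching a real click. Below is a minimal sketch, assuming the model emits exactly one `pyautogui.click(x, y)` command; the `parse_click` helper and its regex are illustrative, not part of the model's API:

```python
import re

def parse_click(result: str):
    """Extract (x, y) from a 'pyautogui.click(x, y)' string, or None if absent."""
    # Assumes integer coordinates, as in the card's examples
    match = re.search(r"pyautogui\.click\((\d+),\s*(\d+)\)", result)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

coords = parse_click(result)
print(coords)  # e.g. (565, 486)
```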
## Training Details
- Training Framework: LLaMA-Factory
- Hardware: 8x H100 GPUs
- LoRA Config (a `peft` equivalent is sketched after this list):
  - Rank: 64
  - Alpha: 128
  - Dropout: 0.05
  - Target modules: all linear layers
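Training used LLaMA-Factory, so the original config file is not reproduced here. The sketch below is only a hedged translation of the listed hyperparameters into an equivalent `peft.LoraConfig`; it uses peft's `"all-linear"` shorthand to match "all linear layers" and is not copied from the actual training run:

```python
from peft import LoraConfig

# Assumed peft equivalent of the card's LoRA settings (not the actual
# LLaMA-Factory configuration used for training).
lora_config = LoraConfig(
    r=64,                        # Rank: 64
    lora_alpha=128,              # Alpha: 128
    lora_dropout=0.05,           # Dropout: 0.05
    target_modules="all-linear", # "all linear layers"
    task_type="CAUSAL_LM",
)
```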
## Output Format
The model outputs click coordinates normalized to a 1400x800 resolution:

- Format: `pyautogui.click(x, y)`
- Example: `pyautogui.click(565, 486)`
Scale to your screen resolution:

```python
x_actual = int(x_norm * (screen_width / 1400))
y_actual = int(y_norm * (screen_height / 800))
```
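Putting the pieces together, here is a minimal end-to-end sketch that rescales a predicted click and executes it. It assumes `pyautogui` is installed and reuses the illustrative `parse_click` helper from the Usage section; `click_from_model_output` is likewise a hypothetical name, not part of this repository:

```python
import pyautogui

MODEL_WIDTH, MODEL_HEIGHT = 1400, 800  # resolution the coordinates are normalized to

def click_from_model_output(result: str) -> None:
    """Rescale a predicted click from model space to the real screen and execute it."""
    coords = parse_click(result)  # helper sketched in the Usage section above
    if coords is None:
        raise ValueError(f"No click command found in model output: {result!r}")
    x_norm, y_norm = coords
    screen_width, screen_height = pyautogui.size()
    x_actual = int(x_norm * (screen_width / MODEL_WIDTH))
    y_actual = int(y_norm * (screen_height / MODEL_HEIGHT))
    pyautogui.click(x_actual, y_actual)

click_from_model_output("pyautogui.click(565, 486)")
```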
## Citation
```bibtex
@misc{qwen3-vl-websight,
  title={Qwen3-VL Fine-tuned for GUI Automation},
  author={Your Name},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/Asanshay/qwen3-vl-8b-websight-merged}}
}
```
## License
Apache 2.0 (inherited from base model)