# Qwen3-VL-8B WebSight Fine-tuned
This model is a fine-tuned version of Qwen/Qwen3-VL-8B-Instruct on the WebSight dataset for GUI automation tasks.
## Model Description
- Base Model: Qwen/Qwen3-VL-8B-Instruct
- Fine-tuning Method: LoRA (merged)
- Dataset: wave-ui/websight-v2
- Task: Image-to-click location prediction
- Output Format: `pyautogui.click(x, y)` commands
## Usage
```python
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch

# Load model and processor
model = AutoModelForVision2Seq.from_pretrained(
    "Asanshay/qwen3-vl-8b-websight-merged",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "Asanshay/qwen3-vl-8b-websight-merged",
    trust_remote_code=True,
)

# Prepare input
image = Image.open("screenshot.png")
prompt = "click the login button"
inputs = processor(
    text=f"<image>\n{prompt}",
    images=image,
    return_tensors="pt",
).to(model.device)

# Generate
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50)

# Decode only the newly generated tokens so the prompt is not echoed back
result = processor.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(result)  # Output: pyautogui.click(x, y)
```
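The generated string can be parsed back into numeric coordinates before dispatching a real click. Below is a minimal sketch, assuming the model emits exactly one `pyautogui.click(x, y)` command; the `parse_click` helper and its regex are illustrative, not part of the model's API:

```python
import re

def parse_click(result: str):
    """Extract (x, y) from a 'pyautogui.click(x, y)' string, or None if absent."""
    # Assumes integer coordinates, as in the card's examples
    match = re.search(r"pyautogui\.click\((\d+),\s*(\d+)\)", result)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

coords = parse_click(result)
print(coords)  # e.g. (565, 486)
```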
## Training Details
- Training Framework: LLaMA-Factory
- Hardware: 8x H100 GPUs
- LoRA Config (a `peft` equivalent is sketched after this list):
  - Rank: 64
  - Alpha: 128
  - Dropout: 0.05
  - Target modules: all linear layers
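Training used LLaMA-Factory, so the original config file is not reproduced here. The sketch below is only a hedged translation of the listed hyperparameters into an equivalent `peft.LoraConfig`; it uses peft's `"all-linear"` shorthand to match "all linear layers" and is not copied from the actual training run:

```python
from peft import LoraConfig

# Assumed peft equivalent of the card's LoRA settings (not the actual
# LLaMA-Factory configuration used for training).
lora_config = LoraConfig(
    r=64,                        # Rank: 64
    lora_alpha=128,              # Alpha: 128
    lora_dropout=0.05,           # Dropout: 0.05
    target_modules="all-linear", # "all linear layers"
    task_type="CAUSAL_LM",
)
```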
## Output Format
The model outputs click coordinates normalized to a 1400x800 resolution:

- Format: `pyautogui.click(x, y)`
- Example: `pyautogui.click(565, 486)`
Scale to your screen resolution:

```python
x_actual = int(x_norm * (screen_width / 1400))
y_actual = int(y_norm * (screen_height / 800))
```
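Putting the pieces together, here is a minimal end-to-end sketch that rescales a predicted click and executes it. It assumes `pyautogui` is installed and reuses the illustrative `parse_click` helper from the Usage section; `click_from_model_output` is likewise a hypothetical name, not part of this repository:

```python
import pyautogui

MODEL_WIDTH, MODEL_HEIGHT = 1400, 800  # resolution the coordinates are normalized to

def click_from_model_output(result: str) -> None:
    """Rescale a predicted click from model space to the real screen and execute it."""
    coords = parse_click(result)  # helper sketched in the Usage section above
    if coords is None:
        raise ValueError(f"No click command found in model output: {result!r}")
    x_norm, y_norm = coords
    screen_width, screen_height = pyautogui.size()
    x_actual = int(x_norm * (screen_width / MODEL_WIDTH))
    y_actual = int(y_norm * (screen_height / MODEL_HEIGHT))
    pyautogui.click(x_actual, y_actual)

click_from_model_output("pyautogui.click(565, 486)")
```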
## Citation
```bibtex
@misc{qwen3-vl-websight,
  title={Qwen3-VL Fine-tuned for GUI Automation},
  author={Your Name},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/Asanshay/qwen3-vl-8b-websight-merged}}
}
```
## License
Apache 2.0 (inherited from base model)