---
license: apache-2.0
language:
  - en
tags:
  - computer-use
  - gui-agent
  - vision-language-model
  - screen-understanding
  - vla
datasets:
  - TESS-Computer/tess-agentnet
base_model: HuggingFaceTB/SmolVLM2-500M-Instruct
pipeline_tag: image-text-to-text
---

# TESS-500M

TESS is a Vision-Language-Action (VLA) model for computer use, inspired by robotic VLAs. Given a screenshot and a natural-language instruction, it predicts either a mouse action (click type and coordinates) or a keyboard action (typing or shortcuts).

## Model Description

- Base Model: SmolVLM2-500M-Instruct
- Architecture: SmolVLM + Router + Mouse/Keyboard heads
- Parameters: 508M total, 48M trainable
- Training Data: tess-agentnet (~312K samples)
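
With a checkpoint loaded (see Usage below), the total/trainable split can be checked by counting parameters; this sketch assumes `load_model` returns a standard `torch.nn.Module`:

```python
# Count all parameters vs. those left unfrozen for training
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{total / 1e6:.0f}M total, {trainable / 1e6:.0f}M trainable")  # expect ~508M / ~48M
```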

## Usage

```python
import torch
from PIL import Image

# Clone the TESS repo first:
#   git clone https://github.com/husseinlezzaik/TESS.git
#   cd TESS/model

from test_checkpoint import load_model, predict

# Load model
model, processor = load_model("path/to/checkpoint.pt", device="cuda")

# Run inference
image = Image.open("screenshot.png")
result = predict(model, processor, image, "Click the search button")

print(result)
# Mouse action:    {'action_type': 'mouse', 'xy': array([0.45, 0.32]), 'click_type': 'LEFT_CLICK'}
# Keyboard action: {'action_type': 'keyboard', 'action': 'type', 'value': 'hello world'}
```

## Output Format

Mouse actions:

```python
{
    'action_type': 'mouse',
    'xy': [x, y],  # Normalized coordinates (0-1)
    'click_type': 'LEFT_CLICK' | 'RIGHT_CLICK' | 'DOUBLE_CLICK' | ...
}
```

Keyboard actions:

```python
{
    'action_type': 'keyboard',
    'action': 'type' | 'press' | 'hotkey',
    'value': 'text to type' | '<ENTER>' | '<SUPER+C>'
}
```
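
Since `xy` is normalized, the coordinates must be scaled to the actual screen resolution before an action can be executed. A minimal executor sketch, assuming `pyautogui` as the automation backend and assuming the bracketed key tokens map onto `pyautogui` key names after stripping the brackets (both are assumptions, not part of the released code):

```python
import pyautogui

def execute(result):
    """Dispatch a TESS prediction to the local desktop (illustrative sketch)."""
    if result['action_type'] == 'mouse':
        width, height = pyautogui.size()           # current screen resolution
        x, y = result['xy']                        # normalized (0-1)
        px, py = int(x * width), int(y * height)
        if result['click_type'] == 'LEFT_CLICK':
            pyautogui.click(px, py)
        elif result['click_type'] == 'RIGHT_CLICK':
            pyautogui.rightClick(px, py)
        elif result['click_type'] == 'DOUBLE_CLICK':
            pyautogui.doubleClick(px, py)
    else:  # keyboard
        if result['action'] == 'type':
            pyautogui.write(result['value'])
        elif result['action'] == 'press':
            # Assumes tokens like '<ENTER>' become pyautogui key names
            # after stripping brackets; a lookup table may be needed.
            pyautogui.press(result['value'].strip('<>').lower())
        elif result['action'] == 'hotkey':
            # Assumes '<SUPER+C>' splits into key names; 'super' may need
            # remapping (e.g. to 'win' or 'command') per platform.
            pyautogui.hotkey(*result['value'].strip('<>').lower().split('+'))
```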

## Architecture

```
Screenshot + Instruction → SmolVLM2 → Shared MLP → Router
                                                    ↓
                                    ┌───────────────┴───────────────┐
                                    ↓                               ↓
                              Mouse Branch                   Keyboard Branch
                              (XY + Click heads)            (VLM text generation)
```
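
A hedged PyTorch sketch of the router and mouse heads sitting on top of the VLM features; hidden size, class count, and all names here are assumptions for illustration, not the released implementation (the keyboard branch needs no extra head, since it reuses the VLM's text generation):

```python
import torch
import torch.nn as nn

class TessHeads(nn.Module):
    """Illustrative router + action heads over pooled SmolVLM2 features."""
    def __init__(self, hidden_size=960, num_click_types=4):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.GELU())
        self.router = nn.Linear(hidden_size, 2)            # mouse vs. keyboard
        self.xy_head = nn.Linear(hidden_size, 2)           # (x, y) regression
        self.click_head = nn.Linear(hidden_size, num_click_types)

    def forward(self, vlm_features):
        h = self.shared(vlm_features)
        return {
            'branch': self.router(h).argmax(-1),           # 0 = mouse, 1 = keyboard
            'xy': self.xy_head(h).sigmoid(),               # keep coords in [0, 1]
            'click_logits': self.click_head(h),
        }
```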

## Training

- Epochs: 3
- Batch Size: 48
- Optimizer: AdamW (LR 2e-4 heads, 5e-4 embeddings)
- Hardware: NVIDIA H100 80GB
- Training Time: ~8 hours
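
The two learning rates translate directly into AdamW parameter groups. A sketch under assumed names, where `model.heads` and `model.embeddings` are hypothetical stand-ins for the actual parameter partition in the TESS repo:

```python
from torch.optim import AdamW

# Hypothetical attribute names; the real split lives in the training code.
optimizer = AdamW([
    {'params': model.heads.parameters(), 'lr': 2e-4},       # router + action heads
    {'params': model.embeddings.parameters(), 'lr': 5e-4},  # embeddings
])
```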

## Limitations

- Trained primarily on desktop/web screenshots
- English instructions only
- May struggle with unusual UI layouts not seen in training

## License

Apache 2.0

## Citation

```bibtex
@misc{tess2025,
  title={TESS: A Vision-Language-Action Model for Computer Use},
  author={Hussein Lezzaik},
  year={2025},
  url={https://github.com/husseinlezzaik/TESS}
}
```