---
license: apache-2.0
language:
- en
tags:
- computer-use
- gui-agent
- vision-language-model
- screen-understanding
datasets:
- TESS-Computer/agentnet
base_model: HuggingFaceTB/SmolVLM2-500M-Instruct
pipeline_tag: image-text-to-text
---

# TESS-500M

**TESS (Text-Enabled Screen Sense)** is a Vision-Language-Action model for computer use. Given a screenshot and a natural-language instruction, it predicts either a mouse action (click coordinates) or a keyboard action (typing/shortcuts).

## Model Description

- **Base Model**: SmolVLM2-500M-Instruct
- **Architecture**: SmolVLM + Router + Mouse/Keyboard heads
- **Parameters**: 508M total, 48M trainable
- **Training Data**: [AgentNet](https://huggingface.co/datasets/TESS-Computer/agentnet) (~312K samples)

## Usage

```python
import torch
from PIL import Image

# Clone the TESS repo
# git clone https://github.com/yourusername/TESS.git
# cd TESS/model

from test_checkpoint import load_model, predict

# Load model
model, processor = load_model("path/to/checkpoint.pt", device="cuda")

# Run inference
image = Image.open("screenshot.png")
result = predict(model, processor, image, "Click the search button")

print(result)
# Mouse action:    {'action_type': 'mouse', 'xy': array([0.45, 0.32]), 'click_type': 'LEFT_CLICK'}
# Keyboard action: {'action_type': 'keyboard', 'action': 'type', 'value': 'hello world'}
```
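
If the checkpoint is hosted on the Hub rather than saved locally, it can be fetched with `huggingface_hub` before loading. This is a minimal sketch: the `repo_id` and `filename` below are placeholders, not confirmed paths, and the cloned TESS repo still needs to be on your Python path for `test_checkpoint`.

```python
from huggingface_hub import hf_hub_download

from test_checkpoint import load_model, predict  # from the cloned TESS repo

# repo_id and filename are placeholders -- substitute the actual
# repository and checkpoint name for this model.
checkpoint_path = hf_hub_download(
    repo_id="TESS-Computer/TESS-500M",
    filename="checkpoint.pt",
)

model, processor = load_model(checkpoint_path, device="cuda")
```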

## Output Format

**Mouse actions:**
```python
{
    'action_type': 'mouse',
    'xy': [x, y],              # Normalized coordinates (0-1)
    'click_type': 'LEFT_CLICK' | 'RIGHT_CLICK' | 'DOUBLE_CLICK' | ...
}
```

**Keyboard actions:**
```python
{
    'action_type': 'keyboard',
    'action': 'type' | 'press' | 'hotkey',
    'value': 'text to type' | '<ENTER>' | '<SUPER+C>'
}
```
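
These dictionaries map directly onto an OS-level executor. The sketch below is illustrative only and not part of this repository; it assumes `pyautogui`, a single screen to denormalize coordinates against, and a key-name mapping that may need adjusting per platform.

```python
import pyautogui

def execute(result: dict) -> None:
    """Illustrative executor for a TESS prediction (not part of this repo)."""
    if result["action_type"] == "mouse":
        # Denormalize [0, 1] coordinates to the current screen resolution.
        width, height = pyautogui.size()
        x, y = result["xy"][0] * width, result["xy"][1] * height
        if result["click_type"] == "LEFT_CLICK":
            pyautogui.click(x, y)
        elif result["click_type"] == "RIGHT_CLICK":
            pyautogui.click(x, y, button="right")
        elif result["click_type"] == "DOUBLE_CLICK":
            pyautogui.doubleClick(x, y)
    else:  # keyboard
        if result["action"] == "type":
            pyautogui.write(result["value"])
        elif result["action"] == "press":
            pyautogui.press(result["value"].strip("<>").lower())    # e.g. '<ENTER>' -> 'enter'
        elif result["action"] == "hotkey":
            keys = result["value"].strip("<>").lower().split("+")   # e.g. '<SUPER+C>' -> ['super', 'c']
            pyautogui.hotkey(*keys)  # key names may need remapping (e.g. 'super' -> 'win')
```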

## Architecture

```
Screenshot + Instruction → SmolVLM2 → Shared MLP → Router
                                                      │
                                   ┌──────────────────┴──────────────────┐
                                   │                                     │
                             Mouse Branch                        Keyboard Branch
                          (XY + Click heads)                 (VLM text generation)
```
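
In code, the routing stage can be pictured as a small module on top of a pooled VLM hidden state. The snippet below is a schematic re-creation, not the released implementation: the hidden size, the pooling, and the number of click types are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ActionHead(nn.Module):
    """Schematic router + mouse heads over a pooled VLM hidden state.

    Hidden size, MLP shape, and click-type count are illustrative
    assumptions; the released checkpoint may differ.
    """

    def __init__(self, hidden_size: int = 960, num_click_types: int = 6):
        super().__init__()
        self.shared_mlp = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.GELU())
        self.router = nn.Linear(hidden_size, 2)       # mouse vs. keyboard
        self.xy_head = nn.Linear(hidden_size, 2)      # normalized (x, y)
        self.click_head = nn.Linear(hidden_size, num_click_types)

    def forward(self, pooled_state: torch.Tensor) -> dict:
        h = self.shared_mlp(pooled_state)
        return {
            "route_logits": self.router(h),
            "xy": torch.sigmoid(self.xy_head(h)),     # constrain to [0, 1]
            "click_logits": self.click_head(h),
        }
```

When the router selects the keyboard branch, the action string is produced by the VLM's ordinary text generation, so no dedicated keyboard head is required.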

## Training

- **Epochs**: 3
- **Batch Size**: 48
- **Optimizer**: AdamW (LR 2e-4 for the heads, 5e-4 for the embeddings)
- **Hardware**: NVIDIA H100 80GB
- **Training Time**: ~8 hours
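
The two learning rates imply separate AdamW parameter groups. The sketch below shows that wiring with stand-in modules; the real head and embedding modules live in the TESS training code.

```python
import torch
import torch.nn as nn

# Stand-in modules just to show the two-learning-rate AdamW setup described
# above; sizes are arbitrary and the real modules come from the TESS code.
heads = nn.Linear(960, 2)
embeddings = nn.Embedding(32000, 960)

optimizer = torch.optim.AdamW([
    {"params": heads.parameters(), "lr": 2e-4},        # action heads
    {"params": embeddings.parameters(), "lr": 5e-4},   # embeddings
])
```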

## Limitations

- Trained primarily on desktop/web screenshots
- English instructions only
- May struggle with unusual UI layouts not seen in training

## License

Apache 2.0

## Citation

```bibtex
@misc{tess2024,
  title={TESS: Text-Enabled Screen Sense},
  author={Hussein Lezzaik},
  year={2024},
  url={https://github.com/yourusername/TESS}
}
```