---
license: apache-2.0
language:
- en
tags:
- computer-use
- gui-agent
- vision-language-model
- screen-understanding
datasets:
- TESS-Computer/agentnet
base_model: HuggingFaceTB/SmolVLM2-500M-Instruct
pipeline_tag: image-text-to-text
---

# TESS-500M

**TESS (Text-Enabled Screen Sense)** is a Vision-Language-Action model for computer use. Given a screenshot and a natural-language instruction, it predicts either a mouse action (click coordinates) or a keyboard action (typing/shortcuts).

## Model Description

- **Base Model**: SmolVLM2-500M-Instruct
- **Architecture**: SmolVLM + Router + Mouse/Keyboard heads
- **Parameters**: 508M total, 48M trainable (see the count sketch below)
- **Training Data**: [AgentNet](https://huggingface.co/datasets/TESS-Computer/agentnet) (~312K samples)

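The 508M / 48M split can be reproduced with a plain PyTorch parameter count. A minimal sketch, assuming the checkpoint is loaded with the repo's `load_model` helper (shown under Usage) and that `requires_grad` still marks the trainable router/head parameters after loading:

```python
from test_checkpoint import load_model  # provided by the TESS repo (see Usage)

model, _ = load_model("path/to/checkpoint.pt", device="cpu")  # device="cpu" is an assumption

total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"total: {total / 1e6:.0f}M, trainable: {trainable / 1e6:.0f}M")
# Roughly 508M total / 48M trainable is expected, provided requires_grad
# still reflects the training-time freezing of the SmolVLM2 backbone.
```
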
## Usage

```python
import torch
from PIL import Image

# Clone the TESS repo
# git clone https://github.com/yourusername/TESS.git
# cd TESS/model

from test_checkpoint import load_model, predict

# Load model
model, processor = load_model("path/to/checkpoint.pt", device="cuda")

# Run inference
image = Image.open("screenshot.png")
result = predict(model, processor, image, "Click the search button")

print(result)
# Mouse action: {'action_type': 'mouse', 'xy': array([0.45, 0.32]), 'click_type': 'LEFT_CLICK'}
# Keyboard action: {'action_type': 'keyboard', 'action': 'type', 'value': 'hello world'}
```

## Output Format

**Mouse actions:**
```python
{
    'action_type': 'mouse',
    'xy': [x, y],              # Normalized coordinates (0-1)
    'click_type': 'LEFT_CLICK' | 'RIGHT_CLICK' | 'DOUBLE_CLICK' | ...
}
```

**Keyboard actions:**
```python
{
    'action_type': 'keyboard',
    'action': 'type' | 'press' | 'hotkey',
    'value': 'text to type' | '<ENTER>' | '<SUPER+C>'
}
```

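These dicts can be turned into real input events with any desktop-automation library. A minimal sketch using `pyautogui` (not part of this model; the coordinate denormalization, the `<ENTER>`/`<SUPER+C>` token parsing, and the click-type mapping below are illustrative assumptions):

```python
import pyautogui

def execute(result: dict) -> None:
    """Replay a TESS prediction on the local desktop (illustrative sketch)."""
    if result["action_type"] == "mouse":
        # xy is normalized to [0, 1]; scale it to the actual screen resolution.
        width, height = pyautogui.size()
        x, y = result["xy"]
        button = "right" if result["click_type"] == "RIGHT_CLICK" else "left"
        clicks = 2 if result["click_type"] == "DOUBLE_CLICK" else 1
        pyautogui.click(int(x * width), int(y * height), clicks=clicks, button=button)
    else:  # keyboard
        if result["action"] == "type":
            pyautogui.write(result["value"])
        elif result["action"] == "press":
            pyautogui.press(result["value"].strip("<>").lower())        # '<ENTER>' -> 'enter'
        else:  # hotkey
            keys = [k.lower() for k in result["value"].strip("<>").split("+")]
            keys = ["win" if k == "super" else k for k in keys]         # '<SUPER+C>' -> win+c
            pyautogui.hotkey(*keys)
```

Because `xy` is normalized, the same prediction can be replayed at any screen resolution.
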
## Architecture

```
Screenshot + Instruction → SmolVLM2 → Shared MLP → Router
                                                     ↓
                                     ┌───────────────┴───────────────┐
                                     ↓                               ↓
                               Mouse Branch                  Keyboard Branch
                            (XY + Click heads)            (VLM text generation)
```

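For intuition only, a hedged PyTorch sketch of what the routed head stack on top of the SmolVLM2 hidden states could look like; the hidden size, pooling, number of click types, and module names below are illustrative guesses rather than the repo's actual implementation:

```python
import torch
import torch.nn as nn

class TESSHeads(nn.Module):
    """Illustrative router + mouse heads over pooled VLM features (assumed design)."""
    def __init__(self, hidden_size: int = 960, num_click_types: int = 6):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(hidden_size, 512), nn.GELU())  # shared MLP
        self.router = nn.Linear(512, 2)               # mouse vs. keyboard branch
        self.xy_head = nn.Linear(512, 2)              # normalized (x, y)
        self.click_head = nn.Linear(512, num_click_types)

    def forward(self, vlm_hidden: torch.Tensor) -> dict:
        # vlm_hidden: (batch, seq_len, hidden_size) hidden states from the VLM backbone
        pooled = self.shared(vlm_hidden.mean(dim=1))
        return {
            "route_logits": self.router(pooled),
            "xy": torch.sigmoid(self.xy_head(pooled)),   # keep coordinates in [0, 1]
            "click_logits": self.click_head(pooled),
        }
```

When the router picks the keyboard branch, the keystroke text comes from the VLM's own generation, so no separate keyboard head is sketched here.
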
## Training

- **Epochs**: 3
- **Batch Size**: 48
- **Optimizer**: AdamW (LR 2e-4 for the heads, 5e-4 for the embeddings; see the sketch after this list)
- **Hardware**: NVIDIA H100 80GB
- **Training Time**: ~8 hours

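The two learning rates imply per-group optimization. A minimal sketch of how such an AdamW setup is commonly wired, assuming `model` is the loaded TESS model and that embedding parameters can be selected by name (the `"embed"` name filter is an assumption about the repo's module naming):

```python
import torch

# Split the trainable parameters into the two groups named above.
trainable = [(n, p) for n, p in model.named_parameters() if p.requires_grad]

optimizer = torch.optim.AdamW([
    {"params": [p for n, p in trainable if "embed" not in n], "lr": 2e-4},  # heads
    {"params": [p for n, p in trainable if "embed" in n],     "lr": 5e-4},  # embeddings
])
```
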
## Limitations

- Trained primarily on desktop/web screenshots
- English instructions only
- May struggle with unusual UI layouts not seen in training

## License

Apache 2.0

## Citation

```bibtex
@misc{tess2024,
  title={TESS: Text-Enabled Screen Sense},
  author={Hussein Lezzaik},
  year={2024},
  url={https://github.com/yourusername/TESS}
}
```