Abstract
ShowUI-Aloha presents a pipeline that converts unstructured human screen recordings into structured GUI tasks through recording, semantic interpretation, planning, and execution components.
Graphical User Interfaces (GUIs) are central to human-computer interaction, yet automating complex GUI tasks remains a major challenge for autonomous agents, largely due to a lack of scalable, high-quality training data. While recordings of human demonstrations offer a rich data source, they are typically long, unstructured, and unannotated, making them difficult for agents to learn from. To address this, we introduce ShowUI-Aloha, a comprehensive pipeline that transforms unstructured, in-the-wild human screen recordings from desktop environments into structured, actionable tasks. Our framework comprises four key components: (1) a recorder that captures screen video along with precise user interactions such as mouse clicks, keystrokes, and scrolls; (2) a learner that semantically interprets these raw interactions and their surrounding visual context, translating them into descriptive natural-language captions; (3) a planner that reads the parsed demonstrations, maintains task state, and dynamically formulates the next high-level action plan through contextual reasoning; and (4) an executor that faithfully carries out these action plans at the OS level, performing precise clicks, drags, text inputs, and window operations with safety checks and real-time feedback. Together, these components provide a scalable solution for collecting and parsing real-world human data, demonstrating a viable path toward building general-purpose GUI agents that can learn effectively from simply observing humans.
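The abstract does not specify an interface for the recorder→learner→planner→executor hand-off, but the data flow it describes can be pictured with a minimal sketch. Everything below (the `InteractionEvent` fields, the function names `learn`, `plan`, and `execute`, and the pyautogui mention) is an illustrative assumption, not the paper's actual API:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class InteractionEvent:
    """One raw user interaction captured alongside the screen video (assumed schema)."""
    timestamp: float   # seconds from recording start
    kind: str          # "click", "keystroke", or "scroll"
    x: int = 0         # screen coordinates for pointer events
    y: int = 0
    text: str = ""     # typed characters for keystroke events

@dataclass
class Step:
    """A learner-produced natural-language caption paired with its source event."""
    event: InteractionEvent
    caption: str

def learn(events: List[InteractionEvent]) -> List[Step]:
    """Placeholder for the learner: the real system would caption each event
    from the surrounding video frames with a vision-language model."""
    return [Step(e, f"user performs {e.kind} at ({e.x}, {e.y})") for e in events]

def plan(steps: List[Step], done: int) -> str:
    """Placeholder for the planner: pick the next high-level action given the
    parsed demonstration and how many steps have completed so far."""
    return "STOP" if done >= len(steps) else steps[done].caption

def execute(action: str) -> bool:
    """Placeholder for the executor: an OS-level backend (e.g. pyautogui)
    would perform the action after a safety check, then report success."""
    print(f"executing: {action}")
    return True

if __name__ == "__main__":
    demo = [InteractionEvent(0.4, "click", 120, 300),
            InteractionEvent(1.9, "keystroke", text="hello")]
    steps, done = learn(demo), 0
    while (action := plan(steps, done)) != "STOP":
        if execute(action):  # real-time feedback closes the loop
            done += 1
```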
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration (2025)
- OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent (2026)
- iSHIFT: Lightweight Slow-Fast GUI Agent with Adaptive Perception (2025)
- Step-GUI Technical Report (2025)
- ColorBrowserAgent: An Intelligent GUI Agent for Complex Long-Horizon Web Automation (2026)
- ShowUI-$\pi$: Flow-based Generative Models as GUI Dexterous Hands (2025)
- Computer-Use Agents as Judges for Generative User Interface (2025)