Abstract
A world model for desktop software that predicts UI state changes through textual description followed by visual synthesis, improving decision quality and execution robustness in computer-using tasks.
Agents operating in complex software environments benefit from reasoning about the consequences of their actions, as even a single incorrect user interface (UI) operation can derail long, artifact-preserving workflows. This challenge is particularly acute in computer-using scenarios, where real execution does not support counterfactual exploration, making large-scale trial-and-error learning and planning impractical even though the environment is fully digital and deterministic. We introduce the Computer-Using World Model (CUWM), a world model for desktop software that predicts the next UI state given the current state and a candidate action. CUWM adopts a two-stage factorization of UI dynamics: it first predicts a textual description of agent-relevant state changes, and then realizes these changes visually to synthesize the next screenshot. CUWM is trained on offline UI transitions collected from agents interacting with real Microsoft Office applications, and further refined with a lightweight reinforcement learning stage that aligns textual transition predictions with the structural requirements of computer-using environments. We evaluate CUWM via test-time action search, where a frozen agent uses the world model to simulate and compare candidate actions before execution. Across a range of Office tasks, world-model-guided test-time scaling improves decision quality and execution robustness.
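To make the two-stage factorization and the test-time action search concrete, the sketch below shows how a frozen agent might use such a world model to simulate and compare candidate actions before executing one. This is a minimal Python illustration under stated assumptions, not the authors' implementation; all identifiers (UIState, Action, predict_text_delta, render_next_screenshot, score_state) are hypothetical placeholders for the paper's components.

```python
# Minimal sketch of CUWM-style simulation and test-time action search.
# The two model stages and the scoring function are assumed interfaces,
# not the released API.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class UIState:
    screenshot: bytes   # current screen image
    description: str    # textual summary of the most recent state change


@dataclass
class Action:
    name: str           # e.g. "click", "type"
    args: dict          # e.g. {"x": 412, "y": 87} or {"text": "Q3 report"}


def simulate(state: UIState,
             action: Action,
             predict_text_delta: Callable[[UIState, Action], str],
             render_next_screenshot: Callable[[UIState, str], bytes]) -> UIState:
    """Two-stage rollout: first predict a textual description of the
    agent-relevant state change, then realize it visually as the next screenshot."""
    delta_text = predict_text_delta(state, action)            # stage 1: textual transition
    next_image = render_next_screenshot(state, delta_text)    # stage 2: visual synthesis
    return UIState(screenshot=next_image, description=delta_text)


def select_action(state: UIState,
                  candidates: List[Action],
                  predict_text_delta: Callable[[UIState, Action], str],
                  render_next_screenshot: Callable[[UIState, str], bytes],
                  score_state: Callable[[UIState], float]) -> Action:
    """Test-time action search: each candidate proposed by the frozen agent is
    simulated with the world model, and the highest-scoring imagined state wins."""
    best_action, best_score = candidates[0], float("-inf")
    for action in candidates:
        imagined = simulate(state, action, predict_text_delta, render_next_screenshot)
        score = score_state(imagined)   # e.g. an estimate of task progress
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```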