arxiv:2601.12993

Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization

Published on Jan 19 · Submitted by Wanpeng Zhang on Jan 21
#1 Paper of the day

Abstract

AI-generated summary: Being-H0.5 is a Vision-Language-Action model that enables robust cross-embodiment generalization through human-centric learning and a Mixture-of-Transformers architecture with embodiment-specific experts.

We introduce Being-H0.5, a foundational Vision-Language-Action (VLA) model designed for robust cross-embodiment generalization across diverse robotic platforms. While existing VLAs often struggle with morphological heterogeneity and data scarcity, we propose a human-centric learning paradigm that treats human interaction traces as a universal "mother tongue" for physical interaction. To support this, we present UniHand-2.0, the largest embodied pre-training recipe to date, comprising over 35,000 hours of multimodal data across 30 distinct robotic embodiments. Our approach introduces a Unified Action Space that maps heterogeneous robot controls into semantically aligned slots, enabling low-resource robots to bootstrap skills from human data and high-resource platforms. Built upon this human-centric foundation, we design a unified sequential modeling and multi-task pre-training paradigm to bridge human demonstrations and robotic execution. Architecturally, Being-H0.5 utilizes a Mixture-of-Transformers design featuring a novel Mixture-of-Flow (MoF) framework to decouple shared motor primitives from specialized embodiment-specific experts. Finally, to make cross-embodiment policies stable in the real world, we introduce Manifold-Preserving Gating for robustness under sensory shift and Universal Async Chunking to universalize chunked control across embodiments with different latency and control profiles. We empirically demonstrate that Being-H0.5 achieves state-of-the-art results on simulated benchmarks, such as LIBERO (98.9%) and RoboCasa (53.9%), while also exhibiting strong cross-embodiment capabilities on five robotic platforms.
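
The Unified Action Space is the most concrete piece of this pipeline to illustrate: each embodiment's native control vector is projected into shared, semantically aligned slots, with slots the embodiment lacks left masked, so human hand data and high-resource robots can supervise low-resource platforms. The paper does not publish the slot schema, so the sketch below is a minimal illustration; the slot names, widths, and helper function are assumptions, not the released Being-H0.5 interface.

```python
import numpy as np

# Hypothetical slot layout. The paper does not publish the exact schema, so
# the slot names and widths below are illustrative assumptions only.
UNIFIED_SLOTS = {
    "end_effector_pose": 7,    # xyz position + quaternion orientation
    "hand_articulation": 22,   # finger joints; simple grippers leave this masked
    "gripper_aperture": 1,     # normalized open/close
    "base_motion": 3,          # planar velocity for mobile platforms
}
UNIFIED_DIM = sum(UNIFIED_SLOTS.values())  # 33 dims in this illustrative layout


def to_unified_action(robot_action: np.ndarray, slot_map: dict) -> np.ndarray:
    """Project a robot-specific action vector into the shared slot layout.

    `slot_map` maps a unified slot name to the slice of the robot's native
    action vector that fills it; slots the embodiment lacks stay zero.
    """
    unified = np.zeros(UNIFIED_DIM, dtype=np.float32)
    offset = 0
    for name, width in UNIFIED_SLOTS.items():
        if name in slot_map:
            src = robot_action[slot_map[name]]
            unified[offset:offset + len(src)] = src
        offset += width
    return unified


# Example: a 7-DoF arm with a parallel gripper populates only two slots.
arm_action = np.random.uniform(-1.0, 1.0, size=8)  # 7 pose dims + 1 gripper dim
unified = to_unified_action(
    arm_action,
    {"end_effector_pose": slice(0, 7), "gripper_aperture": slice(7, 8)},
)
```

In this sketch the slot layout stays fixed across embodiments, which is what lets a single policy head be shared while unused slots are simply masked per robot.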

Community

Paper author · Paper submitter

We scale human-centric robot learning with Being-H0.5 toward cross-embodiment generalization. Building on over 35,000 hours of data, we unify human hand motion and diverse robot embodiments through a Unified Action Space, and train on all heterogeneous supervision with unified sequence modeling in a single framework. The result is one foundation model that can perceive, describe, and act, enabling robust cross-embodiment generalization and real-world deployment across diverse robots and tasks.
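
To make the unified sequence modeling mentioned above concrete: heterogeneous episodes, whether human demonstrations or robot trajectories, can be flattened into a single token stream mixing visual, language, and action tokens, so one autoregressive model trains on every supervision source. This is a minimal sketch under assumed special-token IDs and a toy tokenization; it is not the released Being-H0.5 data format.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    image_tokens: list        # discretized visual observations
    instruction_tokens: list  # tokenized language command
    action_tokens: list       # discretized Unified Action Space controls


# Hypothetical special-token IDs marking modality boundaries.
BOS, IMG, TXT, ACT, EOS = 0, 1, 2, 3, 4


def build_sequence(ep: Episode) -> list:
    """Interleave modalities into one autoregressive training sequence."""
    return ([BOS, IMG] + ep.image_tokens
            + [TXT] + ep.instruction_tokens
            + [ACT] + ep.action_tokens
            + [EOS])


# A human demonstration and a robot trajectory follow the same path here,
# differing only in which unified-action slots were populated upstream.
seq = build_sequence(Episode(image_tokens=[101, 102],
                             instruction_tokens=[7, 8, 9],
                             action_tokens=[55, 56, 57, 58]))
```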

Blog: https://research.beingbeyond.com/being-h05
arXiv: https://arxiv.org/pdf/2601.12993
GitHub: https://github.com/BeingBeyond/Being-H
HuggingFace: https://huggingface.co/collections/BeingBeyond/being-h05

Models citing this paper 4

Datasets citing this paper 0

Spaces citing this paper 0

Collections including this paper 1