Add model card for Vlaser: Vision-Language-Action Model
#1 · opened by nielsr (HF Staff)

README.md · ADDED (+42 lines)
---
license: mit
pipeline_tag: robotics
---

# Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning

This repository contains the Vlaser model, introduced in the paper [Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning](https://huggingface.co/papers/2510.11027).

**Project Page**: [https://internvl.github.io/blog/2025-10-11-Vlaser/](https://internvl.github.io/blog/2025-10-11-Vlaser/)
**Code**: [https://github.com/OpenGVLab/Vlaser/](https://github.com/OpenGVLab/Vlaser/)

<p align="center">
  <img src="https://github.com/OpenGVLab/Vlaser/raw/main/images/embodied_fig1_1.png" alt="Vlaser Overview" style="width: 100%; height: auto;" />
</p>

## Introduction

While significant research has focused on developing embodied reasoning capabilities with Vision-Language Models (VLMs) or on integrating advanced VLMs into Vision-Language-Action (VLA) models for end-to-end robot control, few studies directly address the critical gap between upstream VLM-based reasoning and downstream VLA policy learning. In this work, we take an initial step toward bridging embodied reasoning with VLA policy learning by introducing **Vlaser** -- a **V**ision-**L**anguage-**A**ction Model with **s**ynergistic **e**mbodied **r**easoning capability, a foundational vision-language model designed to integrate high-level reasoning with low-level control for embodied agents. Built upon the high-quality **Vlaser-6M** dataset, Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks, including spatial reasoning, embodied grounding, embodied QA, and task planning.

Furthermore, we systematically examine how different VLM initializations affect supervised VLA fine-tuning, offering novel insights into mitigating the domain shift between internet-scale pre-training data and embodied-specific policy learning data. Based on these insights, our approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.

## News

- **`2025-10-13`**: 🤖 We release the Vlaser VLM models (Vlaser-2B and Vlaser-8B) as well as the VLA model (Vlaser-2B-VLA) in the [🤗 Vlaser collection](https://huggingface.co/collections/OpenGVLab/vlaser-68e9fd4178da453c348997f8).
- **`2025-10-13`**: 🤖 We release the training and inference code of the Vlaser VLM, based on [InternVL3](https://github.com/OpenGVLab/InternVL).

## Quick Start

For details on the Vlaser VLM, please refer to the [Vlaser VLM Quick Start Guide](https://github.com/OpenGVLab/Vlaser/tree/main/Vlaser_VLM) in the GitHub repository.
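Since the Vlaser VLM is built on InternVL3, loading it for inference is expected to follow the familiar InternVL-style `transformers` interface. The snippet below is a minimal, text-only sketch under that assumption; the repository id `OpenGVLab/Vlaser-2B` and the generation settings are illustrative rather than official, and the Quick Start Guide remains the authoritative reference.

```python
# Minimal text-only inference sketch, assuming Vlaser exposes the
# InternVL3-style `chat` interface via `trust_remote_code`.
# The repo id "OpenGVLab/Vlaser-2B" is an assumption based on the model
# names listed above; consult the Quick Start Guide for exact usage.
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/Vlaser-2B"  # assumed repository id
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Text-only query; for image inputs, build `pixel_values` with the
# InternVL-style preprocessing described in the Quick Start Guide and
# pass it in place of `None`.
generation_config = dict(max_new_tokens=256, do_sample=False)
question = "Plan the steps needed to pick up the red block on the table."
response = model.chat(tokenizer, None, question, generation_config)
print(response)
```

For the VLA policy (Vlaser-2B-VLA) and simulator evaluation, follow the setup in the GitHub repository rather than this sketch.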

## Citation

If you find this work helpful in your research, please consider citing our paper:

```bibtex
@article{luo2025visual,
  title={Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces},
  author={Luo, Gen and Yang, Ganlin and Gong, Ziyang and Chen, Guanzhou and Duan, Haonan and Cui, Erfei and Tong, Ronglei and Hou, Zhi and Zhang, Tianyi and Chen, Zhe and others},
  journal={arXiv preprint arXiv:2506.00123},
  year={2025}
}
```