Add model card for Vlaser: Vision-Language-Action Model

#1
by nielsr - opened
Files changed (1)
  1. README.md +42 -0
README.md ADDED
@@ -0,0 +1,42 @@
---
license: mit
pipeline_tag: robotics
---

# Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning

This repository contains the Vlaser model, introduced in the paper [Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning](https://huggingface.co/papers/2510.11027).

**Project Page**: [https://internvl.github.io/blog/2025-10-11-Vlaser/](https://internvl.github.io/blog/2025-10-11-Vlaser/)
**Code**: [https://github.com/OpenGVLab/Vlaser/](https://github.com/OpenGVLab/Vlaser/)

<p align="center">
  <img src="https://github.com/OpenGVLab/Vlaser/raw/main/images/embodied_fig1_1.png" alt="Vlaser Overview" style="width: 100%; height: auto;" />
</p>

## Introduction

While significant research has focused on developing embodied reasoning capabilities with Vision-Language Models (VLMs) or on integrating advanced VLMs into Vision-Language-Action (VLA) models for end-to-end robot control, few studies directly address the critical gap between upstream VLM-based reasoning and downstream VLA policy learning. In this work, we take an initial step toward bridging embodied reasoning with VLA policy learning by introducing **Vlaser**, a **V**ision-**L**anguage-**A**ction model with **s**ynergistic **e**mbodied **r**easoning capability: a foundational vision-language model designed to integrate high-level reasoning with low-level control for embodied agents. Built upon the high-quality **Vlaser-6M** dataset, Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks, including spatial reasoning, embodied grounding, embodied QA, and task planning.
Furthermore, we systematically examine how different VLM initializations affect supervised VLA fine-tuning, offering novel insights into mitigating the domain shift between internet-scale pre-training data and embodied-specific policy learning data. Based on these insights, our approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.

## News

- **`2025-10-13`**: 🤖 We release the Vlaser VLM models (Vlaser-2B and Vlaser-8B) as well as the VLA model (Vlaser-2B-VLA) in the [🤗 Vlaser collection](https://huggingface.co/collections/OpenGVLab/vlaser-68e9fd4178da453c348997f8).
- **`2025-10-13`**: 🤖 We release the training and inference code of the Vlaser VLM, based on [InternVL3](https://github.com/OpenGVLab/InternVL).

## Quick Start

For details on the Vlaser VLM, please refer to the [Vlaser VLM Quick Start Guide](https://github.com/OpenGVLab/Vlaser/tree/main/Vlaser_VLM) in the GitHub repository.
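
Since Vlaser is built on InternVL3, the snippet below is a minimal sketch of loading the Vlaser VLM with 🤗 Transformers, assuming the checkpoint follows the InternVL3 remote-code convention. The repo id `OpenGVLab/Vlaser-2B`, the `chat` interface, and the text-only call are assumptions based on that convention, not official usage; refer to the Quick Start Guide for the authoritative instructions, including image preprocessing.

```python
# Minimal sketch (not official usage): load the Vlaser VLM via Transformers,
# assuming it follows the InternVL3 remote-code convention.
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/Vlaser-2B"  # assumed repo id from the Vlaser collection

model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,  # loads the model class shipped with the checkpoint
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Text-only query; image inputs require the pixel-value preprocessing
# described in the Quick Start Guide.
generation_config = dict(max_new_tokens=256, do_sample=False)
question = "Plan the steps to pick up the red block and place it in the drawer."
response = model.chat(tokenizer, None, question, generation_config)
print(response)
```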

## Citation

If you find this work helpful in your research, please consider citing our paper:

```bibtex
@article{luo2025visual,
  title={Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces},
  author={Luo, Gen and Yang, Ganlin and Gong, Ziyang and Chen, Guanzhou and Duan, Haonan and Cui, Erfei and Tong, Ronglei and Hou, Zhi and Zhang, Tianyi and Chen, Zhe and others},
  journal={arXiv preprint arXiv:2506.00123},
  year={2025}
}
```