Add model card for Vlaser: Vision-Language-Action Model
#1 · opened by nielsr (HF Staff)

README.md · ADDED (+42 lines)
---
license: mit
pipeline_tag: robotics
---

# Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning

This repository contains the Vlaser model, introduced in the paper [Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning](https://huggingface.co/papers/2510.11027).

**Project Page**: [https://internvl.github.io/blog/2025-10-11-Vlaser/](https://internvl.github.io/blog/2025-10-11-Vlaser/)
**Code**: [https://github.com/OpenGVLab/Vlaser/](https://github.com/OpenGVLab/Vlaser/)

<p align="center">
  <img src="https://github.com/OpenGVLab/Vlaser/raw/main/images/embodied_fig1_1.png" alt="Vlaser Overview" style="width: 100%; height: auto;" />
</p>

## Introduction

While significant research has focused on developing embodied reasoning capabilities with Vision-Language Models (VLMs) or on integrating advanced VLMs into Vision-Language-Action (VLA) models for end-to-end robot control, few studies directly address the critical gap between upstream VLM-based reasoning and downstream VLA policy learning. In this work, we take an initial step toward bridging embodied reasoning with VLA policy learning by introducing **Vlaser** -- a **V**ision-**L**anguage-**A**ction Model with **s**ynergistic **e**mbodied **r**easoning capability, a foundational vision-language model designed to integrate high-level reasoning with low-level control for embodied agents. Built upon the high-quality **Vlaser-6M** dataset, Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks, including spatial reasoning, embodied grounding, embodied QA, and task planning.

Furthermore, we systematically examine how different VLM initializations affect supervised VLA fine-tuning, offering novel insights into mitigating the domain shift between internet-scale pre-training data and embodied-specific policy learning data. Based on these insights, our approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.

## News

- **`2025-10-13`**: 🤖 We release the Vlaser VLM models (Vlaser-2B and Vlaser-8B) as well as the VLA model (Vlaser-2B-VLA) in the [🤗 Vlaser collection](https://huggingface.co/collections/OpenGVLab/vlaser-68e9fd4178da453c348997f8).
- **`2025-10-13`**: 🤖 We release the training and inference code of the Vlaser VLM, based on [InternVL3](https://github.com/OpenGVLab/InternVL).

## Quick Start

For details on the Vlaser VLM, please refer to the [Vlaser VLM Quick Start Guide](https://github.com/OpenGVLab/Vlaser/tree/main/Vlaser_VLM) in the GitHub repository.
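Since the Vlaser VLM is built on InternVL3, loading it for inference is expected to follow the familiar InternVL-style `transformers` interface. The snippet below is a minimal, text-only sketch under that assumption; the repository id `OpenGVLab/Vlaser-2B` and the generation settings are illustrative rather than official, and the Quick Start Guide remains the authoritative reference.

```python
# Minimal text-only inference sketch, assuming Vlaser exposes the
# InternVL3-style `chat` interface via `trust_remote_code`.
# The repo id "OpenGVLab/Vlaser-2B" is an assumption based on the model
# names listed above; consult the Quick Start Guide for exact usage.
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/Vlaser-2B"  # assumed repository id
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Text-only query; for image inputs, build `pixel_values` with the
# InternVL-style preprocessing described in the Quick Start Guide and
# pass it in place of `None`.
generation_config = dict(max_new_tokens=256, do_sample=False)
question = "Plan the steps needed to pick up the red block on the table."
response = model.chat(tokenizer, None, question, generation_config)
print(response)
```

For the VLA policy (Vlaser-2B-VLA) and simulator evaluation, follow the setup in the GitHub repository rather than this sketch.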

## Citation

If you find this work helpful in your research, please consider citing our paper:

```bibtex
@article{luo2025visual,
  title={Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces},
  author={Luo, Gen and Yang, Ganlin and Gong, Ziyang and Chen, Guanzhou and Duan, Haonan and Cui, Erfei and Tong, Ronglei and Hou, Zhi and Zhang, Tianyi and Chen, Zhe and others},
  journal={arXiv preprint arXiv:2506.00123},
  year={2025}
}
```