nielsr (HF Staff) committed
Commit 048e92b · verified · 1 Parent(s): f5cb42d

Improve model card: Add paper abstract, usage, features, and metadata


This PR significantly enhances the model card for `InternVLA-M1_object` by:

* Adding `pipeline_tag: robotics` for better discoverability on the Hugging Face Hub, categorizing it appropriately as a robot policy model.
* Specifying `library_name: transformers` to enable the automated "how to use" widget, as the model's `from_pretrained` method and use of Qwen-VL components indicate compatibility with the Transformers library.
* Including the full paper abstract to provide a comprehensive overview of the model's methodology and results.
* Integrating detailed sections from the project's GitHub README, such as "Key Features," "Target Audience," "Experimental Results," "Environment Setup," "Quick Interactive M1 Demo" (with code snippets for chat and action prediction), and "Model Zoo."
* Adding direct links to the paper, project page, GitHub repository, and a YouTube demo video, along with a prominent teaser image.
* Updating the citation to the `@article` BibTeX format as provided in the GitHub repository.
* Incorporating "Contributing," "Contact," and "Acknowledgements" sections for a more complete model card.

These improvements provide a much richer and more actionable resource for users exploring the InternVLA-M1 model.
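
For reference, the metadata block at the top of `README.md` after this change should look roughly like this — a sketch assembled from the diff below; the card may carry additional fields (e.g., a license) outside the changed hunk:

```yaml
---
tags:
  - robotics
  - vision-language-action-model
  - vision-language-model
pipeline_tag: robotics
library_name: transformers
---
```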

Files changed (1)
README.md +185 -9
README.md CHANGED
@@ -4,25 +4,201 @@ tags:
- robotics
- vision-language-action-model
- vision-language-model
---
# Model Card for InternVLA-M1_object
- InternVLA-M1 is an open-source, end-to-end vision–language–action (VLA) framework for building and researching generalist robot policies.
- 🌐 Homepage: [InternVLA-M1 Project Page](https://internrobotics.github.io/internvla-m1.github.io/)
- 💻 Codebase: [InternVLA-M1 GitHub Repo](https://github.com/InternRobotics/InternVLA-M1)

## Training Details
```
action_chunk: 8
batch_size: 128
training_steps: 30k
```

- ## Citation
- ```
- @misc{internvla2024,
-   title = {InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy},
-   author = {InternVLA-M1 Contributors},
-   year = {2025},
-   booktitle={arXiv},
}
- ```

- robotics
- vision-language-action-model
- vision-language-model
+ pipeline_tag: robotics
+ library_name: transformers
---
+
# Model Card for InternVLA-M1_object
+
+ InternVLA-M1 is an open-source, end-to-end vision–language–action (VLA) framework for building and researching generalist robot policies, as introduced in the paper: [InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy](https://huggingface.co/papers/2510.13778).
+
+ [![Paper](https://img.shields.io/badge/Paper-arXiv-red.svg)](https://arxiv.org/abs/2510.13778) [![Website](https://img.shields.io/badge/Website-GitHub%20Pages-blue.svg)](https://internrobotics.github.io/internvla-m1.github.io) [![Demo](https://img.shields.io/badge/Demo-YouTube-red.svg)](https://youtu.be/n129VDqJCk4)
+
- 🌐 Homepage: [InternVLA-M1 Project Page](https://internrobotics.github.io/internvla-m1.github.io/)
- 💻 Codebase: [InternVLA-M1 GitHub Repo](https://github.com/InternRobotics/InternVLA-M1)

+ <div align="center">
+ <img src="https://github.com/InternRobotics/InternVLA-M1/assets/e83ae046-a503-46a8-95e4-ef381919b7f8" alt="Teaser Image" width="100%"/>
+ </div>
+
+ ## Abstract
+ We introduce InternVLA-M1, a unified framework for spatial grounding and robot control that advances instruction-following robots toward scalable, general-purpose intelligence. Its core idea is spatially guided vision-language-action training, where spatial grounding serves as the critical link between instructions and robot actions. InternVLA-M1 employs a two-stage pipeline: (i) spatial grounding pre-training on over 2.3M spatial reasoning data to determine "where to act" by aligning instructions with visual, embodiment-agnostic positions, and (ii) spatially guided action post-training to decide "how to act" by generating embodiment-aware actions through plug-and-play spatial prompting. This spatially guided training recipe yields consistent gains: InternVLA-M1 outperforms its variant without spatial guidance by +14.6% on SimplerEnv Google Robot, +17% on WidowX, and +4.3% on LIBERO Franka, while demonstrating stronger spatial reasoning capability in box, point, and trace prediction. To further scale instruction following, we built a simulation engine to collect 244K generalizable pick-and-place episodes, enabling a 6.2% average improvement across 200 tasks and 3K+ objects. In real-world clustered pick-and-place, InternVLA-M1 improved by 7.3%, and with synthetic co-training, achieved +20.6% on unseen objects and novel configurations. Moreover, in long-horizon reasoning-intensive scenarios, it surpassed existing works by over 10%. These results highlight spatially guided training as a unifying principle for scalable and resilient generalist robots.
+
+ ## 🔥 Key Features
+
+ 1. **Modular & Extensible**
+    All core components (model architecture, training data, training strategies, evaluation pipeline) are fully decoupled, enabling independent development, debugging, and extension of each module.
+
+ 2. **Dual-System and Dual-Supervision**
+    InternVLA-M1 integrates both a language head and an action head under a unified framework, enabling collaborative training with dual supervision.
+
+ 3. **Efficient Training & Fast Convergence**
+    Learns spatial and visual priors from large-scale multimodal pretraining and transfers them via spatial prompt fine-tuning. Achieves strong performance (e.g., SOTA-level convergence in ~2.5 epochs without separate action pretraining).
+
+ ## 🎯 Target Audience
+
+ 1. Users who want to leverage open-source VLMs (e.g., Qwen2.5-VL) for robot control.
+ 2. Teams co-training action datasets jointly with multimodal (vision–language) data.
+ 3. Researchers exploring alternative VLA architectures and training strategies.
+
+ ## 📊 Experimental Results
+ |              | WidowX   | Google Robot (VA) | Google Robot (VM) | LIBERO   |
+ |--------------|----------|-------------------|-------------------|----------|
+ | $\pi_0$      | 27.1     | 54.8              | 58.8              | 94.2     |
+ | GR00T        | 61.9     | 44.5              | 35.2              | 93.9     |
+ | InternVLA-M1 | **71.7** | **76.0**          | **80.7**          | **95.9** |
+
+ ## 🚀 Quick Start
+
+ ### 🛠 Environment Setup
+
+ ```bash
+ # Clone the repo
+ git clone https://github.com/InternRobotics/InternVLA-M1
+
+ # Create conda environment
+ conda create -n internvla-m1 python=3.10 -y
+ conda activate internvla-m1
+
+ # Install requirements
+ pip install -r requirements.txt
+
+ # Install FlashAttention2
+ pip install flash-attn --no-build-isolation
+
+ # Install InternVLA-M1
+ pip install -e .
+ ```
+
+ ### ⚡ Quick Interactive M1 Demo
+
+ Below are two collapsible examples: InternVLA-M1 chat and action prediction.
+
+ <details open>
+ <summary><b>InternVLA-M1 Chat Demo (image Q&A / Spatial Grounding)</b></summary>
+
+ ```python
+ from InternVLA.model.framework.M1 import InternVLA_M1
+ from PIL import Image
+ import requests
+ from io import BytesIO
+ import torch
+
+ def load_image_from_url(url: str) -> Image.Image:
+     resp = requests.get(url, timeout=15)
+     resp.raise_for_status()
+     img = Image.open(BytesIO(resp.content)).convert("RGB")
+     return img
+
+ saved_model_path = "/PATH/checkpoints/steps_50000_pytorch_model.pt"
+ internVLA_M1 = InternVLA_M1.from_pretrained(saved_model_path)
+
+ # Use the raw image link for direct download
+ image_url = "https://raw.githubusercontent.com/InternRobotics/InternVLA-M1/InternVLA-M1/assets/table.jpeg"
+ image = load_image_from_url(image_url)
+ question = "Give the bounding box for the apple."
+ response = internVLA_M1.chat_with_M1(image, question)
+ print(response)
+ ```
+ </details>
+
+ <details>
+ <summary><b>InternVLA-M1 Action Prediction Demo (two views)</b></summary>
+
+ ```python
+ from InternVLA.model.framework.M1 import InternVLA_M1
+ from PIL import Image
+ import requests
+ from io import BytesIO
+ import torch
+
+ def load_image_from_url(url: str) -> Image.Image:
+     resp = requests.get(url, timeout=15)
+     resp.raise_for_status()
+     img = Image.open(BytesIO(resp.content)).convert("RGB")
+     return img
+
+ saved_model_path = "/PATH/checkpoints/steps_50000_pytorch_model.pt"
+ internVLA_M1 = InternVLA_M1.from_pretrained(saved_model_path)
+
+ image_url = "https://raw.githubusercontent.com/InternRobotics/InternVLA-M1/InternVLA-M1/assets/table.jpeg"
+ view1 = load_image_from_url(image_url)
+ view2 = view1.copy()
+
+ # Construct input: batch size = 1, two views
+ batch_images = [[view1, view2]]  # List[List[PIL.Image]]
+ instructions = ["Pick up the apple and place it on the plate."]
+
+ if torch.cuda.is_available():
+     internVLA_M1 = internVLA_M1.to("cuda")
+
+ pred = internVLA_M1.predict_action(
+     batch_images=batch_images,
+     instructions=instructions,
+     cfg_scale=1.5,
+     use_ddim=True,
+     num_ddim_steps=10,
+ )
+ normalized_actions = pred["normalized_actions"]  # [B, T, action_dim]
+ print(normalized_actions.shape, type(normalized_actions))
+ ```
+ </details>
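
One practical follow-up, sketched here purely as an illustration: mapping the normalized chunk back to an executable action range. The sketch assumes `normalized_actions` comes back as a `[B, T, action_dim]` array in [-1, 1] and that per-dimension `action_low`/`action_high` statistics from training are available; both names and values are hypothetical, and the real denormalization depends on the InternVLA-M1 data pipeline.

```python
import numpy as np

# Hypothetical per-dimension action bounds gathered from the training dataset;
# the actual statistics and normalization scheme depend on the InternVLA-M1 pipeline.
action_low = np.array([-0.05, -0.05, -0.05, -0.25, -0.25, -0.25, 0.0])
action_high = np.array([0.05, 0.05, 0.05, 0.25, 0.25, 0.25, 1.0])

def denormalize(normalized_actions) -> np.ndarray:
    """Map actions from [-1, 1] back to the robot's action range (sketch)."""
    actions = np.asarray(normalized_actions)
    return (actions + 1.0) / 2.0 * (action_high - action_low) + action_low

# Example: a chunk of 8 predicted steps for one sample (matching action_chunk: 8 below).
chunk = denormalize(np.zeros((1, 8, 7)))
print(chunk.shape)  # (1, 8, 7)
```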
+
## Training Details
```
action_chunk: 8
batch_size: 128
training_steps: 30k
```
+ For more detailed training scripts and datasets, please refer to the [InternVLA-M1 GitHub Repo](https://github.com/InternRobotics/InternVLA-M1).

+ ## 📈 Model Zoo
+ We release a series of pretrained models and checkpoints to facilitate reproduction and downstream use.
+
+ ### Available Checkpoints
+
+ | Model | Description | Link |
+ |-------|-------------|------|
+ | **InternVLA-M1** | Main pretrained model | [🤗 Hugging Face](https://huggingface.co/InternRobotics/InternVLA-M1) |
+ | **InternVLA-M1-Pretrain-RT-1-Bridge** | Pretraining on RT-1 Bridge data | [🤗 Hugging Face](https://huggingface.co/InternRobotics/InternVLA-M1-Pretrain-RT-1-Bridge) |
+ | **InternVLA-M1-LIBERO-Long** | Fine-tuned on LIBERO Long-horizon tasks | [🤗 Hugging Face](https://huggingface.co/InternRobotics/InternVLA-M1-LIBERO-Long) |
+ | **InternVLA-M1-LIBERO-Goal** | Fine-tuned on LIBERO Goal-conditioned tasks | [🤗 Hugging Face](https://huggingface.co/InternRobotics/InternVLA-M1-LIBERO-Goal) |
+ | **InternVLA-M1-LIBERO-Spatial** | Fine-tuned on LIBERO Spatial reasoning tasks | [🤗 Hugging Face](https://huggingface.co/InternRobotics/InternVLA-M1-LIBERO-Spatial) |
+ | **InternVLA-M1-LIBERO-Object** | Fine-tuned on LIBERO Object-centric tasks | [🤗 Hugging Face](https://huggingface.co/InternRobotics/InternVLA-M1-LIBERO-Object) |
+
+ ## 🤝 Contributing
+
+ We welcome contributions via Pull Requests or Issues.
+ Please include detailed logs and reproduction steps when reporting bugs.
+
+ ## 📜 Citation
+
+ If you find this useful in your research, please consider citing:
+
+ ```bibtex
+ @article{internvlam1,
+   title   = {InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy},
+   author  = {InternVLA-M1 Contributors},
+   journal = {arXiv preprint arXiv:2510.13778},
+   year    = {2025}
}
+ ```
+
+ ## 📬 Contact
+
+ * Issues: Submit via GitHub Issues with detailed logs and steps
+
+ ## 🙏 Acknowledgements
+
+ We thank the open-source community for their inspiring work. This project builds upon and is inspired by the following projects (alphabetical order):
+ - [IPEC-COMMUNITY](https://huggingface.co/IPEC-COMMUNITY): Curated OXE / LIBERO style multi-task datasets and formatting examples.
+ - [Isaac-GR00T](https://github.com/NVIDIA/Isaac-GR00T): Standardized action data loader (GR00T-LeRobot).
+ - [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL/blob/main/qwen-vl-finetune/README.md): Multimodal input/output format, data loader, and pretrained VLM backbone.
+ - [CogACT](https://github.com/microsoft/CogACT/tree/main/action_model): Reference for a DiT-style action head design.
+ - [Llavavla](https://github.com/JinhuiYE/llavavla): Baseline code structure and engineering design references.
+ - [GenManip Simulation Platform](https://github.com/InternRobotics/GenManip): Simulation platform for generalizable pick-and-place based on Isaac Sim.
+
+ ---
+
+ Thanks for using **InternVLA-M1**! 🌟
+ If you find it useful, please consider giving us a ⭐ on GitHub.
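
To accompany the Model Zoo table in the updated card, here is a minimal download sketch using the standard `huggingface_hub` API; the exact file layout inside each checkpoint repository, and hence the path to hand to `InternVLA_M1.from_pretrained`, is an assumption to verify against the GitHub README.

```python
from huggingface_hub import snapshot_download

# Fetch one of the checkpoints listed in the Model Zoo table.
local_dir = snapshot_download(repo_id="InternRobotics/InternVLA-M1-LIBERO-Object")
print("Checkpoint files downloaded to:", local_dir)

# The demo snippets load weights via InternVLA_M1.from_pretrained(saved_model_path);
# point that path at the downloaded directory or the specific .pt file inside it.
```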