Improve model card: Add paper abstract, usage, features, and metadata
This PR significantly enhances the model card for `InternVLA-M1_object` by:
* Adding `pipeline_tag: robotics` for better discoverability on the Hugging Face Hub, categorizing it appropriately as a robot policy model.
* Specifying `library_name: transformers` to enable the automated "how to use" widget, as the model's `from_pretrained` method and use of Qwen-VL components indicate compatibility with the Transformers library.
* Including the full paper abstract to provide a comprehensive overview of the model's methodology and results.
* Integrating detailed sections from the project's GitHub README, such as "Key Features," "Target Audience," "Experimental Results," "Environment Setup," "Quick Interactive M1 Demo" (with code snippets for chat and action prediction), and "Model Zoo."
* Adding direct links to the paper, project page, GitHub repository, and a YouTube demo video, along with a prominent teaser image.
* Updating the citation to the `@article` BibTeX format as provided in the GitHub repository.
* Incorporating "Contributing," "Contact," and "Acknowledgements" sections for a more complete model card.
These improvements provide a much richer and more actionable resource for users exploring the InternVLA-M1 model.

@@ -4,25 +4,201 @@ tags:
- robotics
- vision-language-action-model
- vision-language-model
pipeline_tag: robotics
library_name: transformers
---

# Model Card for InternVLA-M1_object

InternVLA-M1 is an open-source, end-to-end vision–language–action (VLA) framework for building and researching generalist robot policies, as introduced in the paper [InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy](https://huggingface.co/papers/2510.13778).

[Paper](https://arxiv.org/abs/2510.13778) · [Project Page](https://internrobotics.github.io/internvla-m1.github.io) · [Demo Video](https://youtu.be/n129VDqJCk4)

- 🌐 Homepage: [InternVLA-M1 Project Page](https://internrobotics.github.io/internvla-m1.github.io/)
- 💻 Codebase: [InternVLA-M1 GitHub Repo](https://github.com/InternRobotics/InternVLA-M1)

<div align="center">
  <img src="https://github.com/InternRobotics/InternVLA-M1/assets/e83ae046-a503-46a8-95e4-ef381919b7f8" alt="Teaser Image" width="100%"/>
</div>

## Abstract

We introduce InternVLA-M1, a unified framework for spatial grounding and robot control that advances instruction-following robots toward scalable, general-purpose intelligence. Its core idea is spatially guided vision-language-action training, where spatial grounding serves as the critical link between instructions and robot actions. InternVLA-M1 employs a two-stage pipeline: (i) spatial grounding pre-training on over 2.3M spatial reasoning data to determine "where to act" by aligning instructions with visual, embodiment-agnostic positions, and (ii) spatially guided action post-training to decide "how to act" by generating embodiment-aware actions through plug-and-play spatial prompting. This spatially guided training recipe yields consistent gains: InternVLA-M1 outperforms its variant without spatial guidance by +14.6% on SimplerEnv Google Robot, +17% on WidowX, and +4.3% on LIBERO Franka, while demonstrating stronger spatial reasoning capability in box, point, and trace prediction. To further scale instruction following, we built a simulation engine to collect 244K generalizable pick-and-place episodes, enabling a 6.2% average improvement across 200 tasks and 3K+ objects. In real-world clustered pick-and-place, InternVLA-M1 improved by 7.3%, and with synthetic co-training, achieved +20.6% on unseen objects and novel configurations. Moreover, in long-horizon reasoning-intensive scenarios, it surpassed existing works by over 10%. These results highlight spatially guided training as a unifying principle for scalable and resilient generalist robots.

## 🔥 Key Features

1. **Modular & Extensible**
   All core components (model architecture, training data, training strategies, evaluation pipeline) are fully decoupled, enabling independent development, debugging, and extension of each module.

2. **Dual-System and Dual-Supervision**
   InternVLA-M1 integrates both a language head and an action head under a unified framework, enabling collaborative training with dual supervision (see the schematic sketch after this list).

3. **Efficient Training & Fast Convergence**
   Learns spatial and visual priors from large-scale multimodal pretraining and transfers them via spatial prompt fine-tuning. Achieves strong performance (e.g., SOTA-level convergence in ~2.5 epochs without separate action pretraining).
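
The snippet below is a schematic sketch of what dual supervision means in practice, not the repository's actual training code: a language-modeling loss from the language head and an action loss from the action head are simply summed. The tensor shapes, the 7-dimensional action space, and the unit loss weighting are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Schematic only; shapes, the 7-dim action space, and the loss weighting are
# illustrative assumptions, not taken from the InternVLA-M1 codebase.
lm_logits = torch.randn(2, 16, 32000, requires_grad=True)   # [B, seq_len, vocab] from the language head
lm_labels = torch.randint(0, 32000, (2, 16))                 # grounding / QA token targets
pred_actions = torch.randn(2, 8, 7, requires_grad=True)      # [B, action_chunk, action_dim] from the action head
gt_actions = torch.randn(2, 8, 7)                            # demonstration actions

lm_loss = F.cross_entropy(lm_logits.flatten(0, 1), lm_labels.flatten())
action_loss = F.mse_loss(pred_actions, gt_actions)

# Dual supervision: both heads contribute to a single objective.
loss = lm_loss + 1.0 * action_loss
loss.backward()  # gradients flow to both heads
```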

## 🎯 Target Audience

1. Users who want to leverage open-source VLMs (e.g., Qwen2.5-VL) for robot control.
2. Teams co-training action datasets jointly with multimodal (vision–language) data.
3. Researchers exploring alternative VLA architectures and training strategies.

## 📊 Experimental Results

| Model        | WidowX   | Google Robot (VA) | Google Robot (VM) | LIBERO   |
|--------------|----------|-------------------|-------------------|----------|
| $\pi_0$      | 27.1     | 54.8              | 58.8              | 94.2     |
| GR00T        | 61.9     | 44.5              | 35.2              | 93.9     |
| InternVLA-M1 | **71.7** | **76.0**          | **80.7**          | **95.9** |

## 🚀 Quick Start

### 🛠 Environment Setup

```bash
# Clone the repo
git clone https://github.com/InternRobotics/InternVLA-M1
cd InternVLA-M1

# Create conda environment
conda create -n internvla-m1 python=3.10 -y
conda activate internvla-m1

# Install requirements
pip install -r requirements.txt

# Install FlashAttention2
pip install flash-attn --no-build-isolation

# Install InternVLA-M1
pip install -e .
```
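
After installation, a quick import check confirms the package is on the path; this assumes the editable install exposes the `InternVLA` package used by the demos below.

```python
# Post-install sanity check: the import path matches the demo snippets below.
from InternVLA.model.framework.M1 import InternVLA_M1

print(InternVLA_M1.__name__)  # prints "InternVLA_M1" if the install succeeded
```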

### ⚡ Quick Interactive M1 Demo

Below are two collapsible examples: InternVLA-M1 chat and action prediction.

<details open>
<summary><b>InternVLA-M1 Chat Demo (image Q&A / Spatial Grounding)</b></summary>

```python
from InternVLA.model.framework.M1 import InternVLA_M1
from PIL import Image
import requests
from io import BytesIO
import torch

def load_image_from_url(url: str) -> Image.Image:
    resp = requests.get(url, timeout=15)
    resp.raise_for_status()
    img = Image.open(BytesIO(resp.content)).convert("RGB")
    return img

saved_model_path = "/PATH/checkpoints/steps_50000_pytorch_model.pt"
internVLA_M1 = InternVLA_M1.from_pretrained(saved_model_path)

# Use the raw image link for direct download
image_url = "https://raw.githubusercontent.com/InternRobotics/InternVLA-M1/InternVLA-M1/assets/table.jpeg"
image = load_image_from_url(image_url)
question = "Give the bounding box for the apple."
response = internVLA_M1.chat_with_M1(image, question)
print(response)
```

</details>

<details>
<summary><b>InternVLA-M1 Action Prediction Demo (two views)</b></summary>

```python
from InternVLA.model.framework.M1 import InternVLA_M1
from PIL import Image
import requests
from io import BytesIO
import torch

def load_image_from_url(url: str) -> Image.Image:
    resp = requests.get(url, timeout=15)
    resp.raise_for_status()
    img = Image.open(BytesIO(resp.content)).convert("RGB")
    return img

saved_model_path = "/PATH/checkpoints/steps_50000_pytorch_model.pt"
internVLA_M1 = InternVLA_M1.from_pretrained(saved_model_path)

image_url = "https://raw.githubusercontent.com/InternRobotics/InternVLA-M1/InternVLA-M1/assets/table.jpeg"
view1 = load_image_from_url(image_url)
view2 = view1.copy()

# Construct input: batch size = 1, two views
batch_images = [[view1, view2]]  # List[List[PIL.Image]]
instructions = ["Pick up the apple and place it on the plate."]

if torch.cuda.is_available():
    internVLA_M1 = internVLA_M1.to("cuda")

pred = internVLA_M1.predict_action(
    batch_images=batch_images,
    instructions=instructions,
    cfg_scale=1.5,
    use_ddim=True,
    num_ddim_steps=10,
)
normalized_actions = pred["normalized_actions"]  # [B, T, action_dim]
print(normalized_actions.shape, type(normalized_actions))
```

</details>
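
`predict_action` returns normalized actions; converting them back to robot-specific units requires the normalization statistics used during training (shipped with the training data and configs in the GitHub repo, not shown here). The following is only a minimal sketch, assuming per-dimension scaling to [-1, 1], a 7-dimensional action space, and placeholder bounds.

```python
import numpy as np

# Placeholder bounds; substitute the per-dimension statistics actually used
# to normalize actions during training.
action_low = np.array([-0.05, -0.05, -0.05, -0.5, -0.5, -0.5, 0.0])
action_high = np.array([0.05, 0.05, 0.05, 0.5, 0.5, 0.5, 1.0])

def unnormalize(norm_actions, low, high):
    # Map actions from [-1, 1] back to their original range.
    return (norm_actions + 1.0) / 2.0 * (high - low) + low

# Move `normalized_actions` to CPU/NumPy first if it is a CUDA tensor.
actions = unnormalize(np.asarray(normalized_actions), action_low, action_high)
print(actions.shape)
```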

## Training Details

```
action_chunk: 8
batch_size: 128
training_steps: 30k
```

For more detailed training scripts and datasets, please refer to the [InternVLA-M1 GitHub Repo](https://github.com/InternRobotics/InternVLA-M1).

## 📈 Model Zoo

We release a series of pretrained models and checkpoints to facilitate reproduction and downstream use.

### ✅ Available Checkpoints

| Model | Description | Link |
|-------|-------------|------|
| **InternVLA-M1** | Main pretrained model | [🤗 Hugging Face](https://huggingface.co/InternRobotics/InternVLA-M1) |
| **InternVLA-M1-Pretrain-RT-1-Bridge** | Pretraining on RT-1 Bridge data | [🤗 Hugging Face](https://huggingface.co/InternRobotics/InternVLA-M1-Pretrain-RT-1-Bridge) |
| **InternVLA-M1-LIBERO-Long** | Fine-tuned on LIBERO Long-horizon tasks | [🤗 Hugging Face](https://huggingface.co/InternRobotics/InternVLA-M1-LIBERO-Long) |
| **InternVLA-M1-LIBERO-Goal** | Fine-tuned on LIBERO Goal-conditioned tasks | [🤗 Hugging Face](https://huggingface.co/InternRobotics/InternVLA-M1-LIBERO-Goal) |
| **InternVLA-M1-LIBERO-Spatial** | Fine-tuned on LIBERO Spatial reasoning tasks | [🤗 Hugging Face](https://huggingface.co/InternRobotics/InternVLA-M1-LIBERO-Spatial) |
| **InternVLA-M1-LIBERO-Object** | Fine-tuned on LIBERO Object-centric tasks | [🤗 Hugging Face](https://huggingface.co/InternRobotics/InternVLA-M1-LIBERO-Object) |
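
To fetch one of these checkpoints locally, a standard `huggingface_hub` download works; how the downloaded files map onto `InternVLA_M1.from_pretrained` depends on the checkpoint layout, so see the GitHub repo for the supported loading path.

```python
from huggingface_hub import snapshot_download

# Download a checkpoint repository from the Hugging Face Hub.
local_dir = snapshot_download(repo_id="InternRobotics/InternVLA-M1-LIBERO-Object")
print(local_dir)  # local folder containing the checkpoint files
```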

## 🤝 Contributing

We welcome contributions via Pull Requests or Issues.
Please include detailed logs and reproduction steps when reporting bugs.

## 📜 Citation

If you find this useful in your research, please consider citing:

```bibtex
@article{internvlam1,
  title   = {InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy},
  author  = {InternVLA-M1 Contributors},
  journal = {arXiv preprint arXiv:2510.13778},
  year    = {2025}
}
```

## 📬 Contact

* Issues: Submit via GitHub Issues with detailed logs and steps

## 🙏 Acknowledgements

We thank the open-source community for their inspiring work. This project builds upon and is inspired by the following projects (alphabetical order):

- [IPEC-COMMUNITY](https://huggingface.co/IPEC-COMMUNITY): Curated OXE / LIBERO style multi-task datasets and formatting examples.
- [Isaac-GR00T](https://github.com/NVIDIA/Isaac-GR00T): Standardized action data loader (GR00T-LeRobot).
- [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL/blob/main/qwen-vl-finetune/README.md): Multimodal input/output format, data loader, and pretrained VLM backbone.
- [CogACT](https://github.com/microsoft/CogACT/tree/main/action_model): Reference for a DiT-style action head design.
- [Llavavla](https://github.com/JinhuiYE/llavavla): Baseline code structure and engineering design references.
- [GenManip Simulation Platform](https://github.com/InternRobotics/GenManip): Simulation platform for generalizable pick-and-place based on Isaac Sim.

---

Thanks for using **InternVLA-M1**! 🌟
If you find it useful, please consider giving us a ⭐ on GitHub.