---
license: cc-by-nc-sa-4.0
tags:
  - robotics
  - vision-language-action-model
  - vision-language-model
pipeline_tag: robotics
library_name: transformers
---

Model Card for InternVLA-M1_object

InternVLA-M1 is an open-source, end-to-end vision–language–action (VLA) framework for building and researching generalist robot policies, as introduced in the paper: InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy.

Paper Website Demo

Teaser Image

Abstract

We introduce InternVLA-M1, a unified framework for spatial grounding and robot control that advances instruction-following robots toward scalable, general-purpose intelligence. Its core idea is spatially guided vision-language-action training, where spatial grounding serves as the critical link between instructions and robot actions. InternVLA-M1 employs a two-stage pipeline: (i) spatial grounding pre-training on over 2.3M spatial reasoning data to determine "where to act" by aligning instructions with visual, embodiment-agnostic positions, and (ii) spatially guided action post-training to decide "how to act" by generating embodiment-aware actions through plug-and-play spatial prompting. This spatially guided training recipe yields consistent gains: InternVLA-M1 outperforms its variant without spatial guidance by +14.6% on SimplerEnv Google Robot, +17% on WidowX, and +4.3% on LIBERO Franka, while demonstrating stronger spatial reasoning capability in box, point, and trace prediction. To further scale instruction following, we built a simulation engine to collect 244K generalizable pick-and-place episodes, enabling a 6.2% average improvement across 200 tasks and 3K+ objects. In real-world clustered pick-and-place, InternVLA-M1 improved by 7.3%, and with synthetic co-training, achieved +20.6% on unseen objects and novel configurations. Moreover, in long-horizon reasoning-intensive scenarios, it surpassed existing works by over 10%. These results highlight spatially guided training as a unifying principle for scalable and resilient generalist robots.

πŸ”₯ Key Features

  1. Modular & Extensible
    All core components (model architecture, training data, training strategies, evaluation pipeline) are fully decoupled, enabling independent development, debugging, and extension of each module.

  2. Dual-System and Dual-Supervision
    InternVLA-M1 integrates both a language head and an action head under a unified framework, enabling collaborative training with dual supervision (a conceptual sketch follows this list).

  3. Efficient Training & Fast Convergence
    Learns spatial and visual priors from large-scale multimodal pretraining and transfers them via spatial prompt fine-tuning, achieving strong performance (e.g., SOTA-level convergence in ~2.5 epochs without separate action pretraining).
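
To make the dual-supervision idea concrete, here is a minimal, illustrative sketch of joint training with a language head and an action head on top of a shared backbone. All module and batch-key names here are hypothetical (they are not taken from the InternVLA-M1 codebase), and the action loss is plain regression for simplicity, whereas the released model follows a DiT-style action head design (see the CogACT acknowledgement below). Refer to the repository for the actual implementation.

```python
import torch.nn.functional as F

# Conceptual sketch of dual supervision: one shared VLM backbone feeding two heads.
# `backbone`, `lang_head`, `action_head`, and the batch keys are hypothetical names.
def dual_supervision_loss(backbone, lang_head, action_head, batch, w_lang=1.0, w_act=1.0):
    feats = backbone(batch["images"], batch["instruction_tokens"])  # shared vision-language features

    lang_logits = lang_head(feats)    # [B, L, vocab]   -> spatial grounding / QA supervision
    action_pred = action_head(feats)  # [B, T, act_dim] -> robot trajectory supervision

    loss_lang = F.cross_entropy(lang_logits.flatten(0, 1), batch["target_tokens"].flatten())
    loss_act = F.mse_loss(action_pred, batch["target_actions"])  # simplified; the real head is DiT-style

    return w_lang * loss_lang + w_act * loss_act  # both signals update the shared backbone
```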

🎯 Target Audience

  1. Users who want to leverage open-source VLMs (e.g., Qwen2.5-VL) for robot control.
  2. Teams co-training action datasets jointly with multimodal (vision–language) data.
  3. Researchers exploring alternative VLA architectures and training strategies.

πŸ“Š Experimental Results

| Method | WidowX | Google Robot (VA) | Google Robot (VM) | LIBERO |
|---|---|---|---|---|
| $\pi_0$ | 27.1 | 54.8 | 58.8 | 94.2 |
| GR00T | 61.9 | 44.5 | 35.2 | 93.9 |
| InternVLA-M1 | 71.7 | 76.0 | 80.7 | 95.9 |

πŸš€ Quick Start

πŸ›  Environment Setup

# Clone the repo
git clone https://github.com/InternRobotics/InternVLA-M1
cd InternVLA-M1

# Create conda environment
conda create -n internvla-m1 python=3.10 -y
conda activate internvla-m1

# Install requirements
pip install -r requirements.txt

# Install FlashAttention2
pip install flash-attn --no-build-isolation

# Install InternVLA-M1
pip install -e .
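
After installation, a quick import check can confirm the environment before running the demos. This check is only a suggestion (it is not part of the official setup): the InternVLA.model.framework.M1 import path is the one used in the demos below, and the flash_attn import assumes FlashAttention2 built successfully.

```python
# Optional post-install sanity check (suggested here, not part of the official docs).
import torch
import flash_attn  # raises ImportError if FlashAttention2 did not build
from InternVLA.model.framework.M1 import InternVLA_M1  # import path used by the demos below

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```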

⚑ Quick Interactive M1 Demo

Below are two examples: InternVLA-M1 chat and action prediction.

InternVLA-M1 Chat Demo (image Q&A / Spatial Grounding)
from InternVLA.model.framework.M1 import InternVLA_M1
from PIL import Image
import requests
from io import BytesIO
import torch

def load_image_from_url(url: str) -> Image.Image:
    resp = requests.get(url, timeout=15)
    resp.raise_for_status()
    img = Image.open(BytesIO(resp.content)).convert("RGB")
    return img

saved_model_path = "/PATH/checkpoints/steps_50000_pytorch_model.pt"
internVLA_M1 = InternVLA_M1.from_pretrained(saved_model_path)

# Use the raw image link for direct download
image_url = "https://raw.githubusercontent.com/InternRobotics/InternVLA-M1/InternVLA-M1/assets/table.jpeg"
image = load_image_from_url(image_url)
question = "Give the bounding box for the apple."
response = internVLA_M1.chat_with_M1(image, question)
print(response)
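
Continuing from the demo above, the same chat_with_M1 interface also accepts local images and other spatial prompts (the paper evaluates box, point, and trace prediction). The file name and prompt wording below are illustrative placeholders.

```python
# Local-image variant of the chat demo; the file name and prompts are placeholders.
local_image = Image.open("my_scene.jpg").convert("RGB")
for question in [
    "Give the bounding box for the apple.",
    "Point to the plate.",
]:
    print(question, "->", internVLA_M1.chat_with_M1(local_image, question))
```
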
InternVLA-M1 Action Prediction Demo (two views)
from InternVLA.model.framework.M1 import InternVLA_M1
from PIL import Image
import requests
from io import BytesIO
import torch

def load_image_from_url(url: str) -> Image.Image:
    resp = requests.get(url, timeout=15)
    resp.raise_for_status()
    img = Image.open(BytesIO(resp.content)).convert("RGB")
    return img

saved_model_path = "/PATH/checkpoints/steps_50000_pytorch_model.pt"
internVLA_M1 = InternVLA_M1.from_pretrained(saved_model_path)

image_url = "https://raw.githubusercontent.com/InternRobotics/InternVLA-M1/InternVLA-M1/assets/table.jpeg"
view1 = load_image_from_url(image_url)
view2 = view1.copy()

# Construct input: batch size = 1, two views
batch_images = [[view1, view2]]  # List[List[PIL.Image]]
instructions = ["Pick up the apple and place it on the plate."]

if torch.cuda.is_available():
    internVLA_M1 = internVLA_M1.to("cuda")

pred = internVLA_M1.predict_action(
    batch_images=batch_images,
    instructions=instructions,
    cfg_scale=1.5,
    use_ddim=True,
    num_ddim_steps=10,
)
normalized_actions = pred["normalized_actions"]  # [B, T, action_dim]
print(normalized_actions.shape, type(normalized_actions))
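
predict_action returns normalized actions; mapping them back to the robot's native action space requires the normalization statistics used at training time. The sketch below is an assumption-laden illustration: it presumes simple per-dimension min-max scaling to [-1, 1], and the statistics (action_low, action_high) and the robot env are placeholders that are not shipped with this checkpoint. The chunk length T corresponds to action_chunk (8) listed under Training Details.

```python
import numpy as np

# Illustrative only: assumes [-1, 1] min-max normalization per action dimension.
# `action_low`, `action_high`, and `env` are placeholders, not released artifacts.
def denormalize(normalized_chunk: np.ndarray, action_low: np.ndarray, action_high: np.ndarray) -> np.ndarray:
    """Map [-1, 1] normalized actions back to the robot's native action range."""
    return 0.5 * (normalized_chunk + 1.0) * (action_high - action_low) + action_low

# chunk = np.asarray(normalized_actions)[0]   # first batch element: [T, action_dim]
# actions = denormalize(chunk, action_low, action_high)
# for a in actions:                           # execute the chunk, then re-query the policy
#     env.step(a)
```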

Training Details

action_chunk: 8
batch_size: 128
training_steps: 30k

For more detailed training scripts and datasets, please refer to the InternVLA-M1 GitHub Repo.

πŸ“ˆ Model Zoo

We release a series of pretrained models and checkpoints to facilitate reproduction and downstream use.

βœ… Available Checkpoints

| Model | Description | Link |
|---|---|---|
| InternVLA-M1 | Main pretrained model | πŸ€— Hugging Face |
| InternVLA-M1-Pretrain-RT-1-Bridge | Pretraining on RT-1 Bridge data | πŸ€— Hugging Face |
| InternVLA-M1-LIBERO-Long | Fine-tuned on LIBERO Long-horizon tasks | πŸ€— Hugging Face |
| InternVLA-M1-LIBERO-Goal | Fine-tuned on LIBERO Goal-conditioned tasks | πŸ€— Hugging Face |
| InternVLA-M1-LIBERO-Spatial | Fine-tuned on LIBERO Spatial reasoning tasks | πŸ€— Hugging Face |
| InternVLA-M1-LIBERO-Object | Fine-tuned on LIBERO Object-centric tasks | πŸ€— Hugging Face |

🀝 Contributing

We welcome contributions via Pull Requests or Issues. Please include detailed logs and reproduction steps when reporting bugs.

πŸ“œ Citation

If you find this useful in your research, please consider citing:

@article{internvlam1,
  title   = {InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy},
  author  = {InternVLA-M1 Contributors},
  journal = {arXiv preprint arXiv:2510.13778},
  year    = {2025}
}

πŸ“¬ Contact

  • Issues: Submit via GitHub Issues with detailed logs and steps

πŸ™ Acknowledgements

We thank the open-source community for their inspiring work. This project builds upon and is inspired by the following projects (alphabetical order):

  • IPEC-COMMUNITY: Curated OXE / LIBERO style multi-task datasets and formatting examples.
  • Isaac-GR00T: Standardized action data loader (GR00T-LeRobot).
  • Qwen2.5-VL: Multimodal input/output format, data loader, and pretrained VLM backbone.
  • CogACT: Reference for a DiT-style action head design.
  • Llavavla: Baseline code structure and engineering design references.
  • GenManip Simulation Platform: Simulation platform for generalizable pick-and-place based on Isaac Sim.

Thanks for using InternVLA-M1! 🌟 If you find it useful, please consider giving us a ⭐ on GitHub.