Mantis

This is the official checkpoint for the paper "Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight".

Paper: https://arxiv.org/pdf/2511.16175
Code: https://github.com/zhijie-group/Mantis

🔥 Highlights

Disentangled Visual Foresight augments action learning without overburdening the backbone.
Progressive Training preserves the understanding capabilities of the backbone.
Adaptive Temporal Ensemble reduces inference cost while maintaining stable control.

How to use

This checkpoint is the Mantis model trained on the LIBERO-Spatial dataset. For detailed usage, please refer to our code repository (linked above).
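
As a starting point, the checkpoint files can be fetched locally before following the repository's instructions. The snippet below is a minimal sketch, not the authors' documented workflow: it uses `snapshot_download` from the `huggingface_hub` library, and the repo id `Yysrc/LIBERO-Spatial`, the helper name `download_checkpoint`, and the target directory are assumptions that should be checked against the model page and the repository.

```python
from pathlib import Path

# Assumption: Hub repo id of this checkpoint; verify it on the model page before use.
REPO_ID = "Yysrc/LIBERO-Spatial"

# Hypothetical local folder; any writable path works.
DEFAULT_DIR = "checkpoints/mantis-libero-spatial"


def download_checkpoint(repo_id: str = REPO_ID, local_dir: str = DEFAULT_DIR) -> Path:
    """Download the checkpoint files from the Hub and return the local folder."""
    # Imported lazily so this module can be imported without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id=repo_id,
        local_dir=local_dir,
        # Restrict the download to weights, configs, and docs.
        allow_patterns=["*.safetensors", "*.json", "*.txt", "*.md"],
    )
    return Path(path)
```

Calling `download_checkpoint()` returns the folder containing the downloaded files. How that folder is passed to the Mantis evaluation or inference scripts is defined by the code repository, not by this sketch.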

📝 Citation

If you find our code or models useful in your work, please cite our paper:

@article{yang2025mantis,
  title={Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight},
  author={Yang, Yi and Li, Xueqi and Chen, Yiyang and Song, Jin and Wang, Yihan and Xiao, Zipeng and Su, Jiadi and Qiaoben, You and Liu, Pengfei and Deng, Zhijie},
  journal={arXiv preprint arXiv:2511.16175},
  year={2025}
}

Model size: 6B params (Safetensors)
Tensor types: F32, BF16
