---
library_name: transformers
tags:
- multimodal
- reasoning
- sft
- rl
datasets:
- LightChen2333/M3CoT
- ModalityDance/Omni-Bench
base_model:
- GAIR/Anole-7b-v0.1
license: mit
pipeline_tag: any-to-any
---

# Omni-R1-Zero

Omni-R1-Zero is trained without any multimodal annotations. It bootstraps step-wise visualizations from text-only chain-of-thought (CoT) seeds, then follows the SFT→RL recipe to learn interleaved multimodal reasoning.

<p align="center">
  <a href="https://arxiv.org/abs/2601.09536"><b>Paper</b>👁️</a> ·
  <a href="https://github.com/ModalityDance/Omni-R1"><b>Code</b>🐙</a> ·
  <a href="https://huggingface.co/datasets/ModalityDance/Omni-Bench"><b>Omni-Bench</b>🧪</a>
</p>
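
## Quick start

A minimal inference sketch with 🤗 Transformers. Since the base model (GAIR/Anole-7b-v0.1) is Chameleon-based, the standard Chameleon classes should apply; the hub id below is an assumption based on this card's organization, and decoding any interleaved image tokens back into pixels follows the project code linked above rather than this snippet.

```python
import torch
from transformers import ChameleonForConditionalGeneration, ChameleonProcessor

# Assumed hub id for this checkpoint; adjust if the repo path differs.
model_id = "ModalityDance/Omni-R1-Zero"

processor = ChameleonProcessor.from_pretrained(model_id)
model = ChameleonForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# A text-only question: the model is trained to interleave visual
# reasoning steps, but plain text generation works the same way.
prompt = "How many faces does a cube have? Reason step by step."
inputs = processor(text=prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```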

## Citation
```bibtex
@misc{cheng2026omnir1unifiedgenerativeparadigm,
      title={Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning}, 
      author={Dongjie Cheng and Yongqi Li and Zhixin Ma and Hongru Cai and Yupeng Hu and Wenjie Wang and Liqiang Nie and Wenjie Li},
      year={2026},
      eprint={2601.09536},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2601.09536}, 
}
```