|
|
--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- en |
|
|
- zh |
|
|
tags: |
|
|
- MoE |
|
|
- Omnimodal Large Model |
|
|
- Speech-Driven Multimodal Interaction |
|
|
- Image Generation and Editing
|
|
pipeline_tag: any-to-any |
|
|
base_model: |
|
|
- Qwen/Qwen2.5-7B-Instruct |
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
<h1 align="center">Uni-MoE 2.0-Omni</h1> |
|
|
|
|
|
**Uni-MoE 2.0** is a fully open-source omnimodal model that substantially advances the capabilities of Lychee's Uni-MoE series in language-centric multimodal understanding, reasoning, and generation. It is powered by an Omnimodality 3D RoPE and a Dynamic-Capacity Mixture-of-Experts (MoE) architecture.
|
|
|
|
|
**Uni-MoE 2.0-Omni** is the variant of the Uni-MoE 2.0 series that integrates full-modality understanding with audio and image generation capabilities.
|
|
|
|
|
<div align="center" style="display: flex; justify-content: center; margin-top: 10px;"> |
|
|
<a href="https://idealistxy.github.io/Uni-MoE-v2.github.io/"><img src="https://img.shields.io/badge/π° -Website-228B22" style="margin-right: 5px;"></a> |
|
|
<a href="https://arxiv.org/abs/2511.12609"><img src="https://img.shields.io/badge/π-Paper-8A2BE2" style="margin-right: 5px;"></a> |
|
|
<a href="https://github.com/HITsz-TMG/Uni-MoE"><img src="https://img.shields.io/badge/π¨βπ»-Codes-007ACC" style="margin-right: 5px;"></a> |
|
|
</div> |
|
|
|
|
|
--- |
|
|
|
|
|
**If you enjoy our work or want timely updates, please give us a like and follow us.** |
|
|
|
|
|
## Open-source Plan |
|
|
- [x] Model Checkpoint |
|
|
- [x] [Uni-MoE 2.0-Omni](https://huggingface.co/HIT-TMG/Uni-MoE-2.0-Omni) |
|
|
- [x] [Uni-MoE 2.0-Base](https://huggingface.co/HIT-TMG/Uni-MoE-2.0-Base) |
|
|
- [x] [Uni-MoE 2.0-Thinking](https://huggingface.co/HIT-TMG/Uni-MoE-2.0-Thinking) |
|
|
- [x] [Uni-MoE 2.0-Image](https://huggingface.co/HIT-TMG/Uni-MoE-2.0-Image) |
|
|
- [x] [Uni-MoE 2.0-MoE-TTS](https://huggingface.co/HIT-TMG/Uni-MoE-TTS) |
|
|
- [x] Inference Code: [HITsz-TMG/Uni-MoE-2.0](https://github.com/HITsz-TMG/Uni-MoE/tree/master/Uni-MoE-2) |
|
|
- [x] Training Code: [HITsz-TMG/Uni-MoE-2.0](https://github.com/HITsz-TMG/Uni-MoE/tree/master/Uni-MoE-2) |
|
|
- [x] Technical Report: [arxiv](https://arxiv.org/abs/2511.12609) |
|
|
|
|
|
## Main Results |
|
|
 |
|
|
|
|
|
## Model Introduction |
|
|
|
|
|
<video controls playsinline width="100%" src="https://huggingface.co/HIT-TMG/Uni-MoE-2.0-Omni/resolve/main/imgs/audio.mp4"> |
|
|
</video> |
|
|
|
|
|
<video controls playsinline width="100%" src="https://huggingface.co/HIT-TMG/Uni-MoE-2.0-Omni/resolve/main/imgs/omni.mp4"> |
|
|
</video> |
|
|
|
|
|
## Getting Started |
|
|
|
|
|
### 1. Clone this repository and navigate to the Uni-MoE 2.0 folder |
|
|
```bash
git clone https://github.com/HITsz-TMG/Uni-MoE.git
cd Uni-MoE/Uni-MoE-2
```
|
|
### 2. Set up environment |
|
|
Create a conda environment and install the dependencies listed in the requirements.
|
|
```bash
conda create -n uni_moe_2 python=3.11
conda activate uni_moe_2
pip install torch==2.5.1 torchaudio==2.5.1 torchvision==0.20.1
pip install -r requirements.txt
pip install flash-attn==2.6.0.post1 --no-build-isolation
pip install "clip @ git+https://github.com/openai/CLIP.git@dcba3cb2e2827b402d2701e7e1c7d9fed8a20ef1"
```
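
After installation, a quick sanity check can confirm that the pinned PyTorch build, CUDA, and FlashAttention are usable. This is a minimal sketch, assuming a CUDA-capable GPU; it is not part of the official setup.

```python
# Minimal environment sanity check (illustrative only; adjust to your setup).
import torch

print("torch:", torch.__version__)                 # expected: 2.5.1
print("CUDA available:", torch.cuda.is_available())

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)   # expected: 2.6.0.post1
except ImportError:
    print("flash-attn is not installed; rerun the pip install step above")
```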
|
|
|
|
|
## Example Usage |
|
|
We provide a simple example of how to use this repo. For detailed usage, please refer to the [cookbook](https://github.com/HITsz-TMG/Uni-MoE/tree/master/Uni-MoE-2/examples).
|
|
|
|
|
```python
import torch

from uni_moe.model.processing_qwen2_vl import Qwen2VLProcessor
from uni_moe.model.modeling_out import GrinQwen2VLOutForConditionalGeneration
from uni_moe.qwen_vl_utils import process_mm_info
from uni_moe.model import deepspeed_moe_inference_utils

# Load the processor and the bfloat16 model onto the GPU.
processor = Qwen2VLProcessor.from_pretrained("HIT-TMG/Uni-MoE-2.0-Omni")
model = GrinQwen2VLOutForConditionalGeneration.from_pretrained(
    "HIT-TMG/Uni-MoE-2.0-Omni", torch_dtype=torch.bfloat16
).cuda()
processor.data_args = model.config  # the processor uses the model config as its data args

# A single user turn mixing text, audio, and image inputs.
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "<audio>\n<image>\nAnswer the question in the audio."},
        {"type": "audio", "audio": "examples/assets/audio/quick_start.mp3"},
        {"type": "image", "image": "examples/assets/image/quick_start.jpg"}
    ]
}]

# Build the prompt, then expand the plain placeholders into the model's special multimodal tokens.
texts = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
texts = (texts.replace("<image>", "<|vision_start|><|image_pad|><|vision_end|>")
              .replace("<audio>", "<|audio_start|><|audio_pad|><|audio_end|>")
              .replace("<video>", "<|vision_start|><|video_pad|><|vision_end|>"))
image_inputs, video_inputs, audio_inputs = process_mm_info(messages)

inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    audios=audio_inputs,
    padding=True,
    return_tensors="pt",
)
inputs["input_ids"] = inputs["input_ids"].unsqueeze(0)  # add the batch dimension expected by generate
inputs = inputs.to(device=model.device)

output_ids = model.generate(
    **inputs,
    use_cache=True,
    pad_token_id=processor.tokenizer.eos_token_id,
    max_new_tokens=4096,
    temperature=1.0,
    do_sample=True
)

# Strip the prompt tokens before decoding the model's reply.
text = processor.batch_decode(output_ids[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0]
print(text)
```
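
The same pipeline can also be pointed at video inputs, since the placeholder replacement and `process_mm_info` already handle the `<video>` case. The sketch below is a hedged adaptation of the example above, continuing from the `processor` and `model` objects already created; the video path is a hypothetical placeholder, and the cookbook linked above remains the authoritative reference for multimodal inputs.

```python
# Continues from the `processor` and `model` objects created above.
# The video path below is a hypothetical placeholder; point it at a real file.
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "<video>\nDescribe what happens in this video."},
        {"type": "video", "video": "examples/assets/video/quick_start.mp4"},
    ],
}]

texts = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
texts = (texts.replace("<image>", "<|vision_start|><|image_pad|><|vision_end|>")
              .replace("<audio>", "<|audio_start|><|audio_pad|><|audio_end|>")
              .replace("<video>", "<|vision_start|><|video_pad|><|vision_end|>"))
image_inputs, video_inputs, audio_inputs = process_mm_info(messages)

inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    audios=audio_inputs,
    padding=True,
    return_tensors="pt",
)
inputs["input_ids"] = inputs["input_ids"].unsqueeze(0)
inputs = inputs.to(device=model.device)

output_ids = model.generate(
    **inputs,
    use_cache=True,
    pad_token_id=processor.tokenizer.eos_token_id,
    max_new_tokens=1024,
    do_sample=False,  # greedy decoding for a more stable description
)
print(processor.batch_decode(output_ids[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0])
```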
|
|
|
|
|