---
license: mit
pipeline_tag: any-to-any
language:
- zh
- en
---

# InteractiveOmni
InteractiveOmni-4B 🤗 | InteractiveOmni-8B 🤗 | 📑 Paper
## Introduction

InteractiveOmni is a unified omni-modal model that can simultaneously receive image, audio, text, and video inputs and directly generate coherent text and speech streams, achieving truly integrated interaction. The schematic diagram illustrates multi-turn audio-visual interaction.
### Key Features

* **Strong Performance Across Modalities:** Exhibits omni-modal understanding and speech generation capabilities. InteractiveOmni outperforms similarly sized vision-language, audio-language, and omni-modal models.
* **State-of-the-Art Performance:** Achieves SOTA results on various open-source benchmarks for image, audio, and video understanding, as well as speech conversation.
* **Excellent Interactive Performance:** Delivers a more intelligent audio-visual experience with multi-turn and long-term memory capabilities.
* **Multi-turn Interactive Benchmarks:** Proposes a multi-modal, multi-turn benchmark to evaluate the multi-turn memory and speech interaction of leading MLLMs.
* **On-device Model:** The 4B model achieves 97% of the performance of the 8B model with just 50% of the model size.

### Model Architecture
## Quickstart

### Get the Code

```bash
git clone https://github.com/OpenSenseNova/InteractiveOmni.git
cd InteractiveOmni
pip install -r requirements.txt
```

We provide example code for running `InteractiveOmni` with 🤗 `Transformers`.

> Please use transformers>=4.51.0 and FlashAttention2 to ensure the model works normally.

### Model Loading

```python
import torch
from transformers import AutoTokenizer, AutoModel

path = "sensenova/InteractiveOmni-8B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True).eval().cuda()
```

### Inference with Transformers

```python
import torch
import torchaudio
from transformers import AutoModel, AutoTokenizer

path = "sensenova/InteractiveOmni-8B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=True)

# set the max number of tiles in `max_num`
max_num = 12
frame = 8
generation_config = dict(max_new_tokens=1024, do_sample=True)

# pure-text conversation
messages = [
    {
        'role': "user",
        'content': 'Hello, who are you?',
    }
]
response = model.chat(tokenizer, generation_config, messages)

# audio conversation
messages = [
    {
        'role': "user",
        'content': [
            {"type": "audio", "audio": "assets/hello_en.wav"}
        ]
    }
]
response = model.chat(tokenizer, generation_config, messages)

## generate both audio and text output
messages = [
    {
        'role': "user",
        'content': [
            {"type": "audio", "audio": "assets/hello_zh.wav"}
        ]
    }
]
response, wav_response = model.chat(tokenizer, generation_config, messages, generate_audio=True)
torchaudio.save("result.wav", wav_response.cpu(), 24000, format="wav")

# image-text conversation
messages = [
    {
        'role': "user",
        'content': [
            {"type": "image", "image": 'assets/cat_cup.jpeg'},
            {"type": "text", "text": "Please describe the image shortly."}
        ]
    }
]
response = model.chat(tokenizer, generation_config, messages, max_num)

# image-audio conversation
messages = [
    {
        'role': "user",
        'content': [
            {"type": "image", "image": 'assets/cat_cup.jpeg'},
            {"type": "audio", "audio": "assets/describe_img_en.wav"}
        ]
    }
]
response = model.chat(tokenizer, generation_config, messages, max_num)

## image-audio conversation, generate both audio and text output
messages = [
    {
        'role': "user",
        'content': [
            {"type": "image", "image": 'assets/cat_cup.jpeg'},
            {"type": "audio", "audio": "assets/describe_img_en.wav"}
        ]
    }
]
response, wav_response = model.chat(tokenizer, generation_config, messages, generate_audio=True)
torchaudio.save("result.wav", wav_response.cpu(), 24000, format="wav")

# video conversation
messages = [
    {
        'role': "user",
        'content': [
            {"type": "video", "video": 'video_path'},
            {"type": "text", "text": "Describe this video in detail."}
        ]
    }
]
response = model.chat(tokenizer, generation_config, messages, max_num, frame)
```
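The snippets above are all single turn. Since multi-turn memory is one of InteractiveOmni's key features, conversation history can presumably be carried by passing previous turns back in `messages`. The sketch below, continuing from the setup above, assumes `model.chat` accepts an appended assistant turn as plain text and the follow-up question is illustrative; adjust it to the repository's own multi-turn examples if the expected format differs.

```python
# Multi-turn sketch (assumption: `model.chat` accepts the full message history,
# with the previous assistant reply appended as a plain-text turn).
messages = [
    {
        'role': "user",
        'content': [
            {"type": "image", "image": 'assets/cat_cup.jpeg'},
            {"type": "text", "text": "Please describe the image shortly."}
        ]
    }
]
response = model.chat(tokenizer, generation_config, messages, max_num)

# Append the assistant reply and ask a follow-up question that depends on the
# earlier turn, exercising the model's multi-turn memory.
messages.append({'role': "assistant", 'content': response})
messages.append({
    'role': "user",
    'content': [
        {"type": "text", "text": "What color is the cup you just mentioned?"}
    ]
})
response = model.chat(tokenizer, generation_config, messages, max_num)
```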
### Use audio output

* If users need audio output, the system prompt must be set as follows, otherwise the audio output may not work as expected.

```
You are a highly advanced multimodal conversational AI designed for human-like interaction. You can perceive auditory, visual, speech, and textual inputs, and generate text and speech.
```

```python
messages = [
    {
        "role": "system",
        "content": "You are a highly advanced multimodal conversational AI designed for human-like interaction. You can perceive auditory, visual, speech, and textual inputs, and generate text and speech."
    },
    {
        'role': "user",
        'content': [
            {"type": "audio", "audio": "assets/hello_zh.wav"}
        ]
    }
]
response, wav_response = model.chat(tokenizer, generation_config, messages, generate_audio=True)
torchaudio.save("result_none_speaker.wav", wav_response.cpu(), 24000, format="wav")
```

* Use the default speaker to generate output audio.

```python
messages = [
    {
        "role": "system",
        "content": "You are a highly advanced multimodal conversational AI designed for human-like interaction. You can perceive auditory, visual, speech, and textual inputs, and generate text and speech."
    },
    {
        'role': "user",
        'content': [
            {"type": "audio", "audio": "assets/hello_zh.wav"}
        ]
    }
]
response, wav_response = model.chat(tokenizer, generation_config, messages, generate_audio=True,
                                    speaker_embedding=model.default_speaker_embedding)
torchaudio.save("result_default_speaker.wav", wav_response.cpu(), 24000, format="wav")
```

* Use a custom speaker to generate output audio, similar to voice cloning.

```python
messages = [
    {
        "role": "system",
        "content": "You are a highly advanced multimodal conversational AI designed for human-like interaction. You can perceive auditory, visual, speech, and textual inputs, and generate text and speech."
    },
    {
        'role': "user",
        'content': [
            {"type": "audio", "audio": "assets/hello_zh.wav"}
        ]
    }
]
speaker_embedding = model.extract_speaker_embedding("assets/hello_zh.wav")
response, wav_response = model.chat(tokenizer, generation_config, messages, generate_audio=True,
                                    speaker_embedding=speaker_embedding)
torchaudio.save("result_custom_speaker.wav", wav_response.cpu(), 24000, format="wav")
```
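For repeated voice cloning it can be convenient to extract the reference speaker embedding once and reuse it in later sessions. The sketch below is one way to do this, assuming the object returned by `extract_speaker_embedding` can be persisted with `torch.save`; the file names `my_speaker.pt` and `result_saved_speaker.wav` are illustrative.

```python
import torch
import torchaudio

# Extract the reference speaker embedding once and persist it for later sessions.
# (Assumption: the returned embedding is a tensor-like object that torch.save handles.)
speaker_embedding = model.extract_speaker_embedding("assets/hello_zh.wav")
torch.save(speaker_embedding, "my_speaker.pt")

# In a later session, reload the stored embedding instead of re-extracting it.
speaker_embedding = torch.load("my_speaker.pt")

messages = [
    {
        "role": "system",
        "content": "You are a highly advanced multimodal conversational AI designed for human-like interaction. You can perceive auditory, visual, speech, and textual inputs, and generate text and speech."
    },
    {
        'role': "user",
        'content': [
            {"type": "audio", "audio": "assets/hello_zh.wav"}
        ]
    }
]
response, wav_response = model.chat(tokenizer, generation_config, messages, generate_audio=True,
                                    speaker_embedding=speaker_embedding)
torchaudio.save("result_saved_speaker.wav", wav_response.cpu(), 24000, format="wav")
```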
## Evaluation

InteractiveOmni achieves state-of-the-art performance across a wide range of multi-modal understanding and speech generation benchmarks.

### Image Understanding

| Model | MMBench | MMStar | MMMU | MathVista | HallusionBench | AI2D | OCRBench | Avg |
|---|---|---|---|---|---|---|---|---|
| **Vision-Language Model** | | | | | | | | |
| InternVL3-8B | 82.1 | 68.7 | 62.2 | 70.5 | 49.0 | 85.1 | 88.4 | 72.3 |
| InternVL3.5-8B | 79.5 | 69.3 | 73.4 | 78.4 | 54.5 | 84.0 | 84.0 | 74.7 |
| Qwen2.5-VL-7B | 82.2 | 64.1 | 58.0 | 68.1 | 51.9 | 84.3 | 88.8 | 71.1 |
| **Omni Model** | | | | | | | | |
| GPT-4o-mini | 76.0 | 54.8 | 60.0 | 52.5 | 46.1 | 77.8 | 78.5 | 63.7 |
| VITA-1.5 | 76.8 | 60.2 | 52.6 | 66.2 | 44.6 | 79.2 | 74.1 | 64.8 |
| Ming-Lite-Omni | 80.8 | 64.7 | 56.3 | 71.6 | 55.0 | 83.1 | 88.4 | 71.4 |
| Qwen2.5-Omni-7B | 81.3 | 64.0 | 59.2 | 67.9 | 47.4 | 83.2 | 83.4 | 69.5 |
| InteractiveOmni-4B | 78.9 | 62.6 | 61.1 | 61.7 | 52.2 | 83.8 | 80.0 | 68.6 |
| InteractiveOmni-8B | 81.4 | 66.8 | 66.9 | 68.0 | 61.3 | 84.3 | 83.7 | 73.2 |
### Video Understanding

| Model | Video-MME (w/o sub) | Video-MME (w sub) | MLVU (M-Avg) | LongVideoBench (val total) | Avg |
|---|---|---|---|---|---|
| **Vision-Language Model** | | | | | |
| InternVL3-8B | 66.3 | 68.9 | 71.4 | 58.8 | 66.4 |
| InternVL3.5-8B | 66.0 | 68.6 | 70.2 | 62.1 | 66.7 |
| Qwen2.5-VL-7B | 65.1 | 71.6 | 70.2 | 56.0 | 64.5 |
| **Omni Model** | | | | | |
| GPT-4o-mini | 64.8 | - | - | - | - |
| Qwen2.5-Omni-7B | 64.3 | 72.4 | - | - | - |
| InteractiveOmni-4B | 63.3 | 69.3 | 68.0 | 57.0 | 64.4 |
| InteractiveOmni-8B | 66.0 | 71.8 | 71.6 | 59.1 | 67.1 |
### Audio Understanding

| Datasets | Qwen2-Audio | Step-Audio-Chat | Kimi-Audio | Qwen2.5-Omni-7B | InteractiveOmni-4B | InteractiveOmni-8B |
|---|---|---|---|---|---|---|
| **ASR (WER)** | | | | | | |
| Wenetspeech test-net | 10.60 | 8.75 | 5.37 | 5.90 | 5.40 | 5.04 |
| Wenetspeech test-meeting | 10.68 | 9.52 | 6.28 | 7.70 | 6.95 | 5.55 |
| LibriSpeech test-clean | 1.60 | 3.19 | 1.28 | 1.80 | 1.73 | 1.64 |
| LibriSpeech test-other | 3.60 | 10.67 | 2.42 | 3.40 | 3.69 | 3.41 |
| Aishell-2 IOS | 4.48 | 3.57 | 2.56 | 2.56 | 2.85 | 2.18 |
| ChildMandarin | 14.62 | - | - | 19.34 | 17.21 | 14.03 |
| **Audio Understanding** | | | | | | |
| MMAU | 56.60 | - | 65.20 | 65.60 | 72.00 | 67.39 |
| MELD | 55.30 | 33.54 | 59.13 | 57.00 | 57.16 | 57.55 |
| ClothoAQA dev | 72.63 | 44.98 | 73.18 | 73.12 | 71.91 | 72.98 |
| ClothoAQA test | 71.73 | 45.84 | 71.24 | 72.86 | 71.28 | 74.49 |
### Omni-modal Understanding

| Model | Speech | Sound Event | Music | Avg |
|---|---|---|---|---|
| **OmniBench** | | | | |
| MiniCPM-o-2.6 | - | - | - | 40.50 |
| Baichuan-Omni-1.5 | - | - | - | 42.90 |
| Qwen2.5-Omni-7B | 55.25 | 60.00 | 52.83 | 56.13 |
| InteractiveOmni-4B | 60.70 | 61.51 | 42.45 | 59.19 |
| InteractiveOmni-8B | 60.18 | 62.64 | 55.66 | 60.33 |
### Speech-to-text

**OpenAudioBench**

| Model | Reasoning QA | Llama Questions | Web Questions | TriviaQA | AlpacaEval | Avg |
|---|---|---|---|---|---|---|
| Qwen2-Audio | 42.77 | 69.67 | 45.20 | 40.30 | 57.19 | 51.03 |
| GLM-4-Voice | 47.43 | 76.00 | 55.40 | 51.80 | 57.89 | 57.70 |
| VITA-1.5 | 41.00 | 74.20 | 57.30 | 46.80 | 68.20 | 57.50 |
| Step-Audio-chat | 60.00 | 72.33 | 73.00 | 56.80 | 56.53 | 63.73 |
| Baichuan-Audio | 41.90 | 78.40 | 64.50 | 61.70 | 77.40 | 64.78 |
| Kimi-Audio | 58.02 | 79.33 | 70.20 | 62.10 | 75.73 | 69.08 |
| MiniCPM-o-2.6 | 38.60 | 77.80 | 68.60 | 61.90 | 51.80 | 59.74 |
| Baichuan-Omni-1.5 | 50.00 | 78.50 | 59.10 | 57.20 | 77.90 | 64.54 |
| Qwen2.5-Omni-7B | 63.76 | 75.33 | 62.80 | 57.06 | 72.76 | 66.34 |
| InteractiveOmni-4B | 69.11 | 79.33 | 65.80 | 56.40 | 74.87 | 69.10 |
| InteractiveOmni-8B | 71.68 | 80.67 | 70.30 | 66.50 | 74.57 | 72.74 |

**VoiceBench**

| Model | AlpacaEval | CommonEval | WildVoice | SD-QA | MMSU | OpenBookQA | IFEval | BBH | AdvBench | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2-Audio | 3.69 | 3.40 | 3.01 | 35.35 | 35.43 | 49.01 | 54.70 | 22.57 | 98.85 | 55.32 |
| GLM-4-Voice | 4.06 | 3.48 | 3.18 | 43.31 | 40.11 | 52.97 | 52.80 | 24.91 | 88.08 | 57.40 |
| VITA-1.5 | 4.21 | 3.66 | 3.48 | 38.88 | 52.15 | 71.65 | 55.30 | 38.14 | 97.69 | 64.53 |
| Step-Audio-chat | 3.99 | 2.99 | 2.93 | 46.84 | 28.72 | 31.87 | 50.60 | 29.19 | 65.77 | 50.13 |
| Baichuan-Audio | 4.41 | 4.08 | 3.92 | 45.84 | 53.19 | 71.65 | 54.80 | 50.31 | 99.42 | 69.27 |
| Kimi-Audio | 4.46 | 3.97 | 4.20 | 63.12 | 62.17 | 83.52 | 69.70 | 61.10 | 100.0 | 76.91 |
| MiniCPM-o-2.6 | 4.42 | 4.15 | 3.94 | 50.72 | 54.78 | 78.02 | 60.40 | 49.25 | 97.69 | 71.23 |
| Baichuan-Omni-1.5 | 4.50 | 4.05 | 4.06 | 43.40 | 57.25 | 74.51 | 62.70 | 54.54 | 97.31 | 71.32 |
| Qwen2.5-Omni-7B | 4.50 | 3.84 | 3.89 | 56.40 | 61.32 | 80.90 | 66.70 | 53.50 | 99.20 | 73.60 |
| InteractiveOmni-4B | 4.27 | 4.20 | 3.94 | 41.41 | 63.24 | 82.64 | 55.90 | 60.90 | 99.62 | 73.10 |
| InteractiveOmni-8B | 4.61 | 4.34 | 4.21 | 44.67 | 65.26 | 86.37 | 73.30 | 57.99 | 99.42 | 76.69 |
### Speech Generation

| Model | test-zh | test-en | test-zh-hard |
|---|---|---|---|
| **TTS Model** | | | |
| MaskGCT | 2.27 | 2.62 | 10.27 |
| SeedTTS | 1.12 | 2.25 | 7.59 |
| CosyVoice 2 | 1.45 | 2.57 | 6.83 |
| **MLLM** | | | |
| MinMo | 2.48 | 2.90 | - |
| Ming-Lite-Omni | 1.69 | 4.31 | - |
| Qwen2.5-Omni-7B | 1.70 | 2.72 | 7.97 |
| InteractiveOmni-4B | 1.37 | 3.73 | 8.02 |
| InteractiveOmni-8B | 1.56 | 2.33 | 7.92 |