nielsr (HF Staff) committed · Commit eee7cc3 (verified) · 1 parent: 5646a54

Update model card for Fun-Audio-Chat-8B


This PR updates the model card to reflect the Fun-Audio-Chat-8B model as described in the [Fun-Audio-Chat Technical Report](https://huggingface.co/papers/2512.20156).

Key changes include:
- Added `library_name: transformers` to the metadata (verified via `config.json`).
- Updated the content to describe the Fun-Audio-Chat architecture, including Dual-Resolution Speech Representations (DRSR) and Core-Cocktail Training.
- Added links to the official GitHub repository and project page.
- Provided quick start instructions based on the GitHub README.
- Updated the citation section to include the latest technical report.
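
For reference, the YAML front matter resulting from these metadata changes reads as follows (reconstructed from the diff below; the four language entries the diff view elides are likewise elided here):

```yaml
---
language:
- zh
- en
# ...four entries elided in the diff view...
- it
- ru
- de
license: apache-2.0
pipeline_tag: text-to-speech
library_name: transformers
---
```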

Files changed (1): README.md (+44 −200)
README.md CHANGED
@@ -1,5 +1,4 @@
  ---
- license: apache-2.0
  language:
  - zh
  - en
@@ -10,226 +9,71 @@ language:
  - it
  - ru
  - de
  pipeline_tag: text-to-speech
  ---

- ![SVG Banners](https://svg-banners.vercel.app/api?type=origin&text1=CosyVoice🤠&text2=Text-to-Speech%20💖%20Large%20Language%20Model&width=800&height=210)
-
- ## 👉🏻 CosyVoice 👈🏻

- **Fun-CosyVoice 3.0**: [Demos](https://funaudiollm.github.io/cosyvoice3/); [Paper](https://arxiv.org/abs/2505.17589); [Modelscope](https://www.modelscope.cn/models/FunAudioLLM/Fun-CosyVoice3-0.5B-2512); [Huggingface](https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512); [CV3-Eval](https://github.com/FunAudioLLM/CV3-Eval)

- **CosyVoice 2.0**: [Demos](https://funaudiollm.github.io/cosyvoice2/); [Paper](https://arxiv.org/abs/2412.10117); [Modelscope](https://www.modelscope.cn/models/iic/CosyVoice2-0.5B); [HuggingFace](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B)

- **CosyVoice 1.0**: [Demos](https://fun-audio-llm.github.io); [Paper](https://funaudiollm.github.io/pdf/CosyVoice_v1.pdf); [Modelscope](https://www.modelscope.cn/models/iic/CosyVoice-300M); [HuggingFace](https://huggingface.co/FunAudioLLM/CosyVoice-300M)

- ## Highlight🔥

- **Fun-CosyVoice 3.0** is an advanced text-to-speech (TTS) system based on large language models (LLM), surpassing its predecessor (CosyVoice 2.0) in content consistency, speaker similarity, and prosody naturalness. It is designed for zero-shot multilingual speech synthesis in the wild.
  ### Key Features
- - **Language Coverage**: Covers 9 common languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian), 18+ Chinese dialects/accents (Guangdong, Minnan, Sichuan, Dongbei, Shan3xi, Shan1xi, Shanghai, Tianjin, Shandong, Ningxia, Gansu, etc.) and meanwhile supports both multi-lingual/cross-lingual zero-shot voice cloning.
- - **Content Consistency & Naturalness**: Achieves state-of-the-art performance in content consistency, speaker similarity, and prosody naturalness.
- - **Pronunciation Inpainting**: Supports pronunciation inpainting of Chinese Pinyin and English CMU phonemes, providing more controllability and thus suitable for production use.
- - **Text Normalization**: Supports reading of numbers, special symbols and various text formats without a traditional frontend module.
- - **Bi-Streaming**: Support both text-in streaming and audio-out streaming, and achieves latency as low as 150ms while maintaining high-quality audio output.
- - **Instruct Support**: Supports various instructions such as languages, dialects, emotions, speed, volume, etc.
-
-
- ## Roadmap
-
- - [x] 2025/12
-
-   - [x] release Fun-CosyVoice3-0.5B-2512 base model, rl model and its training/inference script
-   - [x] release Fun-CosyVoice3-0.5B modelscope gradio space
-
- - [x] 2025/08
-
-   - [x] Thanks to the contribution from NVIDIA Yuekai Zhang, add triton trtllm runtime support and cosyvoice2 grpo training support
-
- - [x] 2025/07
-
-   - [x] release Fun-CosyVoice 3.0 eval set
-
- - [x] 2025/05
-
-   - [x] add CosyVoice2-0.5B vllm support
-
- - [x] 2024/12
-
-   - [x] 25hz CosyVoice2-0.5B released
-
- - [x] 2024/09
-
-   - [x] 25hz CosyVoice-300M base model
-   - [x] 25hz CosyVoice-300M voice conversion function
-
- - [x] 2024/08
-
-   - [x] Repetition Aware Sampling(RAS) inference for llm stability
-   - [x] Streaming inference mode support, including kv cache and sdpa for rtf optimization
-
- - [x] 2024/07
-
-   - [x] Flow matching training support
-   - [x] WeTextProcessing support when ttsfrd is not available
-   - [x] Fastapi server and client
-
- ## Evaluation
 
- | Model | Open-Source | Model Size | test-zh<br>CER (%) ↓ | test-zh<br>Speaker Similarity (%) ↑ | test-en<br>WER (%) ↓ | test-en<br>Speaker Similarity (%) ↑ | test-hard<br>CER (%) ↓ | test-hard<br>Speaker Similarity (%) ↑ |
- | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
- | Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 | - | - |
- | Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 | 7.59 | 77.6 |
- | MiniMax-Speech | ❌ | - | 0.83 | 78.3 | 1.65 | 69.2 | - | - |
- | F5-TTS | ✅ | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 | 8.67 | 71.3 |
- | Spark TTS | ✅ | 0.5B | 1.20 | 66.0 | 1.98 | 57.3 | - | - |
- | CosyVoice2 | ✅ | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 | 6.83 | 72.4 |
- | FireRedTTS2 | ✅ | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 | - | - |
- | Index-TTS2 | ✅ | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 | 7.12 | 75.5 |
- | VibeVoice-1.5B | ✅ | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 | - | - |
- | VibeVoice-Realtime | ✅ | 0.5B | - | - | 2.05 | 63.3 | - | - |
- | HiggsAudio-v2 | ✅ | 3B | 1.50 | 74.0 | 2.44 | 67.7 | - | - |
- | VoxCPM | ✅ | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 | 8.87 | 73.0 |
- | GLM-TTS | ✅ | 1.5B | 1.03 | 76.1 | - | - | - | - |
- | GLM-TTS RL | ✅ | 1.5B | 0.89 | 76.4 | - | - | - | - |
- | Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 | 6.71 | 75.8 |
- | Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 | 5.44 | 75.0 |
 
- ## Install
-
- ### Clone and install
-
- - Clone the repo
- ``` sh
- git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
- # If you failed to clone the submodule due to network failures, please run the following command until success
- cd CosyVoice
- git submodule update --init --recursive
- ```
-
- - Install Conda: please see https://docs.conda.io/en/latest/miniconda.html
- - Create Conda env:
-
- ``` sh
- conda create -n cosyvoice -y python=3.10
- conda activate cosyvoice
- pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
-
- # If you encounter sox compatibility issues
- # ubuntu
- sudo apt-get install sox libsox-dev
- # centos
- sudo yum install sox sox-devel
- ```
-
- ### Model download
-
- ``` python
- from huggingface_hub import snapshot_download
- snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
- snapshot_download('FunAudioLLM/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
  ```

- Optionally, you can unzip `ttsfrd` resource and install `ttsfrd` package for better text normalization performance.
-
- Notice that this step is not necessary. If you do not install `ttsfrd` package, we will use wetext by default.

- ``` sh
- cd pretrained_models/CosyVoice-ttsfrd/
- unzip resource.zip -d .
- pip install ttsfrd_dependency-0.1-py3-none-any.whl
- pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl
- ```

- ### Basic Usage
-
- ``` python
- import sys
- sys.path.append('third_party/Matcha-TTS')
- from cosyvoice.cli.cosyvoice import AutoModel
- import torchaudio
-
- """ CosyVoice3 Usage, check https://funaudiollm.github.io/cosyvoice3/ for more details
- """
- cosyvoice = AutoModel(model_dir='pretrained_models/Fun-CosyVoice3-0.5B')
- # en zero_shot usage
- for i, j in enumerate(cosyvoice.inference_zero_shot('CosyVoice is undergoing a comprehensive upgrade, providing more accurate, stable, faster, and better voice generation capabilities.', 'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
-                                                     './asset/zero_shot_prompt.wav', stream=False)):
-     torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
- # zh zero_shot usage
- for i, j in enumerate(cosyvoice.inference_zero_shot('八百标兵奔北坡,北坡炮兵并排跑,炮兵怕把标兵碰,标兵怕碰炮兵炮。', 'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
-                                                     './asset/zero_shot_prompt.wav', stream=False)):
-     torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
-
- # fine grained control, for supported control, check cosyvoice/tokenizer/tokenizer.py#L280
- for i, j in enumerate(cosyvoice.inference_cross_lingual('You are a helpful assistant.<|endofprompt|>[breath]因为他们那一辈人[breath]在乡里面住的要习惯一点,[breath]邻居都很活络,[breath]嗯,都很熟悉。[breath]',
-                                                         './asset/zero_shot_prompt.wav', stream=False)):
-     torchaudio.save('fine_grained_control_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
-
- # instruct usage, for supported control, check cosyvoice/utils/common.py#L28
- for i, j in enumerate(cosyvoice.inference_instruct2('好少咯,一般系放嗰啲国庆啊,中秋嗰啲可能会咯。', 'You are a helpful assistant. 请用广东话表达。<|endofprompt|>',
-                                                     './asset/zero_shot_prompt.wav', stream=False)):
-     torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
- for i, j in enumerate(cosyvoice.inference_instruct2('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', 'You are a helpful assistant. 请用尽可能快地语速说一句话。<|endofprompt|>',
-                                                     './asset/zero_shot_prompt.wav', stream=False)):
-     torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
-
- # hotfix usage
- for i, j in enumerate(cosyvoice.inference_zero_shot('高管也通过电话、短信、微信等方式对报道[j][ǐ]予好评。', 'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
-                                                     './asset/zero_shot_prompt.wav', stream=False)):
-     torchaudio.save('hotfix_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
  ```

- ## Discussion & Communication
-
- You can directly discuss on [Github Issues](https://github.com/FunAudioLLM/CosyVoice/issues).
-
- You can also scan the QR code to join our official Dingding chat group.

- <img src="./asset/dingding.png" width="250px">

- ## Acknowledge
-
- 1. We borrowed a lot of code from [FunASR](https://github.com/modelscope/FunASR).
- 2. We borrowed a lot of code from [FunCodec](https://github.com/modelscope/FunCodec).
- 3. We borrowed a lot of code from [Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS).
- 4. We borrowed a lot of code from [AcademiCodec](https://github.com/yangdongchao/AcademiCodec).
- 5. We borrowed a lot of code from [WeNet](https://github.com/wenet-e2e/wenet).
-
- ## Citations
-
- ``` bibtex
- @article{du2024cosyvoice,
-   title={Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens},
-   author={Du, Zhihao and Chen, Qian and Zhang, Shiliang and Hu, Kai and Lu, Heng and Yang, Yexin and Hu, Hangrui and Zheng, Siqi and Gu, Yue and Ma, Ziyang and others},
-   journal={arXiv preprint arXiv:2407.05407},
-   year={2024}
- }
-
- @article{du2024cosyvoice,
-   title={Cosyvoice 2: Scalable streaming speech synthesis with large language models},
-   author={Du, Zhihao and Wang, Yuxuan and Chen, Qian and Shi, Xian and Lv, Xiang and Zhao, Tianyu and Gao, Zhifu and Yang, Yexin and Gao, Changfeng and Wang, Hui and others},
-   journal={arXiv preprint arXiv:2412.10117},
-   year={2024}
- }
-
- @article{du2025cosyvoice,
-   title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
-   author={Du, Zhihao and Gao, Changfeng and Wang, Yuxuan and Yu, Fan and Zhao, Tianyu and Wang, Hao and Lv, Xiang and Wang, Hui and Shi, Xian and An, Keyu and others},
-   journal={arXiv preprint arXiv:2505.17589},
-   year={2025}
  }

- @inproceedings{lyu2025build,
-   title={Build LLM-Based Zero-Shot Streaming TTS System with Cosyvoice},
-   author={Lyu, Xiang and Wang, Yuxuan and Zhao, Tianyu and Wang, Hao and Liu, Huadai and Du, Zhihao},
-   booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
-   pages={1--2},
  year={2025},
-   organization={IEEE}
  }
- ```
-
- ## Disclaimer
- The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.

  ---
  language:
  - zh
  - en
  - it
  - ru
  - de
+ license: apache-2.0
  pipeline_tag: text-to-speech
+ library_name: transformers
  ---

+ # Fun-Audio-Chat-8B

+ **Fun-Audio-Chat** is a Large Audio Language Model (LALM) built for natural, low-latency voice interactions. It addresses challenges such as temporal resolution mismatch and catastrophic forgetting through two key innovations: **Dual-Resolution Speech Representations (DRSR)** and **Core-Cocktail Training**.

+ [[Technical Report](https://huggingface.co/papers/2512.20156)] [[GitHub](https://github.com/FunAudioLLM/Fun-Audio-Chat)] [[Project Page](https://funaudiollm.github.io/funaudiochat)]

+ ## Overview

+ Fun-Audio-Chat introduces a dual-stream approach that balances efficiency and quality: the shared LLM backbone processes audio at an efficient 5Hz frame rate, while a Speech Refined Head generates high-quality speech tokens at 25Hz, cutting GPU computation roughly in half compared with models that run a single high frame rate throughout. To prevent the loss of text-based reasoning capabilities, the model employs Core-Cocktail training, a two-stage fine-tuning process with intermediate model merging.
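
To make these rates concrete, here is a back-of-the-envelope sketch of the DRSR saving (illustrative only: the 5Hz/25Hz rates come from the overview above, while the 30s utterance and the quadratic attention-cost model are simplifying assumptions):

```python
# Illustrative DRSR arithmetic; not code from the Fun-Audio-Chat repository.
audio_seconds = 30                    # assumed utterance length
backbone_tokens = 5 * audio_seconds   # shared LLM backbone runs at 5Hz  -> 150 tokens
refined_tokens = 25 * audio_seconds   # Speech Refined Head emits 25Hz   -> 750 tokens

# If the large backbone consumed the 25Hz stream directly, its sequence would
# be 5x longer; under a roughly quadratic self-attention cost that is ~25x
# more backbone attention work. The overall ~50% GPU saving quoted above is
# smaller because the refined head and non-attention costs still scale with 25Hz.
ratio = (refined_tokens / backbone_tokens) ** 2
print(f"backbone: {backbone_tokens} tokens vs {refined_tokens} at 25Hz "
      f"(~{ratio:.0f}x attention cost avoided)")
```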
  ### Key Features
+ - **Dual-Resolution Speech Representations**: Efficient 5Hz processing for semantics, with 25Hz tokens for high-quality speech generation.
+ - **State-of-the-Art Performance**: Ranks at the top among similar-scale (8B) models on benchmarks such as OpenAudioBench, VoiceBench, and UltraEval-Audio.
+ - **Comprehensive Capabilities**: Supports spoken QA, audio understanding, speech function calling, instruction following, and voice empathy.
+ - **Multilingual Support**: Covers Chinese, English, French, Spanish, Japanese, Korean, Italian, Russian, and German.
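
The "intermediate merging" in Core-Cocktail training, mentioned in the Overview, can be pictured as plain parameter interpolation between checkpoints. A minimal sketch under that assumption (not the report's exact recipe; the 0.5 weight and two-checkpoint setup are illustrative):

```python
# Generic checkpoint-interpolation sketch; Core-Cocktail's actual merging
# procedure may weight or select parameters differently.
import torch

def merge_state_dicts(sd_a: dict, sd_b: dict, alpha: float = 0.5) -> dict:
    """Linearly interpolate two compatible state dicts: alpha*A + (1-alpha)*B."""
    return {name: alpha * sd_a[name] + (1.0 - alpha) * sd_b[name] for name in sd_a}

# Hypothetical usage: blend an audio-tuned checkpoint with the original text
# core to damp catastrophic forgetting before the next tuning stage.
# merged = merge_state_dicts(audio_model.state_dict(), text_core.state_dict())
```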

+ ## Quick Start

+ ### Installation

+ ```bash
+ git clone --recurse-submodules https://github.com/FunAudioLLM/Fun-Audio-Chat
+ cd Fun-Audio-Chat
+ pip install -r requirements.txt
  ```

+ ### Inference

+ The repository provides example scripts for Speech-to-Text (S2T) and Speech-to-Speech (S2S) tasks:

+ ```bash
+ export PYTHONPATH=`pwd`
+ # Run Speech-to-Text inference
+ python examples/infer_s2t.py
+ # Run Speech-to-Speech inference
+ python examples/infer_s2s.py
  ```
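
Since the card now sets `library_name: transformers`, the checkpoint can presumably also be loaded through the Transformers auto classes. A hypothetical sketch (the Hub repo id, auto classes, and remote-code support are assumptions; the `examples/` scripts above are the documented path):

```python
# Hypothetical loading sketch inferred from `library_name: transformers`;
# the repo id and auto classes are assumptions, not taken from the README.
from transformers import AutoModel, AutoProcessor

model_id = "FunAudioLLM/Fun-Audio-Chat-8B"  # assumed Hub repo id
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
```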

+ ## Citation

+ If you find this model useful, please cite the technical report and the related DrVoice paper:

+ ```bibtex
+ @article{funaudiochat2025,
+   title={Fun-Audio-Chat Technical Report},
+   author={Qian Chen and Luyao Cheng and Chong Deng and Xiangang Li and Jiaqing Liu and Chao-Hong Tan and Wen Wang and Junhao Xu and Jieping Ye and Qinglin Zhang and Qiquan Zhang and Jingren Zhou},
+   year={2025},
+   eprint={2512.20156},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL},
+   url={https://arxiv.org/abs/2512.20156},
  }

+ @misc{tan2025drvoiceparallelspeechtextvoice,
+   title={DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations},
+   author={Chao-Hong Tan and Qian Chen and Wen Wang and Chong Deng and Qinglin Zhang and Luyao Cheng and Hai Yu and Xin Zhang and Xiang Lv and Tianyu Zhao and Chong Zhang and Yukun Ma and Yafeng Chen and Hui Wang and Jiaqing Liu and Xiangang Li and Jieping Ye},
  year={2025},
+   eprint={2506.09349},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL},
+   url={https://arxiv.org/abs/2506.09349},
  }
+ ```