Update model card for Fun-Audio-Chat-8B
This PR updates the model card to reflect the Fun-Audio-Chat-8B model as described in the [Fun-Audio-Chat Technical Report](https://huggingface.co/papers/2512.20156).
Key changes include:
- Added `library_name: transformers` to the metadata (verified via `config.json`).
- Updated the content to describe the Fun-Audio-Chat architecture, including Dual-Resolution Speech Representations (DRSR) and Core-Cocktail Training.
- Added links to the official GitHub repository and project page.
- Provided quick start instructions based on the GitHub README.
- Updated the citation section to include the latest technical report.
README.md
CHANGED

@@ -1,5 +1,4 @@
 ---
-license: apache-2.0
 language:
 - zh
 - en
@@ -10,226 +9,71 @@ language:
 - it
 - ru
 - de
+license: apache-2.0
 pipeline_tag: text-to-speech
+library_name: transformers
 ---

-## 👉🏻 CosyVoice 👈🏻
-
-**Fun-CosyVoice 3.0** is an advanced text-to-speech (TTS) system based on large language models (LLM), surpassing its predecessor (CosyVoice 2.0) in content consistency, speaker similarity, and prosody naturalness. It is designed for zero-shot multilingual speech synthesis in the wild.
-
-### Key Features
-- **
-- **
-- **
-- **
-- **Bi-Streaming**: Support both text-in streaming and audio-out streaming, and achieves latency as low as 150ms while maintaining high-quality audio output.
-- **Instruct Support**: Supports various instructions such as languages, dialects, emotions, speed, volume, etc.
-
-## Roadmap
-
-- [x] 2025/12
-    - [x] release Fun-CosyVoice3-0.5B-2512 base model, rl model and its training/inference script
-    - [x] release Fun-CosyVoice3-0.5B modelscope gradio space
-- [x] 2025/08
-    - [x] Thanks to the contribution from NVIDIA Yuekai Zhang, add triton trtllm runtime support and cosyvoice2 grpo training support
-- [x] 2025/07
-    - [x] release Fun-CosyVoice 3.0 eval set
-- [x] 2025/05
-    - [x] add CosyVoice2-0.5B vllm support
-- [x] 2024/12
-    - [x] 25hz CosyVoice2-0.5B released
-- [x] 2024/09
-    - [x] 25hz CosyVoice-300M base model
-    - [x] 25hz CosyVoice-300M voice conversion function
-- [x] 2024/08
-    - [x] Repetition Aware Sampling (RAS) inference for llm stability
-    - [x] Streaming inference mode support, including kv cache and sdpa for rtf optimization
-- [x] 2024/07
-    - [x] Flow matching training support
-    - [x] WeTextProcessing support when ttsfrd is not available
-    - [x] Fastapi server and client
-
-## Evaluation
-
-| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
-| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 | - | - |
-| Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 | 7.59 | 77.6 |
-| MiniMax-Speech | ❌ | - | 0.83 | 78.3 | 1.65 | 69.2 | - | - |
-| F5-TTS | ✅ | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 | 8.67 | 71.3 |
-| Spark TTS | ✅ | 0.5B | 1.2 | 66.0 | 1.98 | 57.3 | - | - |
-| CosyVoice2 | ✅ | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 | 6.83 | 72.4 |
-| FireRedTTS2 | ✅ | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 | - | - |
-| Index-TTS2 | ✅ | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 | 7.12 | 75.5 |
-| VibeVoice-1.5B | ✅ | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 | - | - |
-| VibeVoice-Realtime | ✅ | 0.5B | - | - | 2.05 | 63.3 | - | - |
-| HiggsAudio-v2 | ✅ | 3B | 1.50 | 74.0 | 2.44 | 67.7 | - | - |
-| VoxCPM | ✅ | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 | 8.87 | 73.0 |
-| GLM-TTS | ✅ | 1.5B | 1.03 | 76.1 | - | - | - | - |
-| GLM-TTS RL | ✅ | 1.5B | 0.89 | 76.4 | - | - | - | - |
-| Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 | 6.71 | 75.8 |
-| Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 | 5.44 | 75.0 |
-
-- Clone the repo
-``` sh
-git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
-# If you failed to clone the submodule due to network failures, please run the following command until success
-cd CosyVoice
-git submodule update --init --recursive
-```
-
-- Install Conda: please see https://docs.conda.io/en/latest/miniconda.html
-- Create Conda env:
-
-``` sh
-conda create -n cosyvoice -y python=3.10
-conda activate cosyvoice
-pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
-
-# If you encounter sox compatibility issues
-# ubuntu
-sudo apt-get install sox libsox-dev
-# centos
-sudo yum install sox sox-devel
-```
-
-### Model download
-
-``` python
-from huggingface_hub import snapshot_download
-snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
-snapshot_download('FunAudioLLM/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
-```
-
-Notice that this step is not necessary. If you do not install `ttsfrd` package, we will use wetext by default.
-
-``` sh
-cd pretrained_models/CosyVoice-ttsfrd/
-unzip resource.zip -d .
-pip install ttsfrd_dependency-0.1-py3-none-any.whl
-pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl
-```
-
-``` python
-from cosyvoice.cli.cosyvoice import AutoModel
-import torchaudio
-
-""" CosyVoice3 Usage, check https://funaudiollm.github.io/cosyvoice3/ for more details
-"""
-cosyvoice = AutoModel(model_dir='pretrained_models/Fun-CosyVoice3-0.5B')
-# en zero_shot usage
-for i, j in enumerate(cosyvoice.inference_zero_shot('CosyVoice is undergoing a comprehensive upgrade, providing more accurate, stable, faster, and better voice generation capabilities.', 'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
-                                                    './asset/zero_shot_prompt.wav', stream=False)):
-    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
-# zh zero_shot usage
-for i, j in enumerate(cosyvoice.inference_zero_shot('八百标兵奔北坡,北坡炮兵并排跑,炮兵怕把标兵碰,标兵怕碰炮兵炮。', 'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
-                                                    './asset/zero_shot_prompt.wav', stream=False)):
-    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
-
-# fine grained control, for supported control, check cosyvoice/tokenizer/tokenizer.py#L280
-for i, j in enumerate(cosyvoice.inference_cross_lingual('You are a helpful assistant.<|endofprompt|>[breath]因为他们那一辈人[breath]在乡里面住的要习惯一点,[breath]邻居都很活络,[breath]嗯,都很熟悉。[breath]',
-                                                        './asset/zero_shot_prompt.wav', stream=False)):
-    torchaudio.save('fine_grained_control_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
-
-# instruct usage, for supported control, check cosyvoice/utils/common.py#L28
-for i, j in enumerate(cosyvoice.inference_instruct2('好少咯,一般系放嗰啲国庆啊,中秋嗰啲可能会咯。', 'You are a helpful assistant. 请用广东话表达。<|endofprompt|>',
-                                                    './asset/zero_shot_prompt.wav', stream=False)):
-    torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
-for i, j in enumerate(cosyvoice.inference_instruct2('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', 'You are a helpful assistant. 请用尽可能快地语速说一句话。<|endofprompt|>',
-                                                    './asset/zero_shot_prompt.wav', stream=False)):
-    torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
-
-# hotfix usage
-for i, j in enumerate(cosyvoice.inference_zero_shot('高管也通过电话、短信、微信等方式对报道[j][ǐ]予好评。', 'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
-                                                    './asset/zero_shot_prompt.wav', stream=False)):
-    torchaudio.save('hotfix_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
-```
-
-## Discussion & Communication
-
-You can directly discuss on [Github Issues](https://github.com/FunAudioLLM/CosyVoice/issues).
-
-You can also scan the QR code to join our official Dingding chat group.
-
-``` bibtex
-@article{du2024cosyvoice,
-  title={Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens},
-  author={Du, Zhihao and Chen, Qian and Zhang, Shiliang and Hu, Kai and Lu, Heng and Yang, Yexin and Hu, Hangrui and Zheng, Siqi and Gu, Yue and Ma, Ziyang and others},
-  journal={arXiv preprint arXiv:2407.05407},
-  year={2024}
-}
-
-@article{du2024cosyvoice2,
-  title={Cosyvoice 2: Scalable streaming speech synthesis with large language models},
-  author={Du, Zhihao and Wang, Yuxuan and Chen, Qian and Shi, Xian and Lv, Xiang and Zhao, Tianyu and Gao, Zhifu and Yang, Yexin and Gao, Changfeng and Wang, Hui and others},
-  journal={arXiv preprint arXiv:2412.10117},
-  year={2024}
-}
-
-@article{du2025cosyvoice,
-  title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
-  author={Du, Zhihao and Gao, Changfeng and Wang, Yuxuan and Yu, Fan and Zhao, Tianyu and Wang, Hao and Lv, Xiang and Wang, Hui and Shi, Xian and An, Keyu and others},
-  journal={arXiv preprint arXiv:2505.17589},
-  year={2025}
-}
-
-@
-  title={
-  author={
-  booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
-  pages={1--2},
-  year={2025},
-}
-```
-
-## Disclaimer
-The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.
+
+# Fun-Audio-Chat-8B
+
+**Fun-Audio-Chat** is a Large Audio Language Model (LALM) built for natural, low-latency voice interactions. It addresses challenges like temporal resolution mismatch and catastrophic forgetting through two key innovations: **Dual-Resolution Speech Representations (DRSR)** and **Core-Cocktail Training**.
+
+[[Technical Report](https://huggingface.co/papers/2512.20156)] [[GitHub](https://github.com/FunAudioLLM/Fun-Audio-Chat)] [[Project Page](https://funaudiollm.github.io/funaudiochat)]
+
+## Overview
+
+Fun-Audio-Chat introduces a dual-stream approach to balance efficiency and quality. The shared LLM backbone processes audio at an efficient 5Hz frame rate, while a Speech Refined Head generates high-quality tokens at 25Hz, reducing GPU computation by approximately 50% compared to standard models. To prevent the loss of text-based reasoning capabilities, the model employs Core-Cocktail training, a two-stage fine-tuning process with intermediate merging.
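For context on the efficiency claim above, here is a back-of-the-envelope sketch (not from the report) comparing sequence lengths at the two frame rates; the 12.5Hz single-resolution baseline and the quadratic attention-cost approximation are illustrative assumptions only:

```python
# Back-of-the-envelope illustration of the dual-resolution trade-off.
# The 5Hz backbone rate and 25Hz refiner rate come from the card above;
# the 12.5Hz comparison point and the quadratic attention-cost
# approximation are assumptions for illustration, not report numbers.

AUDIO_SECONDS = 60.0

def num_tokens(frame_rate_hz: float, seconds: float = AUDIO_SECONDS) -> int:
    """Sequence length produced by a given audio frame rate."""
    return int(frame_rate_hz * seconds)

backbone = num_tokens(5.0)    # shared LLM backbone: 300 tokens per minute
refiner = num_tokens(25.0)    # Speech Refined Head: 1500 tokens per minute
baseline = num_tokens(12.5)   # hypothetical single-resolution model: 750

# Self-attention cost grows roughly quadratically with sequence length,
# so shrinking the backbone's token count saves more than the token ratio.
print(f"backbone: {backbone} tokens, baseline: {baseline} tokens")
print(f"approx. backbone attention cost vs baseline: {(backbone / baseline) ** 2:.2f}x")
```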
+
+### Key Features
+- **Dual-Resolution Speech Representations**: Efficient 5Hz processing for semantics and 25Hz for high-quality generation.
+- **State-of-the-Art Performance**: Ranks top among similar-scale models (8B) on benchmarks like OpenAudioBench, VoiceBench, and UltraEval-Audio.
+- **Comprehensive Capabilities**: Supports spoken QA, audio understanding, speech function calling, instruction-following, and voice empathy.
+- **Multilingual Support**: Covers Chinese, English, French, Spanish, Japanese, Korean, Italian, Russian, and German.
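The Core-Cocktail step mentioned in the Overview is described only at a high level in this card; below is a toy illustration of what "intermediate merging" between fine-tuning stages can look like, assuming simple linear interpolation of checkpoint weights (a common model-merging baseline; the report's actual recipe may differ):

```python
# Toy sketch of "intermediate merging" between fine-tuning stages.
# Assumption: merging is modeled as plain linear interpolation of
# parameter tensors; the actual Core-Cocktail procedure may differ.
import torch

def merge_state_dicts(sd_a: dict, sd_b: dict, alpha: float = 0.5) -> dict:
    """Return alpha * sd_a + (1 - alpha) * sd_b, key by key."""
    return {k: alpha * sd_a[k] + (1.0 - alpha) * sd_b[k] for k in sd_a}

# Dummy tensors standing in for two checkpoints:
stage1 = {"w": torch.ones(2, 2)}     # e.g. audio-tuned weights
backbone = {"w": torch.zeros(2, 2)}  # e.g. original text backbone
merged = merge_state_dicts(stage1, backbone, alpha=0.7)
print(merged["w"])  # 0.7 everywhere: stage 1 pulled back toward the backbone
```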
+
+## Quick Start
+
+### Installation
+
+```bash
+git clone --recurse-submodules https://github.com/FunAudioLLM/Fun-Audio-Chat
+cd Fun-Audio-Chat
+pip install -r requirements.txt
+```
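The Quick Start does not show a weights-download step; here is a minimal sketch using `huggingface_hub`, where the `FunAudioLLM/Fun-Audio-Chat-8B` repo id and the `pretrained_models/` layout are assumptions rather than facts from this PR:

```python
# Minimal sketch: fetch the checkpoint before running the example scripts.
# Assumptions: the repo id "FunAudioLLM/Fun-Audio-Chat-8B" and the
# pretrained_models/ directory layout are guesses, not from this card.
from huggingface_hub import snapshot_download

snapshot_download(
    "FunAudioLLM/Fun-Audio-Chat-8B",
    local_dir="pretrained_models/Fun-Audio-Chat-8B",
)
```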
+
+### Inference
+
+The repository provides example scripts for Speech-to-Text (S2T) and Speech-to-Speech (S2S) tasks:
+
+```bash
+export PYTHONPATH=`pwd`
+# Run Speech-to-Text inference
+python examples/infer_s2t.py
+# Run Speech-to-Speech inference
+python examples/infer_s2s.py
+```
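Since the metadata now declares `library_name: transformers`, loading through the `transformers` custom-code path should also work; the snippet below is a sketch under that assumption, and the exact classes and generation interface may differ from what the repository actually ships:

```python
# Sketch only: assumes the checkpoint ships custom modeling code that
# transformers can load via trust_remote_code. The repo id, returned
# class, and generation API are assumptions; consult the GitHub
# examples for the supported interface.
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "FunAudioLLM/Fun-Audio-Chat-8B",  # assumed repo id
    trust_remote_code=True,
)
print(type(model))  # inspect what the remote code actually provides
```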
+
+## Citation
+
+If you find this model useful, please cite the technical report:
+
+```bibtex
+@misc{funaudiochat2025,
+  title={Fun-Audio-Chat Technical Report},
+  author={Qian Chen and Luyao Cheng and Chong Deng and Xiangang Li and Jiaqing Liu and Chao-Hong Tan and Wen Wang and Junhao Xu and Jieping Ye and Qinglin Zhang and Qiquan Zhang and Jingren Zhou},
+  year={2025},
+  eprint={2512.20156},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL},
+  url={https://arxiv.org/abs/2512.20156},
+}
+
+@misc{tan2025drvoiceparallelspeechtextvoice,
+  title={DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations},
+  author={Chao-Hong Tan and Qian Chen and Wen Wang and Chong Deng and Qinglin Zhang and Luyao Cheng and Hai Yu and Xin Zhang and Xiang Lv and Tianyu Zhao and Chong Zhang and Yukun Ma and Yafeng Chen and Hui Wang and Jiaqing Liu and Xiangang Li and Jieping Ye},
+  year={2025},
+  eprint={2506.09349},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL},
+  url={https://arxiv.org/abs/2506.09349},
+}
+```