---
license: apache-2.0
language:
- multilingual
tags:
- text-to-speech
- tts
- audio
- voice-cloning
- emotion-control
- marco-voice
---

### License

This project is licensed under the Apache License, Version 2.0 (SPDX-License-Identifier: Apache-2.0). We used compliance-checking algorithms during training to ensure, to the best of our ability, that the trained model(s) are compliant. Due to the complexity of the data and the diversity of language model usage scenarios, we cannot guarantee that the model is completely free of copyright issues or improper content. If you believe anything infringes on your rights or that the model generates improper content, please contact us and we will promptly address the matter.

### 🗣️ Marco-Voice: Multilingual Emotion-Controllable Speech Synthesis

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

Marco-Voice is an open-source text-to-speech (TTS) framework that enables high-fidelity voice cloning and fine-grained emotional control in synthesized speech. By integrating advanced disentanglement techniques, Marco-Voice supports expressive speech generation with studio-grade quality. Try it out on Hugging Face Spaces or integrate it into your applications for controllable, natural-sounding synthetic voices.

### 📌 Model Details

* **Developed by:** Marco-Voice Team
* **Model type:** Neural Text-to-Speech (TTS) with speaker and emotion conditioning
* **Emotions supported:** 7 (including sad, surprise, happy, etc.)
* **Voice cloning:** Zero-shot / few-shot speaker adaptation
* **Architecture compatibility:** Works with CosyVoice backbones

### 🎯 Intended Uses & Capabilities

Marco-Voice is designed for applications requiring emotionally expressive and speaker-consistent synthetic speech, such as:

* Virtual assistants with personalized voices and emotional tone
* Audiobook narration with dynamic prosody
* Gaming and animation voice synthesis
* Accessibility tools (e.g., expressive screen readers)
* Cross-lingual voice dubbing with preserved speaker identity

### ⚠️ Limitations & Ethical Considerations

There is an inherent trade-off between timbre similarity and emotion control: stronger emotional expression can reduce how closely the cloned voice matches the reference speaker.
### 🚀 Getting Started

Set up your environment:

```bash
conda create -n marco python=3.8
conda activate marco
pip install -r requirements.txt
```

### 🎯 Inference Example

```python
from Models.marco_voice.cosyvoice_rodis.cli.cosyvoice import CosyVoice
from Models.marco_voice.cosyvoice_emosphere.cli.cosyvoice import CosyVoice as cosy_emosphere
from Models.marco_voice.cosyvoice_rodis.utils.file_utils import load_wav
import torch
import torchaudio

# Load the pre-trained models (discrete and continuous emotion control)
model = CosyVoice('pretrained_models/v4', load_jit=False, load_onnx=False, fp16=False)
model_emosphere = cosy_emosphere('pretrained_models/v5', load_jit=False, load_onnx=False, fp16=False)

# Map Chinese emotion labels to their English names
emo = {
    "伤心": "Sad",
    "恐惧": "Fearful",
    "快乐": "Happy",
    "惊喜": "Surprise",
    "生气": "Angry",
    "戏谑": "Jolliest"
}

# Load the 16 kHz reference speech used for voice cloning
prompt_speech_16k = load_wav("your_audio_path/exam.wav", 16000)
emo_type = "快乐"  # Happy

# Load the emotion embedding; each emotion is paired with the speaker
# it was recorded with
if emo_type in ["生气", "惊喜", "快乐"]:  # Angry / Surprise / Happy
    emotion_info = torch.load("assets/emotion_info.pt")["male005"][emo.get(emo_type)]
elif emo_type in ["伤心"]:  # Sad
    emotion_info = torch.load("assets/emotion_info.pt")["female005"][emo.get(emo_type)]
elif emo_type in ["恐惧"]:  # Fearful
    emotion_info = torch.load("assets/emotion_info.pt")["female003"][emo.get(emo_type)]
else:
    emotion_info = torch.load("assets/emotion_info.pt")["male005"][emo.get(emo_type)]

# 1. Discrete emotion control
for i, j in enumerate(model.synthesize(
    text="今天的天气真不错,我们出去散步吧!",  # "The weather is lovely today; let's go for a walk!"
    prompt_text="",
    reference_speech=prompt_speech_16k,
    emo_type=emo_type,
    emotion_embedding=emotion_info
)):
    torchaudio.save(f'emotional_{emo_type}.wav', j['tts_speech'], 22050)

# 2. Continuous emotion control (Emosphere)
for i, j in enumerate(model_emosphere.synthesize(
    text="今天的天气真不错,我们出去散步吧!",
    prompt_text="",
    reference_speech=prompt_speech_16k,
    emotion_embedding=emotion_info,
    low_level_emo_embedding=[0.1, 0.4, 0.5]
)):
    torchaudio.save(f'emosphere_{emo_type}.wav', j['tts_speech'], 22050)

# 3. Cross-lingual emotion transfer (English text, same reference and emotion)
for i, j in enumerate(model.synthesize(
    text="hello, i'm a speech synthesis model, how are you today?",
    prompt_text="",
    reference_speech=prompt_speech_16k,
    emo_type=emo_type,
    emotion_embedding=emotion_info
)):
    torchaudio.save(f'cross_lingual_{emo_type}.wav', j['tts_speech'], 22050)
```

You can also run inference using the provided scripts. For these, along with training recipes, documentation, and more, visit our GitHub repository:
👉 https://github.com/AIDC-AI/Marco-Voice
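### 🧪 More Usage Sketches

To audition every supported emotion in one pass, you can loop over the emotion mapping from the inference example. The sketch below is not an official example: it simply reuses `model`, `emo`, and `prompt_speech_16k` defined above, and it assumes `assets/emotion_info.pt` contains an embedding for each emotion under the same per-emotion speakers used earlier.

```python
# Minimal sketch (assumption: emotion_info.pt holds every emotion for
# the per-emotion speakers shown in the example above).
emotion_bank = torch.load("assets/emotion_info.pt")
speaker_for = {"伤心": "female005", "恐惧": "female003"}  # Sad, Fearful

for zh_label, en_label in emo.items():
    speaker = speaker_for.get(zh_label, "male005")  # default speaker as above
    embedding = emotion_bank[speaker][en_label]
    for out in model.synthesize(
        text="今天的天气真不错,我们出去散步吧!",
        prompt_text="",
        reference_speech=prompt_speech_16k,
        emo_type=zh_label,
        emotion_embedding=embedding
    ):
        torchaudio.save(f"sweep_{en_label}.wav", out["tts_speech"], 22050)
```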
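Similarly, the continuous (Emosphere) path can be probed by varying `low_level_emo_embedding`. The vectors below are illustrative placeholders, not calibrated settings; their interpretation is taken on faith from the `[0.1, 0.4, 0.5]` value in the example above.

```python
# Minimal sketch: render the same sentence under a few different
# continuous control vectors (illustrative values, not calibrated).
for idx, vec in enumerate([[0.1, 0.4, 0.5], [0.3, 0.3, 0.4], [0.6, 0.2, 0.2]]):
    for out in model_emosphere.synthesize(
        text="今天的天气真不错,我们出去散步吧!",
        prompt_text="",
        reference_speech=prompt_speech_16k,
        emotion_embedding=emotion_info,
        low_level_emo_embedding=vec
    ):
        torchaudio.save(f"emosphere_var_{idx}.wav", out["tts_speech"], 22050)
```

Listening to the resulting files side by side is a quick way to get a feel for how the continuous control trades emotional intensity against timbre similarity, per the limitation noted above.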