License

This project is licensed under the Apache License, Version 2.0 (SPDX-License-Identifier: Apache-2.0).

We applied compliance-checking algorithms during training to ensure, to the best of our ability, that the trained model(s) comply with applicable requirements. Due to the complexity of the data and the diversity of language-model usage scenarios, we cannot guarantee that the model is completely free of copyright issues or improper content. If you believe anything infringes your rights or generates improper content, please contact us and we will address the matter promptly.

πŸ—£οΈ Marco-Voice: Multilingual Emotion-Controllable Speech Synthesis

Marco-Voice is an open-source text-to-speech (TTS) framework that enables high-fidelity voice cloning and fine-grained emotional control over synthesized speech. By disentangling speaker identity from emotional expression, Marco-Voice supports expressive speech generation with studio-grade quality.

Try it out on Hugging Face Spaces or integrate it into your applications for controllable, natural-sounding synthetic voices.

πŸ“Œ Model Details

  • Developed by: Marco-Voice Team
  • Model type: Neural Text-to-Speech (TTS) with speaker and emotion conditioning
  β€’ Emotions supported: 7 (including sad, fearful, happy, surprised, angry, and playful)
  • Voice cloning: Zero-shot / few-shot speaker adaptation
  • Architecture compatibility: Works with CosyVoice backbones

🎯 Intended Uses & Capabilities

Marco-Voice is designed for applications requiring emotionally expressive and speaker-consistent synthetic speech, such as:

  • Virtual assistants with personalized voices and emotional tone
  • Audiobook narration with dynamic prosody
  • Gaming and animation voice synthesis
  • Accessibility tools (e.g., expressive screen readers)
  • Cross-lingual voice dubbing with preserved speaker identity

⚠️ Limitations & Ethical Considerations

There is an inherent trade-off between timbre similarity and emotion control: conditioning more strongly on the target emotion tends to pull the output away from the reference speaker's timbre, and vice versa. A hedged sketch of probing this trade-off follows the inference example below.

πŸš€ Getting Started

Set up your environment:

```bash
conda create -n marco python=3.8
conda activate marco
pip install -r requirements.txt
```

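The inference example below loads checkpoints from `pretrained_models/v4` and `pretrained_models/v5`. As a hedged sketch, one way to fetch released weights into that directory is via `huggingface_hub`; the exact snapshot layout (and whether it contains `v4`/`v5` subfolders) is an assumption, so adjust paths to match what you download:

```python
# Sketch: download model weights from the Hugging Face Hub.
# Assumption: the AIDC-AI/Marco-Voice repo hosts the checkpoints and its
# layout matches the pretrained_models/ paths used in the example below.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="AIDC-AI/Marco-Voice", local_dir="pretrained_models")
```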
🎯 Inference Example
```python
from Models.marco_voice.cosyvoice_rodis.cli.cosyvoice import CosyVoice
from Models.marco_voice.cosyvoice_emosphere.cli.cosyvoice import CosyVoice as cosy_emosphere
from Models.marco_voice.cosyvoice_rodis.utils.file_utils import load_wav
import torch
import torchaudio

# Load pre-trained models
model = CosyVoice('pretrained_models/v4', load_jit=False, load_onnx=False, fp16=False)
model_emosphere = cosy_emosphere('pretrained_models/v5', load_jit=False, load_onnx=False, fp16=False)

# Map Chinese emotion labels (the values passed as emo_type) to the
# English keys used in the emotion-embedding bank
emo = {
    "δΌ€εΏƒ": "Sad",
    "恐惧": "Fearful",
    "快乐": "Happy",
    "ζƒŠε–œ": "Surprise",
    "η”Ÿζ°”": "Angry",
    "ζˆθ°‘": "Jolliest"  # ζˆθ°‘ β‰ˆ "teasing/playful"
}

# Load reference speech for voice cloning (16 kHz)
prompt_speech_16k = load_wav("your_audio_path/exam.wav", 16000)
emo_type = "快乐"  # "happy"

# Load the emotion-embedding bank once, then pick the reference speaker
# whose embeddings are used for the requested emotion
emotion_bank = torch.load("assets/emotion_info.pt")
if emo_type in ["η”Ÿζ°”", "ζƒŠε–œ", "快乐"]:  # angry / surprised / happy
    emotion_info = emotion_bank["male005"][emo[emo_type]]
elif emo_type == "δΌ€εΏƒ":  # sad
    emotion_info = emotion_bank["female005"][emo[emo_type]]
elif emo_type == "恐惧":  # fearful
    emotion_info = emotion_bank["female003"][emo[emo_type]]
else:
    emotion_info = emotion_bank["male005"][emo[emo_type]]

# 1. Discrete emotion control
for i, output in enumerate(model.synthesize(
    text="δ»Šε€©ηš„ε€©ζ°”ηœŸδΈι”™οΌŒζˆ‘δ»¬ε‡ΊεŽ»ζ•£ζ­₯吧!",  # "The weather is lovely today; let's go for a walk!"
    prompt_text="",
    reference_speech=prompt_speech_16k,
    emo_type=emo_type,
    emotion_embedding=emotion_info
)):
    # Index the filename so multi-segment outputs are not overwritten
    torchaudio.save(f'emotional_{emo_type}_{i}.wav', output['tts_speech'], 22050)

# 2. Continuous emotion control (Emosphere)
for i, output in enumerate(model_emosphere.synthesize(
    text="δ»Šε€©ηš„ε€©ζ°”ηœŸδΈι”™οΌŒζˆ‘δ»¬ε‡ΊεŽ»ζ•£ζ­₯吧!",
    prompt_text="",
    reference_speech=prompt_speech_16k,
    emotion_embedding=emotion_info,
    low_level_emo_embedding=[0.1, 0.4, 0.5]  # 3-D vector for fine-grained emotion strength
)):
    torchaudio.save(f'emosphere_{emo_type}_{i}.wav', output['tts_speech'], 22050)

# 3. Cross-lingual emotion transfer (English text, same speaker and emotion)
for i, output in enumerate(model.synthesize(
    text="hello, i'm a speech synthesis model, how are you today?",
    prompt_text="",
    reference_speech=prompt_speech_16k,
    emo_type=emo_type,
    emotion_embedding=emotion_info
)):
    torchaudio.save(f'cross_lingual_{emo_type}_{i}.wav', output['tts_speech'], 22050)
```
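As noted under Limitations, there is a trade-off between emotion strength and timbre similarity. The sketch below probes that trade-off by attenuating the emotion embedding before synthesis; it reuses `model`, `emo_type`, `emotion_info`, and `prompt_speech_16k` from the example above, and the idea that the model responds meaningfully to a scaled embedding is our assumption, not documented behavior.

```python
# Hypothetical sweep: scale the emotion embedding down to favor timbre
# similarity, or keep it at full strength to favor emotional expressiveness.
for alpha in (0.25, 0.5, 1.0):  # 1.0 = full emotion strength (assumed semantics)
    for i, output in enumerate(model.synthesize(
        text="δ»Šε€©ηš„ε€©ζ°”ηœŸδΈι”™οΌŒζˆ‘δ»¬ε‡ΊεŽ»ζ•£ζ­₯吧!",
        prompt_text="",
        reference_speech=prompt_speech_16k,
        emo_type=emo_type,
        emotion_embedding=emotion_info * alpha,  # assumption: embedding scales gracefully
    )):
        torchaudio.save(f'tradeoff_{emo_type}_a{alpha}_{i}.wav', output['tts_speech'], 22050)
```

Listening across the `alpha` values gives a quick subjective read on where the emotion/timbre balance sits for a given reference speaker.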
You can also run inference using the provided scripts. For inference scripts, training recipes, documentation, and more, visit our GitHub repository: πŸ‘‰ https://github.com/AIDC-AI/Marco-Voice
