License

This project is licensed under the Apache License, Version 2.0 (SPDX-License-Identifier: Apache-2.0).

We applied compliance-checking algorithms during training to ensure, to the best of our ability, that the trained model(s) comply with applicable requirements. Due to the complexity of the data and the diversity of language-model usage scenarios, we cannot guarantee that the model is completely free of copyright issues or improper content. If you believe anything infringes your rights or generates improper content, please contact us and we will address the matter promptly.

πŸ—£οΈ Marco-Voice: Multilingual Emotion-Controllable Speech Synthesis

Marco-Voice is an open-source text-to-speech (TTS) framework that enables high-fidelity voice cloning and fine-grained emotional control over synthesized speech. By disentangling speaker identity from emotional expression, Marco-Voice supports expressive speech generation with studio-grade quality.

Try it out on Hugging Face Spaces or integrate it into your applications for controllable, natural-sounding synthetic voices.

πŸ“Œ Model Details

  • Developed by: Marco-Voice Team
  • Model type: Neural Text-to-Speech (TTS) with speaker and emotion conditioning
  β€’ Emotions supported: 7 (including sad, fearful, happy, surprised, angry, and playful)
  • Voice cloning: Zero-shot / few-shot speaker adaptation
  • Architecture compatibility: Works with CosyVoice backbones

🎯 Intended Uses & Capabilities

Marco-Voice is designed for applications requiring emotionally expressive and speaker-consistent synthetic speech, such as:

  • Virtual assistants with personalized voices and emotional tone
  • Audiobook narration with dynamic prosody
  • Gaming and animation voice synthesis
  • Accessibility tools (e.g., expressive screen readers)
  • Cross-lingual voice dubbing with preserved speaker identity

⚠️ Limitations & Ethical Considerations

There is an inherent trade-off between timbre similarity and emotion control: conditioning more strongly on the target emotion tends to pull the output away from the reference speaker's timbre, and vice versa. A hedged sketch of probing this trade-off follows the inference example below.

πŸš€ Getting Started

Set up your environment:

```bash
conda create -n marco python=3.8
conda activate marco
pip install -r requirements.txt
```

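The inference example below loads checkpoints from `pretrained_models/v4` and `pretrained_models/v5`. As a hedged sketch, one way to fetch released weights into that directory is via `huggingface_hub`; the exact snapshot layout (and whether it contains `v4`/`v5` subfolders) is an assumption, so adjust paths to match what you download:

```python
# Sketch: download model weights from the Hugging Face Hub.
# Assumption: the AIDC-AI/Marco-Voice repo hosts the checkpoints and its
# layout matches the pretrained_models/ paths used in the example below.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="AIDC-AI/Marco-Voice", local_dir="pretrained_models")
```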
🎯 Inference Example
```python
from Models.marco_voice.cosyvoice_rodis.cli.cosyvoice import CosyVoice
from Models.marco_voice.cosyvoice_emosphere.cli.cosyvoice import CosyVoice as cosy_emosphere
from Models.marco_voice.cosyvoice_rodis.utils.file_utils import load_wav
import torch
import torchaudio

# Load pre-trained models
model = CosyVoice('pretrained_models/v4', load_jit=False, load_onnx=False, fp16=False)
model_emosphere = cosy_emosphere('pretrained_models/v5', load_jit=False, load_onnx=False, fp16=False)

# Map Chinese emotion labels (the values passed as emo_type) to the
# English keys used in the emotion-embedding bank
emo = {
    "δΌ€εΏƒ": "Sad",
    "恐惧": "Fearful",
    "快乐": "Happy",
    "ζƒŠε–œ": "Surprise",
    "η”Ÿζ°”": "Angry",
    "ζˆθ°‘": "Jolliest"  # ζˆθ°‘ β‰ˆ "teasing/playful"
}

# Load reference speech for voice cloning (16 kHz)
prompt_speech_16k = load_wav("your_audio_path/exam.wav", 16000)
emo_type = "快乐"  # "happy"

# Load the emotion-embedding bank once, then pick the reference speaker
# whose embeddings are used for the requested emotion
emotion_bank = torch.load("assets/emotion_info.pt")
if emo_type in ["η”Ÿζ°”", "ζƒŠε–œ", "快乐"]:  # angry / surprised / happy
    emotion_info = emotion_bank["male005"][emo[emo_type]]
elif emo_type == "δΌ€εΏƒ":  # sad
    emotion_info = emotion_bank["female005"][emo[emo_type]]
elif emo_type == "恐惧":  # fearful
    emotion_info = emotion_bank["female003"][emo[emo_type]]
else:
    emotion_info = emotion_bank["male005"][emo[emo_type]]

# 1. Discrete emotion control
for i, output in enumerate(model.synthesize(
    text="δ»Šε€©ηš„ε€©ζ°”ηœŸδΈι”™οΌŒζˆ‘δ»¬ε‡ΊεŽ»ζ•£ζ­₯吧!",  # "The weather is lovely today; let's go for a walk!"
    prompt_text="",
    reference_speech=prompt_speech_16k,
    emo_type=emo_type,
    emotion_embedding=emotion_info
)):
    # Index the filename so multi-segment outputs are not overwritten
    torchaudio.save(f'emotional_{emo_type}_{i}.wav', output['tts_speech'], 22050)

# 2. Continuous emotion control (Emosphere)
for i, output in enumerate(model_emosphere.synthesize(
    text="δ»Šε€©ηš„ε€©ζ°”ηœŸδΈι”™οΌŒζˆ‘δ»¬ε‡ΊεŽ»ζ•£ζ­₯吧!",
    prompt_text="",
    reference_speech=prompt_speech_16k,
    emotion_embedding=emotion_info,
    low_level_emo_embedding=[0.1, 0.4, 0.5]  # 3-D vector for fine-grained emotion strength
)):
    torchaudio.save(f'emosphere_{emo_type}_{i}.wav', output['tts_speech'], 22050)

# 3. Cross-lingual emotion transfer (English text, same speaker and emotion)
for i, output in enumerate(model.synthesize(
    text="hello, i'm a speech synthesis model, how are you today?",
    prompt_text="",
    reference_speech=prompt_speech_16k,
    emo_type=emo_type,
    emotion_embedding=emotion_info
)):
    torchaudio.save(f'cross_lingual_{emo_type}_{i}.wav', output['tts_speech'], 22050)
```
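As noted under Limitations, there is a trade-off between emotion strength and timbre similarity. The sketch below probes that trade-off by attenuating the emotion embedding before synthesis; it reuses `model`, `emo_type`, `emotion_info`, and `prompt_speech_16k` from the example above, and the idea that the model responds meaningfully to a scaled embedding is our assumption, not documented behavior.

```python
# Hypothetical sweep: scale the emotion embedding down to favor timbre
# similarity, or keep it at full strength to favor emotional expressiveness.
for alpha in (0.25, 0.5, 1.0):  # 1.0 = full emotion strength (assumed semantics)
    for i, output in enumerate(model.synthesize(
        text="δ»Šε€©ηš„ε€©ζ°”ηœŸδΈι”™οΌŒζˆ‘δ»¬ε‡ΊεŽ»ζ•£ζ­₯吧!",
        prompt_text="",
        reference_speech=prompt_speech_16k,
        emo_type=emo_type,
        emotion_embedding=emotion_info * alpha,  # assumption: embedding scales gracefully
    )):
        torchaudio.save(f'tradeoff_{emo_type}_a{alpha}_{i}.wav', output['tts_speech'], 22050)
```

Listening across the `alpha` values gives a quick subjective read on where the emotion/timbre balance sits for a given reference speaker.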
You can also run inference using the provided scripts. For inference scripts, training recipes, documentation, and more, visit our GitHub repository: πŸ‘‰ https://github.com/AIDC-AI/Marco-Voice
