---
license: apache-2.0
language:
- multilingual
tags:
- text-to-speech
- tts
- audio
- voice-cloning
- emotion-control
- marco-voice
---

### License

This project is licensed under the Apache License, Version 2.0 (SPDX-License-Identifier: Apache-2.0). We used compliance-checking algorithms during training to ensure, to the best of our ability, that the trained model(s) are compliant. Due to the complexity of the data and the diversity of language model usage scenarios, we cannot guarantee that the model is completely free of copyright issues or improper content. If you believe anything infringes on your rights or that the model generates improper content, please contact us and we will promptly address the matter.

### 🗣️ Marco-Voice: Multilingual Emotion-Controllable Speech Synthesis

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

Marco-Voice is an open-source text-to-speech (TTS) framework that enables high-fidelity voice cloning and fine-grained emotional control in synthesized speech. By integrating advanced disentanglement techniques, Marco-Voice supports expressive speech generation with studio-grade quality. Try it out on Hugging Face Spaces or integrate it into your applications for controllable, natural-sounding synthetic voices.

### 📌 Model Details

* **Developed by:** Marco-Voice Team
* **Model type:** Neural Text-to-Speech (TTS) with speaker and emotion conditioning
* **Emotions supported:** 7 (including sad, surprise, happy, etc.)
* **Voice cloning:** Zero-shot / few-shot speaker adaptation
* **Architecture compatibility:** Works with CosyVoice backbones

### 🎯 Intended Uses & Capabilities

Marco-Voice is designed for applications requiring emotionally expressive and speaker-consistent synthetic speech, such as:

* Virtual assistants with personalized voices and emotional tone
* Audiobook narration with dynamic prosody
* Gaming and animation voice synthesis
* Accessibility tools (e.g., expressive screen readers)
* Cross-lingual voice dubbing with preserved speaker identity

### ⚠️ Limitations & Ethical Considerations

There is an inherent trade-off between timbre similarity and emotion control: stronger emotional expression can reduce how closely the cloned voice matches the reference speaker.
### 🚀 Getting Started

Set up your environment:

```bash
conda create -n marco python=3.8
conda activate marco
pip install -r requirements.txt
```

### 🎯 Inference Example

```python
from Models.marco_voice.cosyvoice_rodis.cli.cosyvoice import CosyVoice
from Models.marco_voice.cosyvoice_emosphere.cli.cosyvoice import CosyVoice as cosy_emosphere
from Models.marco_voice.cosyvoice_rodis.utils.file_utils import load_wav
import torch
import torchaudio

# Load the pre-trained models (discrete and continuous emotion control)
model = CosyVoice('pretrained_models/v4', load_jit=False, load_onnx=False, fp16=False)
model_emosphere = cosy_emosphere('pretrained_models/v5', load_jit=False, load_onnx=False, fp16=False)

# Map Chinese emotion labels to their English names
emo = {
    "伤心": "Sad",
    "恐惧": "Fearful",
    "快乐": "Happy",
    "惊喜": "Surprise",
    "生气": "Angry",
    "戏谑": "Jolliest"
}

# Load the 16 kHz reference speech used for voice cloning
prompt_speech_16k = load_wav("your_audio_path/exam.wav", 16000)
emo_type = "快乐"  # Happy

# Load the emotion embedding; each emotion is paired with the speaker
# it was recorded with
if emo_type in ["生气", "惊喜", "快乐"]:  # Angry / Surprise / Happy
    emotion_info = torch.load("assets/emotion_info.pt")["male005"][emo.get(emo_type)]
elif emo_type in ["伤心"]:  # Sad
    emotion_info = torch.load("assets/emotion_info.pt")["female005"][emo.get(emo_type)]
elif emo_type in ["恐惧"]:  # Fearful
    emotion_info = torch.load("assets/emotion_info.pt")["female003"][emo.get(emo_type)]
else:
    emotion_info = torch.load("assets/emotion_info.pt")["male005"][emo.get(emo_type)]

# 1. Discrete emotion control
for i, j in enumerate(model.synthesize(
    text="今天的天气真不错,我们出去散步吧!",  # "The weather is lovely today; let's go for a walk!"
    prompt_text="",
    reference_speech=prompt_speech_16k,
    emo_type=emo_type,
    emotion_embedding=emotion_info
)):
    torchaudio.save(f'emotional_{emo_type}.wav', j['tts_speech'], 22050)

# 2. Continuous emotion control (Emosphere)
for i, j in enumerate(model_emosphere.synthesize(
    text="今天的天气真不错,我们出去散步吧!",
    prompt_text="",
    reference_speech=prompt_speech_16k,
    emotion_embedding=emotion_info,
    low_level_emo_embedding=[0.1, 0.4, 0.5]
)):
    torchaudio.save(f'emosphere_{emo_type}.wav', j['tts_speech'], 22050)

# 3. Cross-lingual emotion transfer (English text, same reference and emotion)
for i, j in enumerate(model.synthesize(
    text="hello, i'm a speech synthesis model, how are you today?",
    prompt_text="",
    reference_speech=prompt_speech_16k,
    emo_type=emo_type,
    emotion_embedding=emotion_info
)):
    torchaudio.save(f'cross_lingual_{emo_type}.wav', j['tts_speech'], 22050)
```

You can also run inference using the provided scripts. For these, along with training recipes, documentation, and more, visit our GitHub repository:
👉 https://github.com/AIDC-AI/Marco-Voice
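### 🧪 More Usage Sketches

To audition every supported emotion in one pass, you can loop over the emotion mapping from the inference example. The sketch below is not an official example: it simply reuses `model`, `emo`, and `prompt_speech_16k` defined above, and it assumes `assets/emotion_info.pt` contains an embedding for each emotion under the same per-emotion speakers used earlier.

```python
# Minimal sketch (assumption: emotion_info.pt holds every emotion for
# the per-emotion speakers shown in the example above).
emotion_bank = torch.load("assets/emotion_info.pt")
speaker_for = {"伤心": "female005", "恐惧": "female003"}  # Sad, Fearful

for zh_label, en_label in emo.items():
    speaker = speaker_for.get(zh_label, "male005")  # default speaker as above
    embedding = emotion_bank[speaker][en_label]
    for out in model.synthesize(
        text="今天的天气真不错,我们出去散步吧!",
        prompt_text="",
        reference_speech=prompt_speech_16k,
        emo_type=zh_label,
        emotion_embedding=embedding
    ):
        torchaudio.save(f"sweep_{en_label}.wav", out["tts_speech"], 22050)
```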
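Similarly, the continuous (Emosphere) path can be probed by varying `low_level_emo_embedding`. The vectors below are illustrative placeholders, not calibrated settings; their interpretation is taken on faith from the `[0.1, 0.4, 0.5]` value in the example above.

```python
# Minimal sketch: render the same sentence under a few different
# continuous control vectors (illustrative values, not calibrated).
for idx, vec in enumerate([[0.1, 0.4, 0.5], [0.3, 0.3, 0.4], [0.6, 0.2, 0.2]]):
    for out in model_emosphere.synthesize(
        text="今天的天气真不错,我们出去散步吧!",
        prompt_text="",
        reference_speech=prompt_speech_16k,
        emotion_embedding=emotion_info,
        low_level_emo_embedding=vec
    ):
        torchaudio.save(f"emosphere_var_{idx}.wav", out["tts_speech"], 22050)
```

Listening to the resulting files side by side is a quick way to get a feel for how the continuous control trades emotional intensity against timbre similarity, per the limitation noted above.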