---
datasets:
- LEMAS-Project/LEMAS-Dataset-train
- LEMAS-Project/LEMAS-Dataset-eval
language:
- it
- pt
- es
- fr
- de
- vi
- id
- ru
- en
- zh
license: cc-by-nc-4.0
pipeline_tag: text-to-speech
tags:
- zero-shot
- multilingual
---

# LEMAS-TTS

LEMAS-TTS is a multilingual zero-shot text-to-speech system, presented in the paper [LEMAS: A 150K-Hour Large-scale Extensible Multilingual Audio Suite with Generative Speech Models](https://huggingface.co/papers/2601.04233).

- **Project Page:** [https://lemas-project.github.io/LEMAS-Project](https://lemas-project.github.io/LEMAS-Project)
- **Paper:** [https://arxiv.org/abs/2601.04233](https://arxiv.org/abs/2601.04233)
- **GitHub Repository:** [https://github.com/LEMAS-Project/LEMAS-TTS](https://github.com/LEMAS-Project/LEMAS-TTS)
- **Hugging Face Demo:** [https://huggingface.co/spaces/LEMAS-Project/LEMAS-TTS](https://huggingface.co/spaces/LEMAS-Project/LEMAS-TTS)

## Model Description

LEMAS-TTS is built on a non-autoregressive flow-matching framework. It leverages the scale and linguistic diversity of the LEMAS-Dataset to achieve robust zero-shot multilingual synthesis. The model incorporates accent-adversarial training and a CTC loss to mitigate cross-lingual accent issues, which improves synthesis stability and quality across languages.
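
To make the flow-matching idea concrete, here is a minimal sketch in plain Python. It is not the LEMAS-TTS implementation: it assumes the common straight-line interpolation path between noise and data, and the function names (`interpolate`, `target_velocity`, `flow_matching_loss`, `euler_sample`) are hypothetical. The accent-adversarial and CTC terms mentioned above are not shown.

```python
def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path; constant in t for this path choice."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between the predicted and target velocity at x_t."""
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


def euler_sample(predict_velocity, x0, steps=8):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps.

    Every element of x is updated in parallel at each step, which is what
    makes this style of generation non-autoregressive.
    """
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [a + dt * b for a, b in zip(x, v)]
    return x
```

In a real system `predict_velocity` would be a neural network conditioned on text and a reference prompt; in this sketch it is any callable taking `(x, t)` and returning a list of the same length as `x`.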

## Supported Languages

The model supports 10 major languages for zero-shot synthesis:

- Chinese (zh)
- English (en)
- Spanish (es)
- Russian (ru)
- French (fr)
- German (de)
- Italian (it)
- Portuguese (pt)
- Indonesian (id)
- Vietnamese (vi)
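
When wiring the model into an application, it can help to validate a requested language before synthesis. The snippet below is a small sketch of such a check; `SUPPORTED_LANGUAGES` and `normalize_language` are hypothetical names and not part of the LEMAS-TTS codebase. It assumes that region variants such as `pt-BR` should map to their primary language code.

```python
# Language codes and names copied from the list above.
SUPPORTED_LANGUAGES = {
    "zh": "Chinese",
    "en": "English",
    "es": "Spanish",
    "ru": "Russian",
    "fr": "French",
    "de": "German",
    "it": "Italian",
    "pt": "Portuguese",
    "id": "Indonesian",
    "vi": "Vietnamese",
}


def normalize_language(code: str) -> str:
    """Return the lowercase primary language code, or raise ValueError."""
    # Accept variants such as "PT-br" or "zh_CN" by keeping the primary subtag.
    primary = code.strip().lower().replace("_", "-").split("-")[0]
    if primary not in SUPPORTED_LANGUAGES:
        supported = ", ".join(sorted(SUPPORTED_LANGUAGES))
        raise ValueError(
            f"Unsupported language {code!r}; expected one of: {supported}"
        )
    return primary
```

For example, `normalize_language("pt_BR")` returns `"pt"`, while a code outside the list, such as `"ja"`, raises `ValueError`.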

## Training Data

LEMAS-TTS was trained on the [LEMAS-Dataset](https://huggingface.co/datasets/LEMAS-Project/LEMAS-Dataset-train), which is, to our knowledge, currently the largest open-source multilingual speech corpus with word-level timestamps. It covers over 150,000 hours across 10 major languages.

## Citation

```bibtex
@article{zhao2026lemas,
  title={LEMAS: A 150K-Hour Large-scale Extensible Multilingual Audio Suite with Generative Speech Models},
  author={Zhao, Zhiyuan and Lin, Lijian and Zhu, Ye and Xie, Kai and Liu, Yunfei and Li, Yu},
  journal={arXiv preprint arXiv:2601.04233},
  year={2026}
}
```