---
datasets:
- LEMAS-Project/LEMAS-Dataset-train
- LEMAS-Project/LEMAS-Dataset-eval
language:
- it
- pt
- es
- fr
- de
- vi
- id
- ru
- en
- zh
license: cc-by-nc-4.0
pipeline_tag: text-to-speech
tags:
- zero-shot
- multilingual
---

# LEMAS-TTS

LEMAS-TTS is a multilingual zero-shot text-to-speech system, presented in the paper [LEMAS: A 150K-Hour Large-scale Extensible Multilingual Audio Suite with Generative Speech Models](https://huggingface.co/papers/2601.04233).

- **Project Page:** [https://lemas-project.github.io/LEMAS-Project](https://lemas-project.github.io/LEMAS-Project)
- **Paper:** [https://arxiv.org/abs/2601.04233](https://arxiv.org/abs/2601.04233)
- **GitHub Repository:** [https://github.com/LEMAS-Project/LEMAS-TTS](https://github.com/LEMAS-Project/LEMAS-TTS)
- **Hugging Face Demo:** [https://huggingface.co/spaces/LEMAS-Project/LEMAS-TTS](https://huggingface.co/spaces/LEMAS-Project/LEMAS-TTS)

## Model Description

LEMAS-TTS is built upon a non-autoregressive flow-matching framework. It leverages the massive scale and linguistic diversity of the LEMAS-Dataset to achieve robust zero-shot multilingual synthesis. The model incorporates accent-adversarial training and CTC loss to mitigate cross-lingual accent issues, enhancing synthesis stability and quality across diverse languages.

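
To make the term "flow matching" concrete, here is a minimal pure-Python sketch of the idea. It assumes the commonly used straight-line path between a noise sample and a data sample; the model card does not state which path or solver LEMAS-TTS actually uses. All function names are hypothetical and are not part of the LEMAS-TTS code base.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path; it is constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between the predicted and target velocity at time t."""
    xt = interpolate(x0, x1, t)
    pred = velocity_fn(xt, t)
    target = target_velocity(x0, x1)
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(target)


def euler_sample(velocity_fn, x0, num_steps=8):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        v = velocity_fn(x, step * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    x1 = [0.5, -1.0, 2.0]                    # stand-in for one frame of acoustic features (assumption)
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise sample

    # Oracle velocity field for this single (x0, x1) pair; a trained network
    # would approximate this field from data instead.
    def oracle(x, t):
        return target_velocity(x0, x1)

    print("loss with oracle field:", flow_matching_loss(oracle, x0, x1, t=0.3))
    print("sampled:", euler_sample(oracle, x0))
```

In a real system the velocity function would be a neural network conditioned on text and a reference speech prompt, and all output frames would be generated in parallel, which is what makes the approach non-autoregressive.
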
## Supported Languages

The model supports 10 major languages for zero-shot synthesis:

- Chinese (zh)
- English (en)
- Spanish (es)
- Russian (ru)
- French (fr)
- German (de)
- Italian (it)
- Portuguese (pt)
- Indonesian (id)
- Vietnamese (vi)

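
When wiring the model into an application, it can help to validate the requested language before synthesis. The sketch below encodes the list above as a lookup table; `SUPPORTED_LANGUAGES` and `normalize_language` are hypothetical helpers written for illustration, not part of the LEMAS-TTS API.

```python
# Language codes and names copied from the list above.
SUPPORTED_LANGUAGES = {
    "zh": "Chinese",
    "en": "English",
    "es": "Spanish",
    "ru": "Russian",
    "fr": "French",
    "de": "German",
    "it": "Italian",
    "pt": "Portuguese",
    "id": "Indonesian",
    "vi": "Vietnamese",
}


def normalize_language(code: str) -> str:
    """Return the two-letter code if supported, else raise ValueError.

    Region suffixes such as "en-US" or "pt_BR" are reduced to the
    primary subtag before the lookup.
    """
    key = code.strip().lower().replace("_", "-").split("-")[0]
    if key not in SUPPORTED_LANGUAGES:
        supported = ", ".join(sorted(SUPPORTED_LANGUAGES))
        raise ValueError(f"Unsupported language {code!r}; expected one of: {supported}")
    return key


if __name__ == "__main__":
    print(normalize_language("pt_BR"))
    print(SUPPORTED_LANGUAGES[normalize_language("VI")])
```
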
## Training Data

LEMAS-TTS was trained on the [LEMAS-Dataset](https://huggingface.co/datasets/LEMAS-Project/LEMAS-Dataset-train), which is, to our knowledge, currently the largest open-source multilingual speech corpus with word-level timestamps. It covers over 150,000 hours across 10 major languages.

## Citation

```bibtex
@article{zhao2026lemas,
  title={LEMAS: A 150K-Hour Large-scale Extensible Multilingual Audio Suite with Generative Speech Models},
  author={Zhao, Zhiyuan and Lin, Lijian and Zhu, Ye and Xie, Kai and Liu, Yunfei and Li, Yu},
  journal={arXiv preprint arXiv:2601.04233},
  year={2026}
}
```