---
datasets:
- LEMAS-Project/LEMAS-Dataset-train
- LEMAS-Project/LEMAS-Dataset-eval
language:
- it
- pt
- es
- fr
- de
- vi
- id
- ru
- en
- zh
license: cc-by-nc-4.0
pipeline_tag: text-to-speech
tags:
- zero-shot
- multilingual
---

# LEMAS-TTS

LEMAS-TTS is a multilingual zero-shot text-to-speech system, presented in the paper [LEMAS: A 150K-Hour Large-scale Extensible Multilingual Audio Suite with Generative Speech Models](https://huggingface.co/papers/2601.04233).

- **Project Page:** [https://lemas-project.github.io/LEMAS-Project](https://lemas-project.github.io/LEMAS-Project)
- **Paper:** [https://arxiv.org/abs/2601.04233](https://arxiv.org/abs/2601.04233)
- **GitHub Repository:** [https://github.com/LEMAS-Project/LEMAS-TTS](https://github.com/LEMAS-Project/LEMAS-TTS)
- **Hugging Face Demo:** [https://huggingface.co/spaces/LEMAS-Project/LEMAS-TTS](https://huggingface.co/spaces/LEMAS-Project/LEMAS-TTS)

## Model Description

LEMAS-TTS is built on a non-autoregressive flow-matching framework. It draws on the scale and linguistic diversity of the LEMAS-Dataset to achieve robust zero-shot multilingual synthesis. Training combines an accent-adversarial objective with a CTC loss to reduce cross-lingual accent leakage, which improves synthesis stability and quality across languages.
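
For readers unfamiliar with flow matching, the following is a minimal, generic sketch of the idea, not the LEMAS-TTS implementation. It assumes a straight-line path between a noise sample and a data sample, which is one common choice; the model card does not state which path or solver LEMAS-TTS uses. All function names (`interpolate`, `target_velocity`, `flow_matching_loss`, `euler_sample`) are hypothetical, and plain Python lists stand in for acoustic features.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 to data x1 at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x1, rng):
    """Squared error between predicted and target velocity for one training sample.

    predict_velocity(x_t, t) is a stand-in for the neural network (assumption).
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # noise sample
    t = rng.random()                          # random time in [0, 1)
    xt = interpolate(x0, x1, t)
    v_target = target_velocity(x0, x1)
    v_pred = predict_velocity(xt, t)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(x1)


def euler_sample(predict_velocity, x0, steps=8):
    """Generate a sample by integrating dx/dt = v(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

Because every output frame is produced by the same fixed number of solver steps rather than token by token, sampling of this kind is non-autoregressive. A real system would replace `predict_velocity` with a trained network conditioned on text and a reference prompt.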

## Supported Languages

The model supports 10 major languages for zero-shot synthesis:
- Chinese (zh)
- English (en)
- Spanish (es)
- Russian (ru)
- French (fr)
- German (de)
- Italian (it)
- Portuguese (pt)
- Indonesian (id)
- Vietnamese (vi)
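
When wiring the model into an application, it can help to validate language codes before synthesis. The snippet below is a small illustrative helper built from the list above; `SUPPORTED_LANGUAGES` and `normalize_language` are hypothetical names and are not part of the LEMAS-TTS codebase.

```python
# Language codes and names copied from the list above.
SUPPORTED_LANGUAGES = {
    "zh": "Chinese",
    "en": "English",
    "es": "Spanish",
    "ru": "Russian",
    "fr": "French",
    "de": "German",
    "it": "Italian",
    "pt": "Portuguese",
    "id": "Indonesian",
    "vi": "Vietnamese",
}


def normalize_language(code):
    """Return the lowercase two-letter code, or raise ValueError if unsupported."""
    key = code.strip().lower()
    if key not in SUPPORTED_LANGUAGES:
        supported = ", ".join(sorted(SUPPORTED_LANGUAGES))
        raise ValueError(f"Unsupported language {code!r}; expected one of: {supported}")
    return key
```

For example, `normalize_language(" EN ")` returns `"en"`, while an unlisted code such as `"ja"` raises `ValueError`.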

## Training Data

LEMAS-TTS was trained on the [LEMAS-Dataset](https://huggingface.co/datasets/LEMAS-Project/LEMAS-Dataset-train), which is, to our knowledge, currently the largest open-source multilingual speech corpus with word-level timestamps. It covers over 150,000 hours across 10 major languages.

## Citation

```bibtex
@article{zhao2026lemas,
  title={LEMAS: A 150K-Hour Large-scale Extensible Multilingual Audio Suite with Generative Speech Models},
  author={Zhao, Zhiyuan and Lin, Lijian and Zhu, Ye and Xie, Kai and Liu, Yunfei and Li, Yu},
  journal={arXiv preprint arXiv:2601.04233},
  year={2026}
}
```