---
license: cc-by-nc-sa-4.0
language:
- en
- ar
- zh
- nl
- fr
- de
- it
- ja
- ko
- lt
- ru
- es
- pt
- be
- bn
- ka
- hu
- lv
- fa
- pl
- sw
- ta
- uk
pipeline_tag: text-to-speech
library_name: outetts
---
<div class="p-4 bg-gray-50 dark:bg-gray-800 rounded-lg shadow-sm mb-12">
  <div class="text-center mb-4">
    <h2 class="text-xl font-light text-gray-900 dark:text-white tracking-tight mt-0 mb-0">OuteAI</h2>
    <div class="flex justify-center gap-6 mt-4">
      <a href="https://www.outeai.com/" target="_blank" class="flex items-center gap-1 text-gray-700 dark:text-gray-300 text-m font-medium hover:text-gray-900 dark:hover:text-white transition-colors underline">
        <svg width="18" height="18" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
          <circle cx="12" cy="12" r="10"></circle>
          <path d="M2 12h20M12 2a15.3 15.3 0 0 1 4 10 15.3 15.3 0 0 1-4 10 15.3 15.3 0 0 1-4-10 15.3 15.3 0 0 1 4-10z"></path>
        </svg>
        outeai.com
      </a>
      <a href="https://discord.gg/vyBM87kAmf" target="_blank" class="flex items-center gap-1 text-gray-700 dark:text-gray-300 text-m font-medium hover:text-gray-900 dark:hover:text-white transition-colors underline">
        <svg width="18" height="18" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
          <path d="M21 11.5a8.38 8.38 0 0 1-.9 3.8 8.5 8.5 0 0 1-7.6 4.7 8.38 8.38 0 0 1-3.8-.9L3 21l1.9-5.7a8.38 8.38 0 0 1-.9-3.8 8.5 8.5 0 0 1 4.7-7.6 8.38 8.38 0 0 1 3.8-.9h.5a8.48 8.48 0 0 1 8 8v.5z"></path>
        </svg>
        Discord
      </a>
      <a href="https://x.com/OuteAI" target="_blank" class="flex items-center gap-1 text-gray-700 dark:text-gray-300 text-m font-medium hover:text-gray-900 dark:hover:text-white transition-colors underline">
        <svg width="18" height="18" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
          <path d="M23 3a10.9 10.9 0 0 1-3.14 1.53 4.48 4.48 0 0 0-7.86 3v1A10.66 10.66 0 0 1 3 4s-4 9 5 13a11.64 11.64 0 0 1-7 2c9 5 20 0 20-11.5a4.5 4.5 0 0 0-.08-.83A7.72 7.72 0 0 0 23 3z"></path>
        </svg>
        @OuteAI
      </a>
    </div>
  </div>

  <div class="grid grid-cols-3 sm:grid-cols-3 gap-2">
    <a href="https://huggingface.co/OuteAI/Llama-OuteTTS-1.0-1B" target="_blank" class="bg-white dark:bg-gray-700 text-gray-800 dark:text-gray-100 text-sm font-medium py-2 px-3 rounded-md text-center hover:bg-gray-100 dark:hover:bg-gray-600 hover:border-gray-300 dark:hover:border-gray-500 border border-transparent transition-all">
      Llama OuteTTS 1.0 1B
    </a>
    <a href="https://huggingface.co/OuteAI/Llama-OuteTTS-1.0-1B-GGUF" target="_blank" class="bg-white dark:bg-gray-700 text-gray-800 dark:text-gray-100 text-sm font-medium py-2 px-3 rounded-md text-center hover:bg-gray-100 dark:hover:bg-gray-600 hover:border-gray-300 dark:hover:border-gray-500 border border-transparent transition-all">
      Llama OuteTTS 1.0 1B GGUF
    </a>
    <a href="https://huggingface.co/OuteAI/Llama-OuteTTS-1.0-1B-FP8" target="_blank" class="bg-white dark:bg-gray-700 text-gray-800 dark:text-gray-100 text-sm font-medium py-2 px-3 rounded-md text-center hover:bg-gray-100 dark:hover:bg-gray-600 hover:border-gray-300 dark:hover:border-gray-500 border border-transparent transition-all">
      Llama OuteTTS 1.0 1B FP8
    </a>
    <a href="https://huggingface.co/OuteAI/Llama-OuteTTS-1.0-1B-EXL2-8bpw" target="_blank" class="bg-white dark:bg-gray-700 text-gray-800 dark:text-gray-100 text-sm font-medium py-2 px-3 rounded-md text-center hover:bg-gray-100 dark:hover:bg-gray-600 hover:border-gray-300 dark:hover:border-gray-500 border border-transparent transition-all">
      Llama OuteTTS 1.0 1B 8bpw
    </a>
    <a href="https://github.com/edwko/OuteTTS" target="_blank" class="bg-white dark:bg-gray-700 text-gray-800 dark:text-gray-100 text-sm font-medium py-2 px-3 rounded-md text-center hover:bg-gray-100 dark:hover:bg-gray-600 hover:border-gray-300 dark:hover:border-gray-500 border border-transparent transition-all">
      GitHub Library
    </a>
  </div>
</div>

> [!IMPORTANT]
> **Important Sampling Considerations**
>
> When using OuteTTS version 1.0, it is crucial to use the settings specified in the [Sampling Configuration](#sampling-configuration) section.
> The **repetition penalty implementation** is particularly important: this model requires the penalty to be applied over a **64-token recent window**,
> not across the entire context window. Penalizing the entire context will cause the model to produce **broken or low-quality output**.
>
> The **outetts** library automatically sets up the necessary samplers and patches for all backends.
> If you use a custom implementation, make sure you implement these requirements correctly.

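For custom implementations, the windowed penalty can be sketched as follows. This is an illustrative stand-alone version, not the outetts library's internal code; the flat logit list and token-id history are assumptions about your decoding loop:

```python
def apply_windowed_repetition_penalty(logits, generated_ids, penalty=1.1, window=64):
    """Penalize only token ids seen in the most recent `window` generated tokens."""
    recent = set(generated_ids[-window:])
    penalized = list(logits)
    for token_id in recent:
        score = penalized[token_id]
        # CTRL-style penalty: shrink positive logits, push negative logits further down.
        penalized[token_id] = score / penalty if score > 0 else score * penalty
    return penalized
```

Tokens generated more than 64 steps ago are left untouched, which is what avoids the degraded output described above.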
# OuteTTS Version 1.0

This update brings significant improvements in speech synthesis and voice cloning, delivering a more powerful, accurate, and user-friendly experience in a compact size.

## What's New

### 1. Prompt Revamp & Dependency Removal
- **Automatic Word Alignment:** The model now performs word alignment internally. Simply input raw text with no pre-processing and the model handles the rest, streamlining your workflow. For optimal results, use normalized, readable text without newlines (light normalization is applied automatically in the outetts library).
- **Native Multilingual Text Support:** Direct support for native text across multiple languages eliminates the need for romanization.
- **Enhanced Metadata Integration:** The updated prompt system incorporates additional metadata (time, energy, spectral centroid, pitch) at both the global and word levels, improving speaker flow and synthesis quality.
- **Special Tokens for Audio Codebooks:** New tokens for c1 (codebook 1) and c2 (codebook 2).

### 2. New Audio Encoder Model
- **DAC Encoder:** Integrates the DAC audio encoder from [ibm-research/DAC.speech.v1.0](https://huggingface.co/ibm-research/DAC.speech.v1.0), using two codebooks for high-quality audio reconstruction.
- **Performance Trade-off:** The improved audio fidelity increases the token generation rate from 75 to 150 tokens per second. This trade-off prioritizes quality, especially for multilingual applications.

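As a rule of thumb following from the 150 tokens-per-second rate above, codec token counts map to audio durations roughly like this (a hypothetical back-of-the-envelope helper, not part of the library):

```python
CODEC_TOKENS_PER_SECOND = 150  # rate stated above for the DAC encoder

def approx_audio_seconds(num_audio_tokens: int) -> float:
    """Approximate audio duration represented by a run of DAC codec tokens."""
    return num_audio_tokens / CODEC_TOKENS_PER_SECOND
```

Prompt text and metadata tokens also consume context, so the practical audio budget of an 8,192-token window is shorter than the raw codec capacity alone would suggest.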
### 3. Voice Cloning
- **One-Shot Voice Cloning:** The model typically needs only around **10 seconds** of reference audio to produce an accurate voice representation.
- **Improved Accuracy:** Aided by the new encoder and additional training metadata, voice cloning is now more natural and precise.

### 4. Auto Text Alignment & Numerical Support
- **Automatic Text Alignment:** Aligns raw text at the word level, even for languages without clear word boundaries (e.g., Japanese, Chinese), using insights from pre-processed training data.
- **Direct Numerical Input:** Built-in multilingual numerical support allows numbers to be used directly in prompts, with no textual conversion needed. (The model typically chooses the dominant language present; mixing languages in a single prompt may lead to mistakes.)

### 5. Multilingual Capabilities

- **Supported Languages:** OuteTTS offers varying proficiency levels across languages, depending on training data exposure.

- **High Training Data Languages:** These languages received extensive training: **English, Arabic, Chinese, Dutch, French, German, Italian, Japanese, Korean, Lithuanian, Russian, Spanish**

- **Moderate Training Data Languages:** These languages received moderate training, offering good performance with occasional limitations: **Portuguese, Belarusian, Bengali, Georgian, Hungarian, Latvian, Persian/Farsi, Polish, Swahili, Tamil, Ukrainian**

- **Beyond Supported Languages:** The model can generate speech in untrained languages with varying success. Feel free to experiment with unlisted languages, though results may not be optimal.

## Video Showcase

<video width="1280" height="720" controls style="box-shadow: 0px 0px 20px 10px rgba(0, 0, 0, 0.05), 0px 1px 3px 10px rgba(255, 255, 255, 0.05);">
  <source src="https://huggingface.co/OuteAI/Llama-OuteTTS-1.0-1B-GGUF/resolve/main/media/showcase.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>

## OuteTTS Python Package v0.4.2

The new version adds **batched inference** generation to the latest OuteTTS release.

### ⚡ **Batched RTF Benchmarks**
Tested on an **NVIDIA L40S GPU**.

![rtf](https://huggingface.co/OuteAI/OuteTTS-1.0-0.6B-GGUF/resolve/main/assets/rtf.png)

## Quick Start Guide

Getting started with **OuteTTS** is simple:

### Installation

🔗 [Installation instructions](https://github.com/edwko/OuteTTS?tab=readme-ov-file#installation)

### Basic Usage
```python
import outetts

# Initialize the interface
interface = outetts.Interface(
    config=outetts.ModelConfig.auto_config(
        model=outetts.Models.VERSION_1_0_SIZE_1B,
        # For llama.cpp backend
        backend=outetts.Backend.LLAMACPP,
        quantization=outetts.LlamaCppQuantization.FP16,
        # For transformers backend
        # backend=outetts.Backend.HF,
    )
)

# Load the default speaker profile
speaker = interface.load_default_speaker("EN-FEMALE-1-NEUTRAL")

# Or create your own speaker profiles in seconds and reuse them instantly
# speaker = interface.create_speaker("path/to/audio.wav")
# interface.save_speaker(speaker, "speaker.json")
# speaker = interface.load_speaker("speaker.json")

# Generate speech
output = interface.generate(
    config=outetts.GenerationConfig(
        text="Hello, how are you doing?",
        generation_type=outetts.GenerationType.CHUNKED,
        speaker=speaker,
        sampler_config=outetts.SamplerConfig(
            temperature=0.4
        ),
    )
)

# Save to file
output.save("output.wav")
```

### ⚡ Batch Setup
```python
from outetts import Interface, ModelConfig, GenerationConfig, Backend, GenerationType

if __name__ == "__main__":
    # Initialize the interface with a batch-capable backend
    interface = Interface(
        ModelConfig(
            model_path="OuteAI/Llama-OuteTTS-1.0-1B-FP8",
            tokenizer_path="OuteAI/Llama-OuteTTS-1.0-1B",
            backend=Backend.VLLM,
            # For EXL2, use backend=Backend.EXL2ASYNC and set exl2_cache_seq_multiply
            # to the same value as max_batch_size in GenerationConfig
            # For LLAMACPP_ASYNC_SERVER, use backend=Backend.LLAMACPP_ASYNC_SERVER
            # and provide server_host in GenerationConfig
        )
    )

    # Load your speaker profile
    speaker = interface.load_default_speaker("EN-FEMALE-1-NEUTRAL")  # Or load/create a custom speaker

    # Generate speech using BATCH type
    # Note: For EXL2ASYNC, VLLM, and LLAMACPP_ASYNC_SERVER, BATCH is selected automatically.
    output = interface.generate(
        GenerationConfig(
            text="This is a longer text that will be automatically split into chunks and processed in batches.",
            speaker=speaker,
            generation_type=GenerationType.BATCH,
            max_batch_size=32,        # Adjust based on your GPU memory and server capacity
            dac_decoding_chunk=2048,  # Adjust chunk size for DAC decoding
            # If using LLAMACPP_ASYNC_SERVER, add:
            # server_host="http://localhost:8000"  # Replace with your server address
        )
    )

    # Save to file
    output.save("output_batch.wav")
```

### More Configuration Options
For advanced settings and customization, visit the official repository:

[![Documentation](https://img.shields.io/badge/📖_Read_The_Docs-Interface_Guide-blue?style=for-the-badge)](https://github.com/edwko/OuteTTS/blob/main/docs/interface_usage.md)

## Usage Recommendations

### Speaker Reference
The model is designed to be used with a speaker reference. Without one, it generates random vocal characteristics, often leading to lower-quality output.
The model inherits the referenced speaker's emotion, style, and accent.
When generating speech in other languages with the same speaker, you may observe the model retaining the original accent.

### Multilingual Application
It is recommended to create a speaker profile in the language you intend to use. This helps achieve the best results in that specific language, including tone, accent, and linguistic features.

While the model supports cross-lingual speech, it still relies on the reference speaker. If the speaker has a distinct accent, such as British English, other languages may carry that accent as well.

### Optimal Audio Length
- **Best Performance:** Generate audio of around **42 seconds** in a single run (approximately 8,192 tokens). Avoid pushing toward the limits of this window when generating; the best results usually come from generations of up to about 7,000 tokens.
- **Context Reduction with Speaker Reference:** If the speaker reference is 10 seconds long, the effective context is reduced to approximately 32 seconds.

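The two points above combine into a simple rule of thumb. The helper below is hypothetical; the 42-second figure comes from the guidance above:

```python
def remaining_audio_budget_seconds(reference_seconds: float, window_seconds: float = 42.0) -> float:
    """Seconds of new audio roughly available after the speaker reference
    consumes its share of the 8,192-token context window."""
    return max(window_seconds - reference_seconds, 0.0)
```

For example, a 10-second reference leaves roughly 32 seconds of generation room, matching the figure above.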
### Temperature Setting Recommendations
Testing shows that a temperature of **0.4** is an ideal starting point for accuracy (with the sampling settings below). However, some voice references may benefit from higher temperatures for enhanced expressiveness, or slightly lower temperatures for more precise voice replication.

### Verifying Speaker Encoding
If the cloned voice quality is subpar, check the encoded speaker sample.

```python
interface.decode_and_save_speaker(speaker=your_speaker, path="speaker.wav")
```

The DAC audio reconstruction model is lossy, and samples with clipping, excessive loudness, or unusual vocal features may introduce encoding issues that impact output quality.

### Sampling Configuration
For optimal results with this TTS model, use the following sampling settings.

| Parameter            | Value  |
|----------------------|--------|
| Temperature          | 0.4    |
| Repetition Penalty   | 1.1    |
| **Repetition Range** | **64** |
| Top-k                | 40     |
| Top-p                | 0.9    |
| Min-p                | 0.05   |

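If you are wiring these settings into a custom sampler, min-p is the least widely implemented of them: it discards tokens whose probability falls below `min_p` times the top token's probability. A minimal illustrative sketch over a probability list (not the library's implementation):

```python
def min_p_filter(probs, min_p=0.05):
    """Keep only tokens with probability >= min_p * max(probs), then renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]
```

Unlike top-k, the number of surviving tokens adapts to how peaked the distribution is, which pairs well with the low temperature recommended above.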
## Model Specifications

- **Training Data:** Trained on **~60k hours of audio**
- **Context Length:** Supports a maximum context window of **8,192 tokens**

### Training Parameters

#### **Pre-Training**
- **Optimizer:** AdamW
- **Batch Size:** 1 million tokens
- **Max Learning Rate:** 3e-4
- **Min Learning Rate:** 3e-5
- **Context Length:** 8192

#### **Fine-Tuning**
- **Optimizer:** AdamW
- **Max Learning Rate:** 1e-5
- **Min Learning Rate:** 5e-6
- **Data:** 10,000 diverse, high-quality examples

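The pre-training learning-rate bounds above are consistent with a standard cosine decay schedule; the cosine shape itself is an assumption (the card only states the max and min values):

```python
import math

def cosine_lr(step: int, total_steps: int, max_lr: float = 3e-4, min_lr: float = 3e-5) -> float:
    """Cosine decay from max_lr at step 0 down to min_lr at total_steps."""
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```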
## License Information

- **Initial Llama3.2 Components:** [Llama 3.2 Community License Agreement](https://huggingface.co/meta-llama/Llama-3.2-1B/blob/main/LICENSE.txt)
- **Our Continued Pre-Training, Fine-Tuning, and Additional Components:** [CC-BY-NC-SA-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)

## Acknowledgments

- Big thanks to **Hugging Face** for their continued resource support through their grant program!
- Audio encoding and decoding utilize [ibm-research/DAC.speech.v1.0](https://huggingface.co/ibm-research/DAC.speech.v1.0)
- OuteTTS is built on [Llama3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) as the base model, with continued pre-training and fine-tuning.

### Ethical Use Guidelines
This text-to-speech model is intended for legitimate applications that enhance accessibility, creativity, and communication.
Prohibited uses include impersonation without consent, creation of deliberately misleading content,
generation of harmful or harassing material, distribution of synthetic audio without proper disclosure,
voice cloning without permission, and any use that violates applicable laws, regulations, or copyrights.