---
license: bigscience-openrail-m
language:
- en
---

# Supertonic — Lightning Fast, On-Device TTS

**Supertonic** is a lightning-fast, on-device text-to-speech system designed for **extreme performance** with minimal computational overhead. Powered by ONNX Runtime, it runs entirely on your device—no cloud, no API calls, no privacy concerns.

> 🎧 **Try it now**: Experience Supertonic in your browser with our [**Interactive Demo**](https://huggingface.co/spaces/Supertone/supertonic#interactive-demo), or get started with pre-trained models from [**Hugging Face Hub**](https://huggingface.co/Supertone/supertonic)

### Table of Contents

- [Why Supertonic?](#why-supertonic)
- [Language Support](#language-support)
- [Getting Started](#getting-started)
- [Performance](#performance)
- [License](#license)

## Why Supertonic?

- **⚡ Blazingly Fast**: Generates speech up to **167× faster than real-time** on consumer hardware (M4 Pro)—unmatched by any other TTS system
- **🪶 Ultra Lightweight**: Only **66M parameters**, optimized for efficient on-device performance with minimal footprint
- **📱 On-Device Capable**: **Complete privacy** and **no network latency**—all processing happens locally on your device
- **🎨 Natural Text Handling**: Seamlessly processes numbers, dates, currency, abbreviations, and complex expressions without pre-processing
- **⚙️ Highly Configurable**: Adjust inference steps, batch processing, and other parameters to match your specific needs
- **🧩 Flexible Deployment**: Deploy across servers, browsers, and edge devices with multiple runtime backends

## Language Support

We provide ready-to-use TTS inference examples across multiple ecosystems:

| Language/Platform | Path | Description |
|-------------------|------|-------------|
| [**Python**](py/) | `py/` | ONNX Runtime inference |
| [**Node.js**](nodejs/) | `nodejs/` | Server-side JavaScript |
| [**Browser**](web/) | `web/` | WebGPU/WASM inference |
| [**Java**](java/) | `java/` | Cross-platform JVM |
| [**C++**](cpp/) | `cpp/` | High-performance C++ |
| [**C#**](csharp/) | `csharp/` | .NET ecosystem |
| [**Go**](go/) | `go/` | Go implementation |
| [**Swift**](swift/) | `swift/` | macOS applications |
| [**iOS**](ios/) | `ios/` | Native iOS apps |
| [**Rust**](rust/) | `rust/` | Memory-safe systems |

> For detailed usage instructions, please refer to the README.md in each language directory.

## Getting Started

First, clone the repository:

```bash
git clone https://github.com/supertone-inc/supertonic.git
cd supertonic
```

### Prerequisites

Before running the examples, download the ONNX models and preset voices, and place them in the `assets` directory:

```bash
git clone https://huggingface.co/Supertone/supertonic assets
```

> **Note:** The Hugging Face repository uses Git LFS. Please ensure Git LFS is installed and initialized before cloning or pulling large model files.
> - macOS: `brew install git-lfs && git lfs install`
> - Generic: see `https://git-lfs.com` for installers
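
If Git LFS was not initialized before cloning, the `assets` clone will contain small text pointer files instead of the actual model binaries. The snippet below is an optional sanity check, not part of the project's tooling; it only assumes the models were cloned into `assets/` as shown above, and flags any `.onnx` file that is still an LFS pointer:

```python
# check_assets.py - optional sanity check that Git LFS actually fetched the model binaries.
from pathlib import Path

# Un-fetched Git LFS files are small text stubs that start with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"

def is_lfs_pointer(path: Path) -> bool:
    """True if the file is an un-fetched Git LFS pointer stub."""
    with path.open("rb") as f:
        return f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX

assets = Path("assets")
onnx_files = sorted(assets.rglob("*.onnx"))
if not onnx_files:
    print("No .onnx files found - did the clone into ./assets succeed?")
for model in onnx_files:
    if is_lfs_pointer(model):
        print(f"{model}: LFS pointer only")
    else:
        print(f"{model}: OK ({model.stat().st_size / 1e6:.1f} MB)")
```

If anything is reported as a pointer, run `git lfs install` once and then `git lfs pull` inside `assets/` to fetch the real files.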

### Technical Details

- **Runtime**: ONNX Runtime for cross-platform inference (CPU-optimized; GPU mode is not tested)
- **Browser Support**: onnxruntime-web for client-side inference
- **Batch Processing**: Supports batch inference for improved throughput
- **Audio Output**: 16-bit WAV files (see the sketch below)
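
Neural TTS inference typically emits a floating-point waveform that has to be scaled to 16-bit PCM before it can be saved as WAV. The following is a generic sketch of that conversion, not code from this repository; the 44.1 kHz sample rate and output path are placeholder assumptions.

```python
# save_wav.py - generic float32 -> 16-bit PCM WAV conversion (illustrative only).
import wave

import numpy as np

SAMPLE_RATE = 44_100  # placeholder; use the rate your model actually produces

def save_wav_16bit(waveform: np.ndarray, path: str, sample_rate: int = SAMPLE_RATE) -> None:
    """Clip a float waveform to [-1, 1], scale to int16, and write a mono WAV file."""
    pcm = (np.clip(waveform, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())

# Example: write one second of silence
save_wav_16bit(np.zeros(SAMPLE_RATE, dtype=np.float32), "out.wav")
```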

## Performance

We evaluated Supertonic's performance (with 2 inference steps) using two key metrics across input texts of varying lengths: Short (59 chars), Mid (152 chars), and Long (266 chars).

**Metrics:**
- **Characters per Second**: Measures throughput by dividing the number of input characters by the time required to generate audio. Higher is better.
- **Real-time Factor (RTF)**: Measures the time taken to synthesize audio relative to its duration. Lower is better (e.g., an RTF of 0.1 means it takes 0.1 seconds to generate one second of audio). Both metrics are computed as shown in the sketch below.
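
For reference, the two metrics above reduce to one-line formulas. Here is a minimal sketch; the timing numbers in the usage example are made-up placeholders, not measurements:

```python
# Metric definitions from this section, written as plain functions.
def chars_per_second(num_chars: int, synthesis_seconds: float) -> float:
    """Throughput: input characters divided by the time to generate the audio (higher is better)."""
    return num_chars / synthesis_seconds

def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF: synthesis time relative to the duration of the generated audio (lower is better)."""
    return synthesis_seconds / audio_seconds

# Placeholder example: a 152-character input synthesized in 0.15 s, producing 11 s of audio.
print(chars_per_second(152, 0.15))   # ~1013 chars/s
print(real_time_factor(0.15, 11.0))  # ~0.014
```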

### Characters per Second

| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|--------|-----------------|----------------|-----------------|
| **Supertonic** (M4 Pro - CPU) | 912 | 1048 | 1263 |
| **Supertonic** (M4 Pro - WebGPU) | 996 | 1801 | 2509 |
| **Supertonic** (RTX 4090) | 2615 | 6548 | 12164 |
| `API` [ElevenLabs Flash v2.5](https://elevenlabs.io/docs/api-reference/text-to-speech/convert) | 144 | 209 | 287 |
| `API` [OpenAI TTS-1](https://platform.openai.com/docs/guides/text-to-speech) | 37 | 55 | 82 |
| `API` [Gemini 2.5 Flash TTS](https://ai.google.dev/gemini-api/docs/speech-generation) | 12 | 18 | 24 |
| `API` [Supertone Sona speech 1](https://docs.supertoneapi.com/en/api-reference/endpoints/text-to-speech) | 38 | 64 | 92 |
| `Open` [Kokoro](https://github.com/hexgrad/kokoro/) | 104 | 107 | 117 |
| `Open` [NeuTTS Air](https://github.com/neuphonic/neutts-air) | 37 | 42 | 47 |

> **Notes:**
> - `API` = Cloud-based API services (measured from Seoul)
> - `Open` = Open-source models
> - Supertonic (M4 Pro - CPU) and (M4 Pro - WebGPU): Tested with ONNX
> - Supertonic (RTX 4090): Tested with the PyTorch model
> - Kokoro: Tested on M4 Pro CPU with ONNX
> - NeuTTS Air: Tested on M4 Pro CPU with Q8-GGUF

### Real-time Factor

| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|--------|-----------------|----------------|-----------------|
| **Supertonic** (M4 Pro - CPU) | 0.015 | 0.013 | 0.012 |
| **Supertonic** (M4 Pro - WebGPU) | 0.014 | 0.007 | 0.006 |
| **Supertonic** (RTX 4090) | 0.005 | 0.002 | 0.001 |
| `API` [ElevenLabs Flash v2.5](https://elevenlabs.io/docs/api-reference/text-to-speech/convert) | 0.133 | 0.077 | 0.057 |
| `API` [OpenAI TTS-1](https://platform.openai.com/docs/guides/text-to-speech) | 0.471 | 0.302 | 0.201 |
| `API` [Gemini 2.5 Flash TTS](https://ai.google.dev/gemini-api/docs/speech-generation) | 1.060 | 0.673 | 0.541 |
| `API` [Supertone Sona speech 1](https://docs.supertoneapi.com/en/api-reference/endpoints/text-to-speech) | 0.372 | 0.206 | 0.163 |
| `Open` [Kokoro](https://github.com/hexgrad/kokoro/) | 0.144 | 0.124 | 0.126 |
| `Open` [NeuTTS Air](https://github.com/neuphonic/neutts-air) | 0.390 | 0.338 | 0.343 |

<details>
<summary><b>Additional Performance Data (5-step inference)</b></summary>

<br>

**Characters per Second (5-step)**

| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|--------|-----------------|----------------|-----------------|
| **Supertonic** (M4 Pro - CPU) | 596 | 691 | 850 |
| **Supertonic** (M4 Pro - WebGPU) | 570 | 1118 | 1546 |
| **Supertonic** (RTX 4090) | 1286 | 3757 | 6242 |

**Real-time Factor (5-step)**

| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|--------|-----------------|----------------|-----------------|
| **Supertonic** (M4 Pro - CPU) | 0.023 | 0.019 | 0.018 |
| **Supertonic** (M4 Pro - WebGPU) | 0.024 | 0.012 | 0.010 |
| **Supertonic** (RTX 4090) | 0.011 | 0.004 | 0.002 |

</details>

## License

This project’s sample code is released under the MIT License; see the [LICENSE](https://github.com/supertone-inc/supertonic?tab=MIT-1-ov-file) for details.

The accompanying model is released under the OpenRAIL-M License; see the [LICENSE](https://bigscience.huggingface.co/blog/bigscience-openrail-m) for details.

This model was trained using PyTorch, which is licensed under the BSD 3-Clause License but is not redistributed with this project; see the [LICENSE](https://docs.pytorch.org/FBGEMM/general/License.html) for details.