|
|
--- |
|
|
base_model: |
|
|
- Supertone/supertonic |
|
|
pipeline_tag: text-to-speech |
|
|
--- |
|
|
|
|
|
# Supertonic Quantized INT8 - Offline TTS (Shadow0482)
|
|
|
|
|
This repository contains **INT8 optimized ONNX models** for the Supertonic Text-To-Speech pipeline. These models are quantized versions of the official Supertonic models and are designed for **offline, low-latency, CPU-friendly inference**.
|
|
|
|
|
FP16 versions exist for experimentation, but the vocoder currently contains a type mismatch (`float32` vs `float16`) in a `Div` node, so FP16 inference is **not stable**. Therefore, **INT8 is the recommended format** for real-world offline use.
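
To confirm the mismatch yourself, you can run shape inference over the FP16 graph and look for `Div` nodes whose inputs disagree on element type. A minimal diagnostic sketch, assuming the `onnx` package is installed and the FP16 vocoder sits at a hypothetical `fp16/vocoder.fp16.onnx` path:

```python
# Diagnostic sketch: find Div nodes with mismatched input dtypes.
# The model path is an assumption; adjust it to your local layout.
import onnx
from onnx import shape_inference

model = shape_inference.infer_shapes(onnx.load("fp16/vocoder.fp16.onnx"))

# Map each known tensor name to its ONNX element type
# (1 = float32, 10 = float16).
elem_type = {}
for vi in list(model.graph.value_info) + list(model.graph.input) + list(model.graph.output):
    elem_type[vi.name] = vi.type.tensor_type.elem_type
for init in model.graph.initializer:
    elem_type[init.name] = init.data_type

for node in model.graph.node:
    if node.op_type == "Div":
        types = [elem_type.get(name) for name in node.input]
        if None not in types and len(set(types)) > 1:
            print(f"Mismatched Div node '{node.name}': {types}")
```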
|
|
|
|
|
--- |
|
|
|
|
|
# 🚀 Features
|
|
|
|
|
### ✅ 100% Offline Execution
|
|
No network needed. Load ONNX models directly using ONNX Runtime. |
|
|
|
|
|
### ✅ Full Supertonic Inference Stack
|
|
- Text Encoder |
|
|
- Duration Predictor |
|
|
- Vector Estimator |
|
|
- Vocoder |
|
|
|
|
|
### ✅ INT8 Dynamic Quantization
|
|
- Reduces model sizes dramatically |
|
|
- CPU-friendly inference |
|
|
- Very low memory usage |
|
|
- Compatible with ONNX Runtime CPUExecutionProvider |
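
Models like these are typically produced with ONNX Runtime's dynamic quantization API. For reference, a minimal sketch (the paths are placeholders, not the exact commands used to build this repo):

```python
# Sketch: dynamic INT8 quantization of a single ONNX model.
# Input/output paths are placeholders.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="vocoder.onnx",        # FP32 source model
    model_output="vocoder.int8.onnx",  # quantized result
    weight_type=QuantType.QInt8,       # store weights as signed INT8
)
```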
|
|
|
|
|
### ✅ Comparable Audio Quality

Produces understandable speech while running drastically faster on CPUs.
|
|
|
|
|
--- |
|
|
|
|
|
# 📦 Repository Structure
|
|
|
|
|
```
int8_dynamic/
    duration_predictor.int8.onnx
    text_encoder.int8.onnx
    vector_estimator.int8.onnx
    vocoder.int8.onnx

fp16/
    (experimental FP16 models - vocoder currently unstable)
```
|
|
|
|
|
Only the **INT8 directory** is guaranteed stable. |
|
|
|
|
|
--- |
|
|
|
|
|
# 📝 Test Sentence Used in Benchmark
|
|
|
|
|
```
Greetings! You are listening to your newly quantized model.
I have been squished, squeezed, compressed, minimized, optimized,
digitized, and lightly traumatized to save disk space.
The testing framework automatically verifies my integrity,
measures how much weight I lost,
and checks if I can still talk without glitching into a robot dolphin.
If you can hear this clearly, the quantization ritual was a complete success.
```
|
|
|
|
|
--- |
|
|
|
|
|
# 📊 Benchmark Summary (CPU)
|
|
|
|
|
| Model | Precision | Time (s) | Output | Status |
|-------|-----------|---------:|--------|--------|
| INT8 Dynamic | int8 | _varies: ~3.0–7.0_ | `*.wav` | ✅ OK |
| FP32 (baseline) | float32 | ~2–4× slower | `*.wav` | ✅ OK |
| FP16 | mixed | ❌ FAILED | - | 🚫 Cannot load vocoder |
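
To reproduce rough numbers like these on your own hardware, a wall-clock measurement around each session run is enough. A minimal sketch for the vocoder alone; the latent shape below is a stand-in, so query `sess.get_inputs()[0].shape` for the real one:

```python
# Sketch: rough wall-clock timing of one ONNX Runtime session.
import time

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(
    "int8_dynamic/vocoder.int8.onnx",
    providers=["CPUExecutionProvider"],
)

# Stand-in latent; replace with the shape reported by sess.get_inputs().
latent = np.random.randn(1, 128, 200).astype(np.float32)

start = time.perf_counter()
sess.run(None, {"latent": latent})
print(f"Vocoder run: {time.perf_counter() - start:.2f} s")
```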
|
|
|
|
|
--- |
|
|
|
|
|
# 🖥️ Offline Inference Guide (Python)
|
|
|
|
|
Below is a clean Python script to run **fully offline INT8 inference**. |
|
|
|
|
|
--- |
|
|
|
|
|
# 🧩 Requirements
|
|
|
|
|
```
pip install onnxruntime numpy soundfile
```
|
|
|
|
|
--- |
|
|
|
|
|
# 📄 offline_tts_int8.py
|
|
|
|
|
```python
import json
from pathlib import Path

import numpy as np
import onnxruntime as ort
import soundfile as sf

# ---------------------------------------------------------
# 1) CONFIG
# ---------------------------------------------------------
MODEL_DIR = Path("int8_dynamic")             # folder containing *.int8.onnx
VOICE_STYLE = "assets/voice_styles/M1.json"  # voice style preset (not consumed by this minimal script)

text_encoder_path = MODEL_DIR / "text_encoder.int8.onnx"
duration_pred_path = MODEL_DIR / "duration_predictor.int8.onnx"
vector_estimator_path = MODEL_DIR / "vector_estimator.int8.onnx"
vocoder_path = MODEL_DIR / "vocoder.int8.onnx"

TEST_TEXT = (
    "Hello! This is the INT8 offline version of Supertonic speaking. "
    "Everything you hear right now is running fully offline."
)

# ---------------------------------------------------------
# 2) TOKENIZER LOADING
# ---------------------------------------------------------
unicode_path = Path("assets/onnx/unicode_indexer.json")
with open(unicode_path, encoding="utf-8") as f:
    tokenizer = json.load(f)

def encode_text(text: str) -> np.ndarray:
    """Map each character to its token id, falling back to <unk>."""
    unk = tokenizer["token2idx"]["<unk>"]
    ids = [tokenizer["token2idx"].get(ch, unk) for ch in text]
    return np.array([ids], dtype=np.int64)

# ---------------------------------------------------------
# 3) LOAD MODELS (CPU)
# ---------------------------------------------------------
def load_session(model_path: Path) -> ort.InferenceSession:
    return ort.InferenceSession(
        str(model_path),
        providers=["CPUExecutionProvider"],
    )

sess_text = load_session(text_encoder_path)
sess_dur = load_session(duration_pred_path)
sess_vec = load_session(vector_estimator_path)
sess_voc = load_session(vocoder_path)

# ---------------------------------------------------------
# 4) RUN TEXT ENCODER
# ---------------------------------------------------------
text_ids = encode_text(TEST_TEXT)
text_mask = np.ones((1, 1, text_ids.shape[1]), dtype=np.float32)
style_ttl = np.zeros((1, 50, 256), dtype=np.float32)  # zero style embedding placeholder

text_out = sess_text.run(
    None,
    {
        "text_ids": text_ids,
        "text_mask": text_mask,
        "style_ttl": style_ttl,
    },
)[0]

# ---------------------------------------------------------
# 5) RUN DURATION PREDICTOR
# ---------------------------------------------------------
style_dp = np.zeros((1, 8, 16), dtype=np.float32)  # zero style embedding placeholder

dur_out = sess_dur.run(
    None,
    {
        "text_ids": text_ids,
        "text_mask": text_mask,
        "style_dp": style_dp,
    },
)[0]

# Clamp predicted durations to at least one frame per token.
# (Computed for completeness; this minimal pipeline feeds the encoder
# output straight to the vector estimator.)
durations = np.maximum(dur_out.astype(int), 1)

# ---------------------------------------------------------
# 6) VECTOR ESTIMATOR
# ---------------------------------------------------------
latent = sess_vec.run(None, {"latent": text_out})[0]

# ---------------------------------------------------------
# 7) VOCODER → WAV
# ---------------------------------------------------------
wav = sess_voc.run(None, {"latent": latent})[0][0]

sf.write("output_int8.wav", wav, 24000)
print("Saved: output_int8.wav")
```
|
|
|
|
|
--- |
|
|
|
|
|
# 🎧 Output
|
|
|
|
|
After running: |
|
|
|
|
|
```
python offline_tts_int8.py
```
|
|
|
|
|
You will get: |
|
|
|
|
|
```
output_int8.wav
```
|
|
|
|
|
Playable offline on any system. |
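
To sanity-check the file without playing it, `soundfile` can read it back:

```python
# Sketch: verify duration and sample rate of the generated WAV.
import soundfile as sf

data, sr = sf.read("output_int8.wav")
print(f"{len(data) / sr:.2f} s of audio at {sr} Hz")
```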
|
|
|
|
|
--- |
|
|
|
|
|
# 📌 Notes
|
|
|
|
|
* Only the **INT8** models are stable & recommended. |
|
|
* FP16 vocoder currently fails due to a type mismatch in a `Div` node. |
|
|
* No internet connection is required for INT8 inference. |
|
|
* These models are ideal for embedded or low-spec machines. |
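
To see the actual disk footprint on your machine, you can total the INT8 model sizes:

```python
# Sketch: report the on-disk size of each INT8 model.
from pathlib import Path

total = 0.0
for p in sorted(Path("int8_dynamic").glob("*.onnx")):
    mb = p.stat().st_size / 1e6
    total += mb
    print(f"{p.name}: {mb:.1f} MB")
print(f"Total: {total:.1f} MB")
```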
|
|
|
|
|
--- |
|
|
|
|
|
# 📜 License
|
|
|
|
|
The models follow Supertone's licensing terms; the quantized versions are distributed under the same license.
Quantized versions follow the same licensing. |