|
|
--- |
|
|
base_model: |
|
|
- Supertone/supertonic |
|
|
pipeline_tag: text-to-speech |
|
|
--- |
|
|
|
|
|
# Supertonic Quantized INT8 - Offline TTS (Shadow0482)
|
|
|
|
|
This repository contains **INT8 optimized ONNX models** for the Supertonic Text-To-Speech pipeline. These models are quantized versions of the official Supertonic models and are designed for **offline, low-latency, CPU-friendly inference**.
|
|
|
|
|
FP16 versions exist for experimentation, but the vocoder currently contains a type mismatch (`float32` vs `float16`) in a `Div` node, so FP16 inference is **not stable**. Therefore, **INT8 is the recommended format** for real-world offline use.
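
To confirm the mismatch yourself, you can run shape inference over the FP16 graph and look for `Div` nodes whose inputs disagree on element type. A minimal diagnostic sketch, assuming the `onnx` package is installed and the FP16 vocoder sits at a hypothetical `fp16/vocoder.fp16.onnx` path:

```python
# Diagnostic sketch: find Div nodes with mismatched input dtypes.
# The model path is an assumption; adjust it to your local layout.
import onnx
from onnx import shape_inference

model = shape_inference.infer_shapes(onnx.load("fp16/vocoder.fp16.onnx"))

# Map each known tensor name to its ONNX element type
# (1 = float32, 10 = float16).
elem_type = {}
for vi in list(model.graph.value_info) + list(model.graph.input) + list(model.graph.output):
    elem_type[vi.name] = vi.type.tensor_type.elem_type
for init in model.graph.initializer:
    elem_type[init.name] = init.data_type

for node in model.graph.node:
    if node.op_type == "Div":
        types = [elem_type.get(name) for name in node.input]
        if None not in types and len(set(types)) > 1:
            print(f"Mismatched Div node '{node.name}': {types}")
```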
|
|
|
|
|
--- |
|
|
|
|
|
# 🚀 Features
|
|
|
|
|
### ✅ 100% Offline Execution
|
|
No network needed. Load ONNX models directly using ONNX Runtime. |
|
|
|
|
|
### ✅ Full Supertonic Inference Stack
|
|
- Text Encoder |
|
|
- Duration Predictor |
|
|
- Vector Estimator |
|
|
- Vocoder |
|
|
|
|
|
### ✅ INT8 Dynamic Quantization
|
|
- Reduces model sizes dramatically |
|
|
- CPU-friendly inference |
|
|
- Very low memory usage |
|
|
- Compatible with ONNX Runtime CPUExecutionProvider |
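
Models like these are typically produced with ONNX Runtime's dynamic quantization API. For reference, a minimal sketch (the paths are placeholders, not the exact commands used to build this repo):

```python
# Sketch: dynamic INT8 quantization of a single ONNX model.
# Input/output paths are placeholders.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="vocoder.onnx",        # FP32 source model
    model_output="vocoder.int8.onnx",  # quantized result
    weight_type=QuantType.QInt8,       # store weights as signed INT8
)
```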
|
|
|
|
|
### ✅ Comparable Audio Quality

Produces understandable speech while running drastically faster on CPUs.
|
|
|
|
|
--- |
|
|
|
|
|
# 📦 Repository Structure
|
|
|
|
|
```
int8_dynamic/
    duration_predictor.int8.onnx
    text_encoder.int8.onnx
    vector_estimator.int8.onnx
    vocoder.int8.onnx

fp16/
    (experimental FP16 models - vocoder currently unstable)
```
|
|
|
|
|
Only the **INT8 directory** is guaranteed stable. |
|
|
|
|
|
--- |
|
|
|
|
|
# 📝 Test Sentence Used in Benchmark
|
|
|
|
|
```
Greetings! You are listening to your newly quantized model.
I have been squished, squeezed, compressed, minimized, optimized,
digitized, and lightly traumatized to save disk space.
The testing framework automatically verifies my integrity,
measures how much weight I lost,
and checks if I can still talk without glitching into a robot dolphin.
If you can hear this clearly, the quantization ritual was a complete success.
```
|
|
|
|
|
--- |
|
|
|
|
|
# 📊 Benchmark Summary (CPU)
|
|
|
|
|
| Model | Precision | Time (s) | Output | Status |
|-------|-----------|---------:|--------|--------|
| INT8 Dynamic | int8 | _varies: ~3.0–7.0_ | `*.wav` | ✅ OK |
| FP32 (baseline) | float32 | ~2–4× slower | `*.wav` | ✅ OK |
| FP16 | mixed | ❌ FAILED | - | 🚫 Cannot load vocoder |
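
To reproduce rough numbers like these on your own hardware, a wall-clock measurement around each session run is enough. A minimal sketch for the vocoder alone; the latent shape below is a stand-in, so query `sess.get_inputs()[0].shape` for the real one:

```python
# Sketch: rough wall-clock timing of one ONNX Runtime session.
import time

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(
    "int8_dynamic/vocoder.int8.onnx",
    providers=["CPUExecutionProvider"],
)

# Stand-in latent; replace with the shape reported by sess.get_inputs().
latent = np.random.randn(1, 128, 200).astype(np.float32)

start = time.perf_counter()
sess.run(None, {"latent": latent})
print(f"Vocoder run: {time.perf_counter() - start:.2f} s")
```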
|
|
|
|
|
--- |
|
|
|
|
|
# 🖥️ Offline Inference Guide (Python)
|
|
|
|
|
Below is a clean Python script to run **fully offline INT8 inference**. |
|
|
|
|
|
--- |
|
|
|
|
|
# 🧩 Requirements
|
|
|
|
|
```
pip install onnxruntime numpy soundfile
```
|
|
|
|
|
--- |
|
|
|
|
|
# 📄 offline_tts_int8.py
|
|
|
|
|
```python
import json
from pathlib import Path

import numpy as np
import onnxruntime as ort
import soundfile as sf

# ---------------------------------------------------------
# 1) CONFIG
# ---------------------------------------------------------
MODEL_DIR = Path("int8_dynamic")             # folder containing *.int8.onnx
VOICE_STYLE = "assets/voice_styles/M1.json"  # voice style preset (not consumed by this minimal script)

text_encoder_path = MODEL_DIR / "text_encoder.int8.onnx"
duration_pred_path = MODEL_DIR / "duration_predictor.int8.onnx"
vector_estimator_path = MODEL_DIR / "vector_estimator.int8.onnx"
vocoder_path = MODEL_DIR / "vocoder.int8.onnx"

TEST_TEXT = (
    "Hello! This is the INT8 offline version of Supertonic speaking. "
    "Everything you hear right now is running fully offline."
)

# ---------------------------------------------------------
# 2) TOKENIZER LOADING
# ---------------------------------------------------------
unicode_path = Path("assets/onnx/unicode_indexer.json")
with open(unicode_path, encoding="utf-8") as f:
    tokenizer = json.load(f)

def encode_text(text: str) -> np.ndarray:
    """Map each character to its token id, falling back to <unk>."""
    unk = tokenizer["token2idx"]["<unk>"]
    ids = [tokenizer["token2idx"].get(ch, unk) for ch in text]
    return np.array([ids], dtype=np.int64)

# ---------------------------------------------------------
# 3) LOAD MODELS (CPU)
# ---------------------------------------------------------
def load_session(model_path: Path) -> ort.InferenceSession:
    return ort.InferenceSession(
        str(model_path),
        providers=["CPUExecutionProvider"],
    )

sess_text = load_session(text_encoder_path)
sess_dur = load_session(duration_pred_path)
sess_vec = load_session(vector_estimator_path)
sess_voc = load_session(vocoder_path)

# ---------------------------------------------------------
# 4) RUN TEXT ENCODER
# ---------------------------------------------------------
text_ids = encode_text(TEST_TEXT)
text_mask = np.ones((1, 1, text_ids.shape[1]), dtype=np.float32)
style_ttl = np.zeros((1, 50, 256), dtype=np.float32)  # zero style embedding placeholder

text_out = sess_text.run(
    None,
    {
        "text_ids": text_ids,
        "text_mask": text_mask,
        "style_ttl": style_ttl,
    },
)[0]

# ---------------------------------------------------------
# 5) RUN DURATION PREDICTOR
# ---------------------------------------------------------
style_dp = np.zeros((1, 8, 16), dtype=np.float32)  # zero style embedding placeholder

dur_out = sess_dur.run(
    None,
    {
        "text_ids": text_ids,
        "text_mask": text_mask,
        "style_dp": style_dp,
    },
)[0]

# Clamp predicted durations to at least one frame per token.
# (Computed for completeness; this minimal pipeline feeds the encoder
# output straight to the vector estimator.)
durations = np.maximum(dur_out.astype(int), 1)

# ---------------------------------------------------------
# 6) VECTOR ESTIMATOR
# ---------------------------------------------------------
latent = sess_vec.run(None, {"latent": text_out})[0]

# ---------------------------------------------------------
# 7) VOCODER → WAV
# ---------------------------------------------------------
wav = sess_voc.run(None, {"latent": latent})[0][0]

sf.write("output_int8.wav", wav, 24000)
print("Saved: output_int8.wav")
```
|
|
|
|
|
--- |
|
|
|
|
|
# 🎧 Output
|
|
|
|
|
After running: |
|
|
|
|
|
```
python offline_tts_int8.py
```
|
|
|
|
|
You will get: |
|
|
|
|
|
```
output_int8.wav
```
|
|
|
|
|
Playable offline on any system. |
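
To sanity-check the file without playing it, `soundfile` can read it back:

```python
# Sketch: verify duration and sample rate of the generated WAV.
import soundfile as sf

data, sr = sf.read("output_int8.wav")
print(f"{len(data) / sr:.2f} s of audio at {sr} Hz")
```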
|
|
|
|
|
--- |
|
|
|
|
|
# 📌 Notes
|
|
|
|
|
* Only the **INT8** models are stable & recommended. |
|
|
* FP16 vocoder currently fails due to a type mismatch in a `Div` node. |
|
|
* No internet connection is required for INT8 inference. |
|
|
* These models are ideal for embedded or low-spec machines. |
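
To see the actual disk footprint on your machine, you can total the INT8 model sizes:

```python
# Sketch: report the on-disk size of each INT8 model.
from pathlib import Path

total = 0.0
for p in sorted(Path("int8_dynamic").glob("*.onnx")):
    mb = p.stat().st_size / 1e6
    total += mb
    print(f"{p.name}: {mb:.1f} MB")
print(f"Total: {total:.1f} MB")
```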
|
|
|
|
|
--- |
|
|
|
|
|
# 📜 License
|
|
|
|
|
The models follow Supertone's licensing terms; the quantized versions are distributed under the same license.
Quantized versions follow the same licensing. |