import gradio as gr
from gradio_client import Client
import os
import csv
import numpy as np
import scipy.io.wavfile as wavfile
import tempfile

client = Client(os.environ['src'])

css = """
.gradio-container input::placeholder,
.gradio-container textarea::placeholder {
    color: #333333 !important;
}
code {
    background-color: #ffde9f;
    padding: 2px 4px;
    border-radius: 3px;
}
#settings-accordion summary {
    justify-content: center;
}
.examples-holder > .label {
    color: #b45309 !important;
    font-weight: 600;
}
.audio-warning {
    color: #ff6b35 !important;
    font-weight: 600;
    margin: 10px 0;
}
.audio-error {
    color: #dc2626 !important;
    font-weight: 600;
    margin: 10px 0;
}
"""

def validate_audio_duration(audio_data):
    """
    Validate audio duration and return an appropriate message.

    Returns: (is_valid, warning_message)
    """
    if audio_data is None:
        return True, ""
    sample_rate, audio_array = audio_data
    duration_seconds = len(audio_array) / sample_rate
    if duration_seconds > 10:
        # Rendered with the .audio-error CSS class above; the exact wording
        # is a reconstruction, as the original string was truncated here.
        error_msg = f"""<div class="audio-error">
⚠️ Audio prompt is {duration_seconds:.1f}s long. Please use a clip of 10 seconds or less.
</div>"""
        return False, error_msg
    return True, ""
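# Quick sanity check of the validator using Gradio's numpy audio format,
# a (sample_rate, samples) tuple; the clips below are illustrative only.
_sr = 44100
_ok, _ = validate_audio_duration((_sr, np.zeros(_sr * 12, dtype=np.int16)))
assert not _ok  # 12-second clip exceeds the limit
_ok, _ = validate_audio_duration((_sr, np.zeros(_sr * 5, dtype=np.int16)))
assert _ok      # 5-second clip passes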
⚠️ This demo doesn't have load balancing or parallel query handling, so during busy times you'll have to wait for everyone ahead of you to finish. Sorry!
Takane is a frontier Japanese-only speech synthesis network trained on tens of thousands of hours of high-quality data to autoregressively generate highly compressed audio codes. It is powered by Kanadec, the world's only 44.1 kHz, 25-frames-per-second speech tokenizer, which uses semantic and acoustic distillation to produce audio tokens as fast as possible.
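To put that compression in perspective, here is a quick back-of-the-envelope calculation; only the 44.1 kHz sample rate and 25 Hz frame rate come from the description above, the rest is arithmetic:

```python
SAMPLE_RATE = 44_100   # Kanadec operates at 44.1 kHz
FRAME_RATE = 25        # ...and emits 25 token frames per second

samples_per_frame = SAMPLE_RATE // FRAME_RATE   # 1764 waveform samples per token frame

duration_s = 10
n_frames = duration_s * FRAME_RATE              # 250 frames for a 10 s clip
print(f"{duration_s}s of audio -> {n_frames} frames "
      f"({samples_per_frame} samples each)")
```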
There are two checkpoints in this demo. One of them uses a custom version of RoPE to control output duration, something seldom seen in autoregressive settings; please treat it as a proof of concept, since its outputs are not very reliable. I've included it to show that the idea works to some extent and can be expanded upon. Both checkpoints have been fine-tuned on a subset of the dataset using only speaker tags, which lets the model generate high-quality samples without relying on audio prompts or dealing with random speaker attributes, at the cost of tanking its zero-shot faithfulness.
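The demo doesn't explain how its RoPE variant manipulates duration, but one plausible mechanism is to rescale the rotary position indices so that positional "time" advances faster or slower than one step per token. A minimal numpy sketch of that idea; the `rate` knob and everything else here are assumptions, not Takane's actual implementation:

```python
import numpy as np

def rotary_angles(positions, dim, base=10000.0, rate=1.0):
    """Standard RoPE angles, with positions rescaled by `rate`.

    rate > 1 compresses perceived time (shorter output), rate < 1
    stretches it. This is a guess at how a duration-controlling RoPE
    variant could work, not Takane's method.
    """
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    scaled = positions[:, None] * rate        # (seq, 1) rescaled positions
    return scaled * inv_freq[None, :]         # (seq, dim/2) rotation angles

def apply_rope(x, rate=1.0):
    """Rotate channel pairs of x (seq, dim) by the scaled angles."""
    seq, dim = x.shape
    ang = rotary_angles(np.arange(seq, dtype=np.float64), dim, rate=rate)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.randn(16, 64)
fast = apply_rope(q, rate=1.5)   # positions advance 1.5x faster
slow = apply_rope(q, rate=0.75)  # positions advance 0.75x slower
```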
Takane also comes with an Anti-Hallucination Algorithm (AHA) that generates a few candidates in parallel and automatically returns the best one, at the cost of a small overhead. If you need the fastest possible response time, feel free to enable Turbo mode: it disables AHA and tweaks the parameters internally to produce samples in as little as 2-3 seconds (though with the current influx of users, you will probably be queued and have to wait!).
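The description doesn't reveal how AHA scores its candidates; a typical best-of-N scheme re-transcribes each candidate and keeps the one closest to the input text. A sketch under that assumption, where `synthesize` and `transcribe` are hypothetical stand-ins for the real model calls:

```python
import concurrent.futures
import difflib
import random

def synthesize(text: str, seed: int) -> str:
    """Hypothetical stand-in for the real TTS call (returns fake 'audio')."""
    rng = random.Random(seed)
    # Simulate occasional hallucination by dropping characters.
    return "".join(c for c in text if rng.random() > 0.1)

def transcribe(audio: str) -> str:
    """Hypothetical stand-in for an ASR model (identity on fake audio)."""
    return audio

def aha_generate(text: str, n_candidates: int = 4) -> str:
    """Best-of-N decoding: synthesize candidates in parallel and keep the
    one whose transcript is closest to the requested text."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_candidates) as pool:
        candidates = list(pool.map(lambda s: synthesize(text, s),
                                   range(n_candidates)))
    return max(candidates,
               key=lambda a: difflib.SequenceMatcher(
                   None, text, transcribe(a)).ratio())

print(aha_generate("こんにちは、高嶺です。"))
```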
There are currently no plans to release this model.
If you're not using an audio prompt or a speaker tag (or even if you are) and you find that later sentences sound too different, you may want to enable Chained mode, which sequentially conditions each output on the previous one to keep the speaker consistent.
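The likely mechanics of Chained mode: split the text into sentences, synthesize them one at a time, and feed each result back as the voice prompt for the next sentence. A sketch under that assumption, with `synthesize_with_prompt` as a hypothetical stand-in for the model call:

```python
import re

def synthesize_with_prompt(sentence: str, prompt_audio):
    """Hypothetical stand-in: synthesize `sentence`, voice-conditioned on
    `prompt_audio` (None means an unconditioned first sentence)."""
    return f"<audio:{sentence}|prompted={prompt_audio is not None}>"

def chained_tts(text: str):
    """Condition each sentence on the previous output so the speaker
    stays consistent across the whole passage."""
    sentences = [s for s in re.split(r"(?<=[。！？.!?])\s*", text) if s]
    outputs, prompt = [], None
    for sentence in sentences:
        audio = synthesize_with_prompt(sentence, prompt)
        outputs.append(audio)
        prompt = audio  # the next sentence is conditioned on this output
    return outputs

print(chained_tts("こんにちは。今日はいい天気ですね。散歩に行きましょう。"))
```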
🌸 Takane - Advanced Japanese Text-to-Speech System