Bad audio
Using my audio dataset results in poor quality output. Are there specific requirements for the source audio? (I can provide you with examples)
24 kHz mono .wav files are recommended.
But I also had to modify the Gradio UI with these changes to fix the poor output quality...
In your UI script (the .py Python file), change:
prompt_audio_upload = gr.Audio(..., type="numpy")
prompt_audio_record = gr.Audio(..., type="numpy")
to:
prompt_audio_upload = gr.Audio(sources=["upload"], type="filepath", label="Prompt audio (upload)")
prompt_audio_record = gr.Audio(sources=["microphone"], type="filepath", label="Prompt audio (record)")
Then update _write_prompt_to_temp_wav(...) to accept a path and read with soundfile:
data, sr = sf.read(prompt_path, dtype="float32", always_2d=True)
data = data.mean(axis=1)
These updates to the Gradio UI fixed the poor output quality I was having.
Btw here's the explanation of the fixes...
Because the model was never the problem; the UI pre-processing was.
In the CLI test you pointed CosyVoice3 at a real WAV file path, so it read the prompt audio correctly (proper scaling / no clipping / no Gradio conversions), and the clone locked in.
In the Gradio UI you were feeding type="numpy" prompt audio, and Gradio often hands back arrays that are effectively int16-scaled (peak around 32768) or otherwise not normalized. Your UI then wrote that raw array out as a float WAV, which clips/distorts the prompt. CosyVoice can’t extract the right speaker identity from a mangled prompt, so it falls back to a “generic” voice.
So the “magic” was simply: bypass the Gradio audio pipeline and give CosyVoice the prompt clip cleanly.
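If you would rather keep type="numpy", the alternative is to rescale the array yourself before writing it out. This is only a sketch of that idea, not what the fix above does: the helper name to_float32_mono is made up, and it assumes the array is signed integer PCM laid out as (frames,) or (frames, channels).

```python
import numpy as np


def to_float32_mono(data: np.ndarray) -> np.ndarray:
    """Hypothetical helper: scale signed integer PCM to [-1.0, 1.0] and downmix."""
    if np.issubdtype(data.dtype, np.signedinteger):
        # int16 tops out at 32767, so dividing by 32768 lands in [-1.0, 1.0)
        scale = float(np.iinfo(data.dtype).max) + 1.0
        audio = data.astype(np.float32) / scale
    else:
        # Assume float input is already normalized
        audio = data.astype(np.float32)
    if audio.ndim == 2:
        # Assumed layout: (frames, channels); average channels to mono
        audio = audio.mean(axis=1)
    return audio
```

In a callback that receives the (sample_rate, data) tuple from gr.Audio(type="numpy"), you would run data through this before saving it as a float WAV.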
Try using the example Python code; the Gradio UI needs some modification and may change the result.