Generate word-level timestamps from audio
Align audio to word-level timestamps
Generate speech from text with emotional voices
Convert text to emotional speech
Generate audio from text with emotional voices