VoiceSementle / reference_audio /README.md

SJLee-0525

[CHORE] test9

9f30ef0 13 days ago

Reference Audio Directory

This directory contains ground truth (GT) audio files for voice comparison.

Structure

reference_audio/
├── meme/
│   ├── hello_world.wav
│   ├── inception.wav
│   └── ...
├── song/
│   ├── bohemian_rhapsody.wav
│   ├── let_it_be.wav
│   └── ...
└── movie/
    ├── star_wars.wav
    ├── titanic.wav
    └── ...

File Naming Convention

Format: {answer_word}.wav

Use lowercase with underscores for multi-word answers
Examples:
- "Hello World" → hello_world.wav
- "Inception" → inception.wav
- "Bohemian Rhapsody" → bohemian_rhapsody.wav
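The naming rule can be sketched as a small helper. The function name `answer_to_filename` is hypothetical and not part of the project. The convention above does not say how punctuation should be handled, so this sketch only lowercases the text and replaces whitespace.

```python
import re


def answer_to_filename(answer: str) -> str:
    """Convert a display answer such as "Hello World" to "hello_world.wav"."""
    # Lowercase, trim, then collapse each run of whitespace into one underscore.
    # Punctuation is left untouched because the convention does not cover it.
    slug = re.sub(r"\s+", "_", answer.strip().lower())
    return f"{slug}.wav"


print(answer_to_filename("Hello World"))        # hello_world.wav
print(answer_to_filename("Bohemian Rhapsody"))  # bohemian_rhapsody.wav
```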

Audio Specifications

Recommended format:

Format: WAV (uncompressed)
Sample Rate: 16 kHz or 44.1 kHz
Channels: Mono (1 channel)
Bit Depth: 16-bit
Duration: 2-10 seconds (clear pronunciation of the word/phrase)

Quality guidelines:

Clear, professional pronunciation
Minimal background noise
Natural pace (not too fast or slow)
Proper emphasis and intonation
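The recommended format can be checked automatically before a file is added. The following is a minimal sketch using Python's standard `wave` module. The function name `check_reference_wav` and the constants are assumptions for illustration, not part of the project. The quality guidelines (noise, pace, intonation) still need a human listener.

```python
import wave

ALLOWED_RATES = {16000, 44100}        # 16 kHz or 44.1 kHz
MIN_SECONDS, MAX_SECONDS = 2.0, 10.0  # recommended duration range


def check_reference_wav(path):
    """Return a list of problems; an empty list means the file matches the recommended format."""
    # wave.open only reads uncompressed PCM WAV and raises wave.Error otherwise.
    with wave.open(str(path), "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample
        rate = wav.getframerate()
        duration = wav.getnframes() / rate

    problems = []
    if channels != 1:
        problems.append(f"expected mono, got {channels} channels")
    if sample_width != 2:
        problems.append(f"expected 16-bit samples, got {sample_width * 8}-bit")
    if rate not in ALLOWED_RATES:
        problems.append(f"unexpected sample rate: {rate} Hz")
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(f"duration {duration:.2f}s is outside 2-10 seconds")
    return problems
```

Calling `check_reference_wav("reference_audio/meme/hello_world.wav")` returns an empty list when the file matches every recommended value.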

How It Works

When a user submits audio for analysis:

1. The backend loads reference_audio/{category}/{answer_word}.wav.
2. VoiceKit MCP compares the user audio against the reference audio (GT).
3. It returns similarity scores for Pitch, Rhythm, Energy, Pronunciation, Transcript, and Overall.

Example:

Puzzle: "Hello World" (category: meme)
Reference: reference_audio/meme/hello_world.wav
User says: "Hello World"
→ VoiceKit compares and scores pronunciation similarity
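The path lookup in the first step can be sketched as follows. The helper name `reference_path` and the `REFERENCE_ROOT` constant are hypothetical, and the actual backend may resolve the directory differently.

```python
from pathlib import Path

# Assumed location of the directory, relative to the working directory.
REFERENCE_ROOT = Path("reference_audio")


def reference_path(category: str, answer_word: str, root: Path = REFERENCE_ROOT) -> Path:
    """Build the expected reference file path for one puzzle."""
    return root / category / f"{answer_word}.wav"


print(reference_path("meme", "hello_world").as_posix())
# reference_audio/meme/hello_world.wav
```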

Fallback Behavior

If reference audio is not found:

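1. The backend logs a warning.
2. The user's audio is used as its own reference. The comparison still runs, but the scores are less meaningful because the audio is compared against itself.
3. Gemini still generates hints based on the attempt number.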

Log example:

WARNING: Reference audio not found: reference_audio/meme/hello_world.wav, using user audio
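The fallback described above might look roughly like this. This is a sketch under assumptions, not the backend's actual code: the function name `load_reference_or_fallback`, the logger name, and the returned tuple are all made up for illustration.

```python
import logging
from pathlib import Path

logger = logging.getLogger("reference_audio")  # hypothetical logger name


def load_reference_or_fallback(category, answer_word, user_audio, root=Path("reference_audio")):
    """Return (reference_bytes, used_fallback) for one puzzle."""
    path = Path(root) / category / f"{answer_word}.wav"
    if path.is_file():
        return path.read_bytes(), False
    # No ground-truth file: warn, then fall back to the user's own audio.
    logger.warning("Reference audio not found: %s, using user audio", path.as_posix())
    return user_audio, True
```

Returning the `used_fallback` flag lets the caller mark the resulting scores as unreliable instead of presenting them as a real comparison.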

Adding Reference Audio

Option 1: Record Your Own

Use any audio recording tool:

# Using arecord (Linux)
arecord -f S16_LE -r 16000 -c 1 -d 5 hello_world.wav

# Using ffmpeg (convert from any format)
ffmpeg -i input.mp3 -ar 16000 -ac 1 hello_world.wav

Option 2: Text-to-Speech (TTS)

Use ElevenLabs API or similar:

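# Legacy elevenlabs SDK (pre-1.0) interface; newer releases expose a client object instead
from elevenlabs import generate, save

audio = generate(
    text="Hello World",
    voice="Professional Male",  # placeholder; use a voice available in your account
    model="eleven_monolingual_v1"
)
# The API returns MP3 data by default, so save as .mp3 and convert to WAV afterwards:
#   ffmpeg -i hello_world.mp3 -ar 16000 -ac 1 hello_world.wav
save(audio, "hello_world.mp3")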

Option 3: Extract from Media

For songs/movies, extract the specific phrase:

# Extract 5 seconds starting at 1:30
ffmpeg -i movie.mp4 -ss 00:01:30 -t 5 -ar 16000 -ac 1 star_wars.wav
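When many clips need extracting, the same ffmpeg invocation can be assembled in Python. This is a minimal sketch; the helper name `build_extract_command` is hypothetical, and it only uses the flags already shown above.

```python
import shlex


def build_extract_command(source, start, duration_s, output, sample_rate=16000):
    """Build the ffmpeg argument list for extracting a mono clip."""
    return [
        "ffmpeg",
        "-i", source,            # input media file
        "-ss", start,            # start offset, e.g. "00:01:30"
        "-t", str(duration_s),   # clip length in seconds
        "-ar", str(sample_rate), # output sample rate
        "-ac", "1",              # mono
        output,
    ]


cmd = build_extract_command("movie.mp4", "00:01:30", 5, "star_wars.wav")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` runs the extraction, provided ffmpeg is installed and on the PATH.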

Database Integration

Ensure the answer_word in your database matches the filename:

SELECT puzzle_number, answer_word, category FROM puzzles;

puzzle_number	answer_word	category
1	hello_world	meme
2	inception	movie
3	bohemian_rhapsody	song

Files should be at:

reference_audio/meme/hello_world.wav
reference_audio/movie/inception.wav
reference_audio/song/bohemian_rhapsody.wav
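To catch mismatches between the database and the directory, the rows returned by the query can be checked against the file system. A minimal sketch, assuming the rows are available as `(answer_word, category)` tuples; the function name `missing_reference_files` is hypothetical.

```python
from pathlib import Path


def missing_reference_files(rows, root=Path("reference_audio")):
    """Return the expected paths that do not exist for the given (answer_word, category) rows."""
    missing = []
    for answer_word, category in rows:
        path = Path(root) / category / f"{answer_word}.wav"
        if not path.is_file():
            missing.append(path)
    return missing


# Example rows mirroring the table above.
rows = [
    ("hello_world", "meme"),
    ("inception", "movie"),
    ("bohemian_rhapsody", "song"),
]
for path in missing_reference_files(rows):
    print(f"Missing: {path.as_posix()}")
```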

Testing

Test reference audio loading:

from pathlib import Path

category = "meme"
answer_word = "hello_world"
path = Path(f"reference_audio/{category}/{answer_word}.wav")

# Report whether the reference file exists and how large it is
if path.exists():
    print(f"✓ Found: {path}")
    audio_bytes = path.read_bytes()
    print(f"  Size: {len(audio_bytes)} bytes")
else:
    print(f"✗ Not found: {path}")

Notes

Reference audio files are gitignored (too large for git)
Share reference audio separately (cloud storage, network drive, etc.)
Each puzzle needs one reference audio file
Quality of reference audio affects scoring accuracy