Reference Audio Directory
This directory contains ground truth (GT) audio files for voice comparison.
Structure
reference_audio/
├── meme/
│   ├── hello_world.wav
│   ├── inception.wav
│   └── ...
├── song/
│   ├── bohemian_rhapsody.wav
│   ├── let_it_be.wav
│   └── ...
└── movie/
    ├── star_wars.wav
    ├── titanic.wav
    └── ...
File Naming Convention
Format: {answer_word}.wav
- Use lowercase with underscores for multi-word answers
- Examples:
  - "Hello World" → hello_world.wav
  - "Inception" → inception.wav
  - "Bohemian Rhapsody" → bohemian_rhapsody.wav
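The naming convention can be sketched as a small helper. The function name `answer_to_filename` is hypothetical, and collapsing punctuation into underscores is an assumption: the convention itself only specifies lowercase and underscores.

```python
import re


def answer_to_filename(answer: str) -> str:
    """Convert a display answer such as "Hello World" to "hello_world.wav"."""
    # Lowercase, then collapse every run of non-alphanumeric characters
    # (spaces, punctuation) into a single underscore.
    # Assumption: punctuation handling is not defined by the convention.
    slug = re.sub(r"[^a-z0-9]+", "_", answer.lower()).strip("_")
    return f"{slug}.wav"


print(answer_to_filename("Hello World"))        # hello_world.wav
print(answer_to_filename("Bohemian Rhapsody"))  # bohemian_rhapsody.wav
```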
Audio Specifications
Recommended format:
- Format: WAV (uncompressed)
- Sample Rate: 16kHz or 44.1kHz
- Channels: Mono (1 channel)
- Bit Depth: 16-bit
- Duration: 2-10 seconds (clear pronunciation of the word/phrase)
Quality guidelines:
- Clear, professional pronunciation
- Minimal background noise
- Natural pace (not too fast or slow)
- Proper emphasis and intonation
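The format recommendations above can be checked automatically. The following is a minimal sketch using Python's standard `wave` module; the function name `check_wav` and the exact thresholds are assumptions derived from the list above, not part of the backend. The quality guidelines (noise, pace, intonation) still need a human listener.

```python
import wave

# Assumed thresholds, taken from the recommended format above.
ALLOWED_RATES = {16000, 44100}
MIN_SECONDS, MAX_SECONDS = 2.0, 10.0


def check_wav(path: str) -> list[str]:
    """Return a list of deviations from the recommended format (empty if none)."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
        if wav.getnchannels() != 1:
            problems.append(f"expected mono, got {wav.getnchannels()} channels")
        if wav.getsampwidth() != 2:  # 2 bytes per sample = 16-bit
            problems.append(f"expected 16-bit, got {8 * wav.getsampwidth()}-bit")
        if rate not in ALLOWED_RATES:
            problems.append(f"unexpected sample rate {rate} Hz")
        if not MIN_SECONDS <= duration <= MAX_SECONDS:
            problems.append(f"duration {duration:.1f}s outside 2-10s")
    return problems
```

Calling `check_wav("reference_audio/meme/hello_world.wav")` returns an empty list when the file matches the recommendations. Files that are not uncompressed PCM WAV make `wave.open` raise `wave.Error`.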
How It Works
When a user submits audio for analysis:
- Backend loads: reference_audio/{category}/{answer_word}.wav
- VoiceKit MCP compares: user audio vs. reference audio (GT)
- Returns similarity scores: Pitch, Rhythm, Energy, Pronunciation, Transcript, Overall
Example:
Puzzle: "Hello World" (category: meme)
Reference: reference_audio/meme/hello_world.wav
User says: "Hello World"
→ VoiceKit compares and scores pronunciation similarity
Fallback Behavior
If reference audio is not found:
- Backend logs a warning
- Uses the user audio as its own reference (the comparison still runs, but the audio is scored against itself, so the scores say little about pronunciation accuracy)
- Gemini still generates hints based on attempt number
Log example:
WARNING: Reference audio not found: reference_audio/meme/hello_world.wav, using user audio
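The fallback can be sketched as follows. This is an illustration only, not the backend's actual code: the function name `load_reference`, its signature, and the returned flag are assumptions.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def load_reference(category: str, answer_word: str, user_audio: bytes,
                   root: Path = Path("reference_audio")) -> tuple[bytes, bool]:
    """Return (reference_bytes, is_real_reference)."""
    path = root / category / f"{answer_word}.wav"
    if path.is_file():
        return path.read_bytes(), True
    # Fallback: warn, then hand back the user audio so the comparison
    # still runs (against itself).
    logger.warning("Reference audio not found: %s, using user audio", path)
    return user_audio, False
```

Returning the flag alongside the bytes lets a caller mark scores from the fallback path as unreliable instead of presenting them as a real comparison.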
Adding Reference Audio
Option 1: Record Your Own
Use any audio recording tool:
# Using arecord (Linux)
arecord -f S16_LE -r 16000 -c 1 -d 5 hello_world.wav
# Using ffmpeg (convert from any format)
ffmpeg -i input.mp3 -ar 16000 -ac 1 hello_world.wav
Option 2: Text-to-Speech (TTS)
Use ElevenLabs API or similar:
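# Assumes an older elevenlabs SDK release that exports generate() and save()
from elevenlabs import generate, save

audio = generate(
    text="Hello World",
    voice="Professional Male",  # or any voice available to your account
    model="eleven_monolingual_v1",
)
# The API returns MP3 data by default, so save it as .mp3 and convert it
# to WAV afterwards (for example with the ffmpeg command from Option 1)
save(audio, "hello_world.mp3")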
Option 3: Extract from Media
For songs/movies, extract the specific phrase:
# Extract 5 seconds starting at 1:30
ffmpeg -i movie.mp4 -ss 00:01:30 -t 5 -ar 16000 -ac 1 star_wars.wav
Database Integration
Ensure the answer_word in your database matches the filename:
SELECT puzzle_number, answer_word, category FROM puzzles;
| puzzle_number | answer_word | category |
|---|---|---|
| 1 | hello_world | meme |
| 2 | inception | movie |
| 3 | bohemian_rhapsody | song |
Files should be at:
reference_audio/meme/hello_world.wav
reference_audio/movie/inception.wav
reference_audio/song/bohemian_rhapsody.wav
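A quick way to catch mismatches between the database and the directory is to check every row against the filesystem. The sketch below takes the rows as plain tuples so it does not assume a particular database driver; the function name `find_missing_references` is hypothetical.

```python
from pathlib import Path


def find_missing_references(rows, root="reference_audio"):
    """rows: iterable of (puzzle_number, answer_word, category) tuples."""
    missing = []
    for puzzle_number, answer_word, category in rows:
        path = Path(root) / category / f"{answer_word}.wav"
        if not path.is_file():
            missing.append((puzzle_number, path))
    return missing


# Rows as returned by the SELECT above.
rows = [
    (1, "hello_world", "meme"),
    (2, "inception", "movie"),
    (3, "bohemian_rhapsody", "song"),
]
for puzzle_number, path in find_missing_references(rows):
    print(f"puzzle {puzzle_number}: missing {path}")
```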
Testing
Test reference audio loading:
from pathlib import Path

category = "meme"
answer_word = "hello_world"
path = Path(f"reference_audio/{category}/{answer_word}.wav")

if path.exists():
    print(f"✓ Found: {path}")
    with open(path, 'rb') as f:
        audio_bytes = f.read()
    print(f"  Size: {len(audio_bytes)} bytes")
else:
    print(f"✗ Not found: {path}")
Notes
- Reference audio files are gitignored (too large for git)
- Share reference audio separately (cloud storage, network drive, etc.)
- Each puzzle needs one reference audio file
- Quality of reference audio affects scoring accuracy