Reference Audio Directory

This directory contains ground truth (GT) audio files for voice comparison.

Structure

reference_audio/
├── meme/
│   ├── hello_world.wav
│   ├── inception.wav
│   └── ...
├── song/
│   ├── bohemian_rhapsody.wav
│   ├── let_it_be.wav
│   └── ...
└── movie/
    ├── star_wars.wav
    ├── titanic.wav
    └── ...

File Naming Convention

Format: {answer_word}.wav

  • Use lowercase with underscores for multi-word answers
  • Examples:
    • "Hello World" → hello_world.wav
    • "Inception" → inception.wav
    • "Bohemian Rhapsody" → bohemian_rhapsody.wav
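
The convention above can be expressed as a small helper. This is only a sketch: the function name `answer_to_filename` is hypothetical (it is not taken from the backend), and it assumes that any run of whitespace between words becomes a single underscore.

```python
def answer_to_filename(answer: str) -> str:
    """Map a display answer such as "Hello World" to its reference filename."""
    # Lowercase the answer, then join whitespace-separated words with underscores.
    words = answer.strip().lower().split()
    return "_".join(words) + ".wav"


print(answer_to_filename("Hello World"))        # hello_world.wav
print(answer_to_filename("Bohemian Rhapsody"))  # bohemian_rhapsody.wav
```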

Audio Specifications

Recommended format:

  • Format: WAV (uncompressed)
  • Sample Rate: 16 kHz or 44.1 kHz
  • Channels: Mono (1 channel)
  • Bit Depth: 16-bit
  • Duration: 2-10 seconds (clear pronunciation of the word/phrase)
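
A file can be checked against the recommended format with Python's standard `wave` module. The sketch below is an illustration, not part of the backend; the function name `check_wav_spec` is hypothetical, and the thresholds simply mirror the list above.

```python
import wave


def check_wav_spec(path: str) -> list[str]:
    """Return a list of deviations from the recommended format (empty if none)."""
    problems = []
    # The wave module reads uncompressed PCM WAV; other encodings raise wave.Error.
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
        if rate not in (16000, 44100):
            problems.append(f"sample rate is {rate} Hz")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
        if wav.getsampwidth() != 2:  # 2 bytes per sample = 16-bit
            problems.append(f"{wav.getsampwidth() * 8}-bit samples, expected 16-bit")
        if not 2 <= duration <= 10:
            problems.append(f"duration is {duration:.1f} s")
    return problems


# Usage (path is an example):
# print(check_wav_spec("reference_audio/meme/hello_world.wav"))
```

An empty list means the file matches every recommended setting. Because these are recommendations rather than hard requirements, the function reports deviations instead of raising.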

Quality guidelines:

  • Clear, professional pronunciation
  • Minimal background noise
  • Natural pace (not too fast or slow)
  • Proper emphasis and intonation

How It Works

When a user submits audio for analysis:

  1. Backend loads: reference_audio/{category}/{answer_word}.wav
  2. VoiceKit MCP compares:
    • User audio vs. Reference audio (GT)
  3. Returns similarity scores:
    • Pitch, Rhythm, Energy, Pronunciation, Transcript, Overall

Example:

Puzzle: "Hello World" (category: meme)
Reference: reference_audio/meme/hello_world.wav
User says: "Hello World"
→ VoiceKit compares and scores pronunciation similarity
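
The lookup in step 1 amounts to building a path from the category and the answer word. A minimal sketch, assuming the directory is resolved relative to the working directory; `REFERENCE_ROOT` and `reference_path` are hypothetical names, not the backend's actual identifiers:

```python
from pathlib import Path

# Assumed location of the reference directory, relative to the working directory.
REFERENCE_ROOT = Path("reference_audio")


def reference_path(category: str, answer_word: str) -> Path:
    """Build reference_audio/{category}/{answer_word}.wav."""
    return REFERENCE_ROOT / category / f"{answer_word}.wav"


print(reference_path("meme", "hello_world").as_posix())
# reference_audio/meme/hello_world.wav
```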

Fallback Behavior

If reference audio is not found:

  • Backend logs a warning
  • Uses the user's audio as the reference (the comparison still runs, but the scores are less meaningful because the audio is compared against itself)
  • Gemini still generates hints based on attempt number

Log example:

WARNING: Reference audio not found: reference_audio/meme/hello_world.wav, using user audio
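
The fallback described above could look roughly like the following. This is a sketch under assumptions, not the backend's actual code: the function name `load_reference_or_fallback` and its parameters are hypothetical, and only the warning text is taken from the log example.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def load_reference_or_fallback(
    category: str,
    answer_word: str,
    user_audio: bytes,
    root: Path = Path("reference_audio"),  # assumed default location
) -> bytes:
    """Return the reference audio bytes, or the user's audio if the file is missing."""
    path = root / category / f"{answer_word}.wav"
    if path.is_file():
        return path.read_bytes()
    # Mirror the documented warning, then fall back to the user's own recording.
    logger.warning("Reference audio not found: %s, using user audio", path.as_posix())
    return user_audio
```

Returning the user's audio keeps the rest of the pipeline unchanged, at the cost of scores that no longer reflect a real ground truth.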

Adding Reference Audio

Option 1: Record Your Own

Use any audio recording tool:

# Using arecord (Linux)
arecord -f S16_LE -r 16000 -c 1 -d 5 hello_world.wav

# Using ffmpeg (convert from any format)
ffmpeg -i input.mp3 -ar 16000 -ac 1 hello_world.wav

Option 2: Text-to-Speech (TTS)

Use ElevenLabs API or similar:

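# Legacy (pre-1.0) elevenlabs SDK interface
from elevenlabs import generate, save

audio = generate(
    text="Hello World",
    voice="Professional Male",  # placeholder; use a voice available in your account
    model="eleven_monolingual_v1"
)
# generate() returns MP3 data by default, so save as .mp3 and convert to WAV afterwards:
#   ffmpeg -i hello_world.mp3 -ar 16000 -ac 1 hello_world.wav
save(audio, "hello_world.mp3")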

Option 3: Extract from Media

For songs/movies, extract the specific phrase:

# Extract 5 seconds starting at 1:30
ffmpeg -i movie.mp4 -ss 00:01:30 -t 5 -ar 16000 -ac 1 star_wars.wav

Database Integration

Ensure the answer_word in your database matches the filename:

SELECT puzzle_number, answer_word, category FROM puzzles;

puzzle_number  answer_word        category
1              hello_world        meme
2              inception          movie
3              bohemian_rhapsody  song

The files should be at:

  • reference_audio/meme/hello_world.wav
  • reference_audio/movie/inception.wav
  • reference_audio/song/bohemian_rhapsody.wav
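
To catch mismatches between the database and the directory, the expected path for each row can be checked on disk. The sketch below assumes the rows have already been fetched as (answer_word, category) pairs; the function name `missing_reference_files` is hypothetical, and the sample rows repeat the query result above.

```python
from pathlib import Path


def missing_reference_files(rows, root=Path("reference_audio")):
    """Return the expected paths that do not exist for (answer_word, category) rows."""
    missing = []
    for answer_word, category in rows:
        path = root / category / f"{answer_word}.wav"
        if not path.is_file():
            missing.append(path)
    return missing


# Sample rows matching the query result; in practice these come from the database.
rows = [
    ("hello_world", "meme"),
    ("inception", "movie"),
    ("bohemian_rhapsody", "song"),
]
for path in missing_reference_files(rows):
    print(f"Missing: {path.as_posix()}")
```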

Testing

Test reference audio loading:

from pathlib import Path

category = "meme"
answer_word = "hello_world"
path = Path(f"reference_audio/{category}/{answer_word}.wav")

if path.exists():
    print(f"✓ Found: {path}")
    # Read the whole file to confirm it is accessible and report its size
    with open(path, 'rb') as f:
        audio_bytes = f.read()
        print(f"  Size: {len(audio_bytes)} bytes")
else:
    print(f"✗ Not found: {path}")

Notes

  • Reference audio files are gitignored (too large for git)
  • Share reference audio separately (cloud storage, network drive, etc.)
  • Each puzzle needs one reference audio file
  • Quality of reference audio affects scoring accuracy