Whisper Large V3 Fine-tuned on KSC2 (Sybyrla)
This model is a fine-tuned version of openai/whisper-large-v3. It is designed to provide robust automatic speech recognition (ASR) for the Kazakh language, achieving a Word Error Rate (WER) of approximately 17.7%.
To handle real-world acoustic environments in the region, this model was trained on a strategic mix of Kazakh and Russian data.
Developed by: Inflexion Lab
License: Apache License 2.0
Model Description
- Model type: Transformer-based sequence-to-sequence model (Whisper Large V3)
- Language(s): Kazakh (kk), with Russian (ru) as auxiliary training data
- Task: Automatic Speech Recognition (ASR)
- Base Model: openai/whisper-large-v3
Performance
The model was evaluated on the held-out test split of the KSC2 dataset.
| Metric | Score |
|---|---|
| WER | ~17.7% |
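For reference, here is a minimal sketch of how a comparable corpus-level WER can be computed with the Hugging Face `evaluate` library; the file paths and reference transcripts are placeholders, not the actual evaluation setup used for this card.

```python
import evaluate
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="InflexionLab/sybyrla")
wer_metric = evaluate.load("wer")

# Hypothetical test samples: (audio path, reference transcript) pairs.
test_samples = [
    {"audio": "ksc2_test/sample_0001.wav", "text": "reference transcript here"},
]

predictions = [pipe(sample["audio"])["text"] for sample in test_samples]
references = [sample["text"] for sample in test_samples]

# Corpus-level WER over all samples.
print(wer_metric.compute(predictions=predictions, references=references))
```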
Training Data & Methodology
The training dataset was curated to address specific challenges in Kazakh ASR, particularly the lack of punctuation in raw datasets and the prevalence of code-switching in daily speech.
Dataset Composition (80/20 Split)
We used an 80% / 20% data mixing strategy to prevent model degradation and to improve stability when the model encounters non-Kazakh phonemes (a mixing sketch follows the dataset descriptions below).
Kazakh Speech Corpus 2 (KSC2) - ~80%
- Volume: ~1,200 hours.
- Processing: The original transcripts are plain lowercase text without punctuation. We used Gemma 27B to restructure the text, restoring proper capitalization and punctuation (a sketch of this step follows this list).
- Sources: Parliament speeches, TV/Radio broadcasts, podcasts, and crowdsourced recordings.
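A minimal sketch of this restoration step, assuming a prompt-based approach with an instruction-tuned Gemma checkpoint served through transformers; the checkpoint ID, prompt wording, and generation settings are illustrative assumptions, not the exact pipeline used to prepare the corpus.

```python
from transformers import pipeline

# Assumed checkpoint: an instruction-tuned Gemma 2 27B model (illustrative choice).
restorer = pipeline("text-generation", model="google/gemma-2-27b-it", device_map="auto")

def restore_punctuation(raw_transcript: str) -> str:
    # Ask the LLM to add capitalization and punctuation without altering the words.
    messages = [{
        "role": "user",
        "content": (
            "Restore capitalization and punctuation in this Kazakh transcript. "
            "Do not add, remove, or translate any words.\n\n" + raw_transcript
        ),
    }]
    out = restorer(messages, max_new_tokens=512, do_sample=False)
    # The pipeline returns the full chat; the last message is the model's answer.
    return out[0]["generated_text"][-1]["content"]

print(restore_punctuation("бүгін ауа райы жақсы болады деп айтты"))
```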
Common Voice Scripted Speech 23.0 (Russian) - ~20%
- Volume: ~250 hours.
- Purpose: Including high-quality Russian speech helps the model distinguish between languages and handle loanwords or code-switching without hallucinating or degrading into gibberish.
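A minimal sketch of how such an 80/20 mix can be built with the Hugging Face `datasets` library; the dataset identifiers are hypothetical placeholders, and the actual sampling procedure used for training may differ.

```python
from datasets import load_dataset, interleave_datasets

# Hypothetical dataset identifiers, used only to illustrate the mixing step.
ksc2 = load_dataset("path/to/ksc2", split="train")
cv_ru = load_dataset("path/to/common_voice_scripted_ru", split="train")

# Draw ~80% of examples from KSC2 and ~20% from the Russian corpus.
mixed = interleave_datasets(
    [ksc2, cv_ru],
    probabilities=[0.8, 0.2],
    seed=42,
    stopping_strategy="all_exhausted",
)
```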
Usage
Using Hugging Face transformers
You can use this model directly with the Hugging Face pipeline.
```python
from transformers import pipeline

# Load the ASR pipeline
pipe = pipeline("automatic-speech-recognition", model="InflexionLab/sybyrla")

# Transcribe an audio file
# (for long recordings, see the chunked long-form example below)
result = pipe("path/to/your/audio.mp3")
print(result["text"])
```
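For recordings longer than 30 seconds, the pipeline can chunk the audio and batch the chunks. The sketch below uses standard Whisper pipeline options; the chunk length, batch size, and forced language are reasonable defaults rather than values validated for this model.

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="InflexionLab/sybyrla",
    chunk_length_s=30,   # split long audio into 30-second chunks
    batch_size=8,        # decode several chunks in parallel
)

result = pipe(
    "path/to/long_recording.mp3",
    return_timestamps=True,  # segment-level timestamps for each chunk
    generate_kwargs={"language": "kazakh", "task": "transcribe"},
)
print(result["text"])
```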
Evaluation results
- WER on Kazakh Speech Corpus 2 (KSC2): 17.7 (self-reported)