---
language:
  - kk
  - ru
license: apache-2.0
tags:
  - automatic-speech-recognition
  - whisper
  - generated_from_trainer
  - kazakh
  - ksc2
  - common-voice
  - gemma-27b
datasets:
  - mozilla-foundation/common_voice_23_0
  - InflexionLab/ISSAI-KSC2-Structured
metrics:
  - wer
base_model: openai/whisper-large-v3
model-index:
  - name: whisper-large-v3-kazakh-ksc2
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Kazakh Speech Corpus 2 (KSC2)
          type: issai/ksc2
        metrics:
          - name: Wer
            type: wer
            value: 17.7
---

# Whisper Large V3 Fine-tuned on KSC2 (Sybyrla)

This model is a fine-tuned version of [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3). It provides robust automatic speech recognition (ASR) for the Kazakh language, achieving a word error rate (WER) of approximately 17.7% on the KSC2 test set.

To handle real-world acoustic environments in the region, this model was trained on a strategic mix of Kazakh and Russian data.

**Developed by:** Inflexion Lab
**License:** Apache License 2.0

## Model Description

- **Model type:** Transformer-based sequence-to-sequence model (Whisper Large V3)
- **Language(s):** Kazakh (kk), with Russian (ru) as auxiliary training data
- **Task:** Automatic Speech Recognition (ASR)
- **Base model:** `openai/whisper-large-v3`

## Performance

The model was evaluated on the held-out test split of the KSC2 dataset.

| Metric | Score  |
|--------|--------|
| WER    | ~17.7% |
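
WER is the word-level Levenshtein distance (substitutions + deletions + insertions) divided by the number of reference words. Real evaluations typically use a library such as `jiwer` or `evaluate`; the snippet below is only a minimal, self-contained illustration of the metric itself.

```python
# Minimal WER sketch: word-level edit distance / reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("бұл бір мысал сөйлем", "бұл бір мысал"))  # 1 deletion / 4 words = 0.25
```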

## Training Data & Methodology

The training dataset was curated to address specific challenges in Kazakh ASR, particularly the lack of punctuation in raw datasets and the prevalence of code-switching in daily speech.

### Dataset Composition (80/20 Split)

We used an 80% / 20% data-mixing strategy (Kazakh/Russian) to prevent model degradation and to improve stability when the model encounters non-Kazakh phonemes.

1. **Kazakh Speech Corpus 2 (KSC2) - ~80%**
   - **Volume:** ~1,200 hours.
   - **Processing:** The original transcripts are plain lowercase text. We used Gemma 27B to restructure the transcripts, restoring proper capitalization and punctuation.
   - **Sources:** Parliament speeches, TV/radio broadcasts, podcasts, and crowdsourced recordings.
2. **Common Voice Scripted Speech 23.0 (Russian) - ~20%**
   - **Volume:** ~250 hours.
   - **Purpose:** Including high-quality Russian speech helps the model distinguish between the two languages and handle loanwords and code-switching without hallucinating or degrading into gibberish.
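
A mix like this is typically built by interleaving the two corpora with sampling probabilities (e.g. `datasets.interleave_datasets` with `probabilities=[0.8, 0.2]`). The snippet below is a self-contained sketch of that idea with plain Python lists standing in for the corpora; the function name and pools are hypothetical, not the actual training pipeline.

```python
import random

# Illustrative 80/20 probabilistic mixing (stand-in for interleaving two
# datasets with probabilities=[0.8, 0.2]). All names here are hypothetical.
def mix_samples(kazakh_pool, russian_pool, n_samples, seed=42):
    rng = random.Random(seed)
    mixed = []
    for _ in range(n_samples):
        pool = kazakh_pool if rng.random() < 0.8 else russian_pool
        mixed.append(rng.choice(pool))
    return mixed

kk = [("kk", i) for i in range(1000)]  # toy stand-in for KSC2 utterances
ru = [("ru", i) for i in range(1000)]  # toy stand-in for Common Voice ru
sample = mix_samples(kk, ru, 10_000)
share_kk = sum(1 for lang, _ in sample if lang == "kk") / len(sample)
print(f"Kazakh share: {share_kk:.3f}")  # close to 0.8 by construction
```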

## Usage

### Using with Hugging Face `transformers`

You can use this model directly with the Hugging Face `pipeline` API.

```python
from transformers import pipeline

# Load the ASR pipeline
pipe = pipeline("automatic-speech-recognition", model="InflexionLab/sybyrla")

# Transcribe an audio file. For recordings longer than 30 seconds, pass
# chunk_length_s (and optionally batch_size) so the pipeline chunks the audio.
result = pipe("path/to/your/audio.mp3")

print(result["text"])
```
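
Whisper itself only sees 30-second windows, so for long recordings the pipeline splits the audio into overlapping chunks and stitches the transcripts back together (controlled by `chunk_length_s` and `batch_size`). The snippet below only illustrates how such overlapping chunk boundaries can be computed; it is a simplified sketch, not the pipeline's actual implementation.

```python
def chunk_bounds(total_s: float, chunk_s: float = 30.0, overlap_s: float = 5.0):
    """Return (start, end) windows of chunk_s seconds, each overlapping the
    previous one by overlap_s, covering the whole recording (simplified
    illustration of chunked long-form ASR)."""
    bounds, start = [], 0.0
    step = chunk_s - overlap_s
    while start < total_s:
        bounds.append((start, min(start + chunk_s, total_s)))
        if start + chunk_s >= total_s:
            break
        start += step
    return bounds

print(chunk_bounds(70.0))  # [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```

The overlap gives the decoder shared context at chunk edges, which is what lets transcripts be merged without dropping or duplicating words at the boundaries.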