πŸ‡°πŸ‡¬ Whisper Small - Kyrgyz, English, Russian Speech To Text model

Model Description

kyrgyz-whisper-small is a fine-tuned multilingual speech recognition model based on OpenAI's Whisper Small architecture. This model adds native Kyrgyz language support while maintaining strong performance on English and Russian.

Key Features

  • Kyrgyz language support via a custom <|ky|> language token
  • Multilingual: Kyrgyz, English, and Russian
  • Trained on ~2,000 hours of Kyrgyz audio, plus English/Russian audio (roughly 40% of the training mix)
  • Ready for further improvement with LoRA fine-tuning (see the Colab Notebook linked below)
  • Optimized for real-world, noisy audio conditions

Performance

WER Distributions on FLEURS Benchmark

The following visualization shows improvement after fine-tuning:

[Figure: WER distributions on the FLEURS benchmark, before vs. after fine-tuning]

Key Observations:

  • Kyrgyz: dramatic improvement, from ~100% WER (unusable) to practical performance, with the distribution peaking around 0.2-0.4 WER (20-40%)
  • English & Russian: some degradation relative to the base model as a trade-off for Kyrgyz support
    • Distributions shifted right (higher WER)
    • This is expected when adding a new language to a fixed-capacity model
  • Multi-language trade-off: the model sacrifices some English/Russian accuracy to gain Kyrgyz capability
  • Benchmark: FLEURS (see the evaluation sketch below)
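
For reference, a WER check on FLEURS Kyrgyz could look like the following sketch. It reuses the `pipe` object built in the Usage section below; the dataset config name ("ky_kg") and the `evaluate` WER metric are assumptions, since the exact evaluation script is not published here.

import evaluate
from datasets import load_dataset

# Assumed setup: FLEURS Kyrgyz test split and the standard WER metric
wer_metric = evaluate.load("wer")
fleurs_ky = load_dataset("google/fleurs", "ky_kg", split="test")

predictions, references = [], []
for sample in fleurs_ky.select(range(100)):  # small subset for a quick check
    out = pipe({"array": sample["audio"]["array"],
                "sampling_rate": sample["audio"]["sampling_rate"]})
    predictions.append(out["text"].lower())
    references.append(sample["transcription"].lower())

print(f"WER: {wer_metric.compute(predictions=predictions, references=references):.3f}")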

Recommended Use Cases

  • Kyrgyz media transcription
  • Multilingual call centers
  • Educational content in Kyrgyz
  • Code-switching scenarios (common in Kyrgyzstan where people mix languages)
  • Foundation model for LoRA fine-tuning on clean Kyrgyz data

Technical Implementation

Custom Tokenizer Integration

from transformers import AutoTokenizer

# Load custom tokenizer with Kyrgyz support
tokenizer = AutoTokenizer.from_pretrained(
    "nineninesix/kyrgyz-whisper-small",
    trust_remote_code=True,  # required: loads the custom Kyrgyz tokenizer code
    language="kyrgyz",
    task="transcribe"
)
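
As a quick sanity check (illustrative, not part of the original card), you can confirm that the custom language token is registered:

# Verify the custom Kyrgyz language token round-trips through the tokenizer
ky_id = tokenizer.convert_tokens_to_ids("<|ky|>")
assert tokenizer.convert_ids_to_tokens(ky_id) == "<|ky|>"
print(f"<|ky|> token id: {ky_id}")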

Kyrgyz Token Initialization

The <|ky|> token was initialized as an average of embeddings from linguistically similar languages:

embedding_ky = (embedding_ru + embedding_kk + embedding_tr) / 3
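
A minimal sketch of how this initialization could be reproduced on the base checkpoint. The token-adding and embedding-resizing steps are assumptions; the exact procedure used for this model is not published in full.

import torch
from transformers import WhisperForConditionalGeneration, WhisperTokenizer

# Start from the base checkpoint and register the new language token
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small")
tokenizer.add_tokens(["<|ky|>"], special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

# Initialize <|ky|> as the mean of linguistically related language embeddings
emb = model.get_input_embeddings().weight
src_ids = tokenizer.convert_tokens_to_ids(["<|ru|>", "<|kk|>", "<|tr|>"])
ky_id = tokenizer.convert_tokens_to_ids("<|ky|>")
with torch.no_grad():
    emb[ky_id] = emb[src_ids].mean(dim=0)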

Usage

Pipeline Usage


import torch
from transformers import (
    AutoModelForSpeechSeq2Seq,
    AutoTokenizer,
    WhisperFeatureExtractor,
    pipeline,
)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "nineninesix/kyrgyz-whisper-small"

# Load the fine-tuned model
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

feature_extractor = WhisperFeatureExtractor.from_pretrained(model_id)

# trust_remote_code is required for the custom Kyrgyz tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_id, trust_remote_code=True, language="kyrgyz", task="transcribe"
)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=tokenizer,
    feature_extractor=feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe("audio.mp3")
print(result["text"])
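
For recordings longer than Whisper's 30-second window, the same pipeline can run chunked long-form inference. The `chunk_length_s` and `batch_size` values below are illustrative, not settings from the original card:

# Chunked long-form transcription (illustrative settings)
pipe_long = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=tokenizer,
    feature_extractor=feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    chunk_length_s=30,   # split long audio into 30 s windows
    batch_size=8,        # transcribe several chunks in parallel
)
result = pipe_long("long_interview.mp3", return_timestamps=True)
print(result["text"])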

Further Fine-tuning with LoRA

This model serves as a foundation for domain-specific fine-tuning using LoRA (Low-Rank Adaptation).

Unsloth integration example: see this Google Colab

Benefits of LoRA fine-tuning:

  • Adapt to specific domains (medical, legal, conversational)
  • Memory-efficient training
  • Faster training than full fine-tuning
  • Improved accuracy on clean datasets

Limitations

  • Trained on noisy data, so it may show higher WER on clean benchmarks than models trained on clean speech
  • Best performance on Kyrgyz, English, and Russian (other languages not supported)
  • Requires custom tokenizer for Kyrgyz language support
  • May require domain-specific fine-tuning for specialized applications

Citation

@misc{kyrgyz-whisper-small,
  author = {nineninesix},
  title = {Whisper Small - Kyrgyz, English, Russian},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/nineninesix/kyrgyz-whisper-small}
}

@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

Acknowledgments

  • Based on OpenAI's Whisper architecture
  • Kyrgyz tokenizer: kyrgyz-ai/whisper_tokenizer_ky
  • Training datasets: Kyrgyz ASR community contributions
  • Inspired by multilingual ASR research

License

Apache 2.0 - see LICENSE file for details.
