---
license: mit
language:
- rw
base_model:
- openai/whisper-large-v3-turbo
tags:
- speech
- transcription
metrics:
- wer
- cer
---
# Whisper Large v3 Turbo (Kinyarwanda)
Whisper is an automatic speech recognition (ASR) system developed by OpenAI. It transcribes and translates spoken language into text with high accuracy across many languages, and it is robust to varied accents and noisy recordings. It is designed for general-purpose speech processing and handles a wide range of audio inputs.
Whisper-large-v3-turbo is an optimized version of OpenAI's Whisper-large-v3 model, designed to enhance transcription speed while maintaining high accuracy. This optimization is achieved by reducing the number of decoder layers from 32 to 4, resulting in a model that is significantly faster with only a minor decrease in transcription quality.
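The reduced decoder depth is visible directly in the published model configurations. As a quick check (a minimal sketch, assuming the `transformers` library and access to the Hugging Face Hub):

```python
from transformers import WhisperConfig

# Compare the decoder depth of the original large-v3 model and the turbo variant.
cfg_v3 = WhisperConfig.from_pretrained("openai/whisper-large-v3")
cfg_turbo = WhisperConfig.from_pretrained("openai/whisper-large-v3-turbo")

print("large-v3 decoder layers:      ", cfg_v3.decoder_layers)     # expected: 32
print("large-v3-turbo decoder layers:", cfg_turbo.decoder_layers)  # expected: 4
```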
More details are available on the original openai/whisper-large-v3-turbo model card.
## Fine-tuning
I fine-tuned the Whisper-large-v3-turbo model on the Kinyarwanda ASR Track A dataset, which consists of over 90,000 audio files of 10 to 40 seconds, each paired with its text transcription.
Before fine-tuning, the recordings, originally encoded with the Opus codec in WebM (Matroska) containers at a 48,000 Hz sample rate, were converted to .wav files at a 16,000 Hz sample rate to match the model's input requirements; a minimal conversion sketch is shown below.
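The conversion can be done, for example, with ffmpeg called from Python. This is only an illustrative sketch (it assumes ffmpeg is installed, and the `in_dir`/`out_dir` paths are hypothetical), not the exact preprocessing script used for training:

```python
import subprocess
from pathlib import Path

in_dir = Path("webm_audio")   # hypothetical folder with the original .webm (Opus) recordings
out_dir = Path("wav_audio")   # hypothetical output folder for 16 kHz mono .wav files
out_dir.mkdir(exist_ok=True)

for webm_path in in_dir.glob("*.webm"):
    wav_path = out_dir / (webm_path.stem + ".wav")
    # -ar 16000: resample to 16 kHz, -ac 1: downmix to mono
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(webm_path), "-ar", "16000", "-ac", "1", str(wav_path)],
        check=True,
    )
```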
## Configuration
- Trainable layers: 15 encoder layers (progressively unfrozen, 2 layers every 2 epochs) and all 4 decoder layers
- Learning rate = 7e-6
- Batch size = 2 (for both dataloaders)
- Gradient accumulation steps = 8
- Optimizer = AdamW
- Weight decay = 0.1
- Epochs = 10
- Scheduler = Linear (with warmup = 0.05)
Dropout:
- Encoder: 0.3 if idx == 20 else 0.2 if idx in [21, 22, 29, 30] else 0.0
- Decoder: 0.3 if idx == 1 else 0.1
Early Stopping: patience=3, min_delta=0.0005
A checkpoint is saved only when the test loss, Word Error Rate (WER), and Character Error Rate (CER) are all lower than the previously recorded best values. A sketch of this training setup follows below.
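The following is a minimal sketch of how such a configuration could be wired up with `transformers` and PyTorch. It is not the exact training script: the unfreezing helper and the step count estimate are illustrative assumptions based on the settings listed above.

```python
import torch
from transformers import WhisperForConditionalGeneration, get_linear_schedule_with_warmup

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3-turbo")

# Per-layer dropout, mirroring the values listed above.
for idx, layer in enumerate(model.model.encoder.layers):
    layer.dropout = 0.3 if idx == 20 else 0.2 if idx in [21, 22, 29, 30] else 0.0
for idx, layer in enumerate(model.model.decoder.layers):
    layer.dropout = 0.3 if idx == 1 else 0.1

def unfreeze_top_encoder_layers(model, n):
    """Freeze the encoder, then unfreeze its top n layers.

    Hypothetical helper: called every 2 epochs with n increased by 2 (up to 15)
    to mimic the progressive unfreezing schedule described above.
    """
    for p in model.model.encoder.parameters():
        p.requires_grad = False
    for layer in model.model.encoder.layers[-n:]:
        for p in layer.parameters():
            p.requires_grad = True

unfreeze_top_encoder_layers(model, n=2)

# AdamW with weight decay 0.1, linear schedule with 5% warmup,
# batch size 2 and gradient accumulation of 8 over 10 epochs.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=7e-6, weight_decay=0.1
)
# Rough step count: epochs * examples / (batch size * accumulation steps).
num_training_steps = 10 * 90_000 // (2 * 8)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.05 * num_training_steps),
    num_training_steps=num_training_steps,
)
```

Note that if more encoder layers are unfrozen later in training, the optimizer would need to be rebuilt (or the new parameters added to it) so they actually receive updates.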
## Results

The fine-tuned model was saved at epoch 5 with:
- WER: 16.11%
- CER: 3.28%
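WER and CER on a held-out set can be computed with, for example, the jiwer library (an assumption for illustration, not necessarily the tooling used during training):

```python
import jiwer

references = ["muraho neza"]   # ground-truth transcriptions
hypotheses = ["muraho neza"]   # model outputs

print("WER:", jiwer.wer(references, hypotheses))
print("CER:", jiwer.cer(references, hypotheses))
```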
## How to use
If you want to transcribe a mono-channel audio file (.wav) containing a single speaker, use the following code:
```python
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_name = "ionut-visan/whisper-large-v3-turbo_kinyarwanda500"

# Load processor and model
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

def preprocess_audio(audio_path, processor):
    """Load the audio, resample to 16 kHz if needed, and extract input features."""
    waveform, sample_rate = torchaudio.load(audio_path)
    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
        waveform = resampler(waveform)
    inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
    return {key: val.to(device) for key, val in inputs.items()}

def transcribe(audio_path, model, processor):
    """Generate a transcription for a single audio file."""
    inputs = preprocess_audio(audio_path, processor)
    with torch.no_grad():
        generated_ids = model.generate(inputs["input_features"])
    transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
    return transcription[0]

# Define audio path
audio_file = "audio.wav"
transcription = transcribe(audio_file, model, processor)
print("Transcription:", transcription)
```