Nadhari/swa-csm-1b
A fine-tuned version of Sesame's CSM-1B for Swahili text-to-speech with voice cloning capabilities.
Model Description
This model was fine-tuned on Mozilla Common Voice Swahili data (single female speaker, approximately 6,000 samples) to learn Swahili phonology and pronunciation patterns. It supports:
- Text-to-Speech: Generate natural Swahili speech from text
- Voice Cloning: Clone any voice using a short reference audio clip
- Zero-shot Generation: Generate speech without voice cloning context
Model Details
| Property | Value |
|---|---|
| Base Model | sesame/csm-1b |
| Parameters | ~1B |
| Audio Sample Rate | 24,000 Hz |
| Audio Codec | Mimi (32 codebooks) |
| Text Tokenizer | Llama-3 |
| Training Data | Mozilla Common Voice Swahili |
| Fine-tuning | Single female speaker (~6,000 samples) |
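As a quick check that audio will be prepared at the rate listed above, the loaded processor can report its sampling rate. This is a minimal sketch; the attribute path `feature_extractor.sampling_rate` is an assumption about the Transformers CSM processor rather than something documented in this card.

```python
from transformers import AutoProcessor

# sanity-check the audio front-end listed in the table above; the attribute
# path is an assumption about the Transformers CSM processor, not this card
processor = AutoProcessor.from_pretrained("Nadhari/swa-csm-1b")
print(processor.feature_extractor.sampling_rate)  # expected: 24000 (Mimi codec)
```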
Requirements
CSM is available natively in Hugging Face Transformers as of version 4.52.1.
```bash
pip install "transformers>=4.52.1" torch torchaudio
```
Usage
Generate a sentence
```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor

model_id = "Nadhari/swa-csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# prepare the inputs
text = "[0]Habari za asubuhi. Karibu sana."  # `[0]` is the prefix for speaker id 0
inputs = processor(text, add_special_tokens=True).to(device)

# another, equivalent way to prepare the inputs
conversation = [
    {"role": "0", "content": [{"type": "text", "text": "Habari za asubuhi. Karibu sana."}]},
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

# run inference and save the generated audio
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "output.wav")
```
CSM sounds best when provided with context (Voice Cloning)
```python
import torch
import torchaudio
from transformers import CsmForConditionalGeneration, AutoProcessor

model_id = "Nadhari/swa-csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# load and prepare reference audio (mono, 24 kHz)
ref_audio, sr = torchaudio.load("reference_voice.wav")
ref_audio = ref_audio.mean(dim=0)  # downmix to mono if the file is stereo
ref_audio = torchaudio.functional.resample(ref_audio, sr, 24000)

# trim to 10 seconds max for optimal results
max_samples = 10 * 24000
if ref_audio.shape[0] > max_samples:
    ref_audio = ref_audio[:max_samples]

# prepare conversation with context
conversation = [
    # 1. context (reference audio with its exact transcript)
    {
        "role": "0",
        "content": [
            {"type": "text", "text": "Hii ni sauti yangu ya kawaida nikizungumza Kiswahili."},
            {"type": "audio", "audio": ref_audio.numpy()},
        ],
    },
    # 2. text prompt (what you want to generate)
    {
        "role": "0",
        "content": [
            {"type": "text", "text": "Napenda sana kupika chakula cha Kiswahili."},
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

# run inference and save the generated audio
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "cloned_voice.wav")
```
Multi-segment Context
For better voice cloning quality, you can provide multiple reference segments:
```python
import torch
import torchaudio
from transformers import CsmForConditionalGeneration, AutoProcessor

model_id = "Nadhari/swa-csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# load a reference clip as mono 24 kHz numpy audio
def load_audio(path):
    audio, sr = torchaudio.load(path)
    audio = audio.mean(dim=0)  # downmix to mono if the file is stereo
    audio = torchaudio.functional.resample(audio, sr, 24000)
    return audio.numpy()

# build conversation with multiple context segments
conversation = [
    {
        "role": "0",
        "content": [
            {"type": "text", "text": "Habari za asubuhi."},
            {"type": "audio", "audio": load_audio("clip1.wav")},
        ],
    },
    {
        "role": "0",
        "content": [
            {"type": "text", "text": "Karibu sana nyumbani kwetu."},
            {"type": "audio", "audio": load_audio("clip2.wav")},
        ],
    },
    # text to generate
    {
        "role": "0",
        "content": [
            {"type": "text", "text": "Watoto wanapenda kucheza mpira wa miguu shuleni."},
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "output.wav")
```
Generation Parameters
| Parameter | Default | Description |
|---|---|---|
| `do_sample` | `True` | Enable sampling for more natural speech |
| `temperature` | 0.9 | Sampling temperature (0.7-1.2); lower = more consistent |
| `depth_decoder_temperature` | 0.9 | Temperature for the audio (depth) decoder |
Example with Custom Parameters
```python
# conservative generation (more consistent)
audio = model.generate(
    **inputs,
    output_audio=True,
    do_sample=True,
    temperature=0.7,
    depth_decoder_temperature=0.7,
)

# expressive generation (more variation)
audio = model.generate(
    **inputs,
    output_audio=True,
    do_sample=True,
    temperature=1.1,
    depth_decoder_temperature=1.1,
)
```
Best Practices
Reference Audio Guidelines
- Duration: 5-10 seconds is optimal (minimum 3 s, maximum 15 s)
- Quality: Clean audio without background noise
- Sample Rate: Any source rate; resample to 24 kHz before passing the clip to the processor, as in the examples above
- Transcript: Must match the reference audio exactly (a small checker sketch follows this list)
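The helper below is a minimal sketch for vetting a reference clip against these guidelines before it goes into the conversation context; the function name is illustrative and the thresholds simply mirror the list above, so nothing here is part of the repository itself.

```python
import torchaudio

def check_reference_audio(path, target_sr=24000, min_s=3.0, max_s=15.0):
    """Load a clip, downmix to mono, resample to 24 kHz, and warn when the
    duration falls outside the recommended window."""
    audio, sr = torchaudio.load(path)
    audio = audio.mean(dim=0)  # downmix to mono if the file is stereo
    audio = torchaudio.functional.resample(audio, sr, target_sr)
    duration = audio.shape[0] / target_sr
    if not (min_s <= duration <= max_s):
        print(f"warning: {path} is {duration:.1f}s; 5-10 s works best")
    return audio
```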
Text Input Guidelines
- Use proper punctuation for natural prosody
- Write numbers as words: "21" -> "ishirini na moja"
- Expand abbreviations: "Dr." -> "Daktari" (see the normalization sketch after this list)
- Optimal length: 5-30 words per generation
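Light text normalization can be scripted before generation. The sketch below uses a deliberately tiny lookup table built from the two examples above; a real Swahili normalizer needs far broader number and abbreviation coverage than shown here.

```python
import re

# illustrative-only lookup tables built from the two examples above
NUMBERS = {"21": "ishirini na moja"}
ABBREVIATIONS = {"Dr.": "Daktari"}

def normalize(text):
    for abbr, expansion in ABBREVIATIONS.items():
        text = text.replace(abbr, expansion)
    for digits, words in NUMBERS.items():
        text = re.sub(rf"\b{digits}\b", words, text)
    return text

print(normalize("Dr. Amina ana miaka 21."))  # Daktari Amina ana miaka ishirini na moja.
```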
Memory Requirements
- Model size: ~4 GB VRAM
- Generation: +1-2 GB depending on context length
- Recommended: GPU with 8 GB+ VRAM (or load in bfloat16, as sketched below)
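If those numbers are tight for your GPU, loading the weights in bfloat16 (the precision used during fine-tuning, per the Training Configuration table below) roughly halves the memory footprint. This is a sketch of a standard Transformers loading option, not a setting this card specifically documents.

```python
import torch
from transformers import CsmForConditionalGeneration

# bfloat16 weights roughly halve the ~4 GB float32 footprint quoted above
model = CsmForConditionalGeneration.from_pretrained(
    "Nadhari/swa-csm-1b",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)
```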
Training Details
Dataset
- Source: Mozilla Common Voice Swahili
- Speaker: Single female speaker with highest recording count
- Samples: ~6,000 audio-text pairs
- Split: 90% train / 10% validation
Training Configuration
| Parameter | Value |
|---|---|
| Batch Size | 4 |
| Gradient Accumulation | 4 |
| Learning Rate | 3e-5 |
| Weight Decay | 0.002 |
| Epochs | 3 |
| Optimizer | AdamW |
| Scheduler | Cosine Annealing |
| Precision | bfloat16 |
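Taken together, the batch size of 4 with 4 gradient-accumulation steps gives an effective batch size of 16. The snippet below is only a sketch of the optimizer and scheduler choices from the table in plain PyTorch; the stand-in module and step count are placeholders, not values from the actual training run.

```python
import torch

model = torch.nn.Linear(8, 8)   # placeholder module standing in for CSM-1B
num_training_steps = 1000       # placeholder; depends on dataset size and epochs

# AdamW with lr 3e-5 / weight decay 0.002 and cosine annealing, as in the table
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.002)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_training_steps)
```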
Training Infrastructure
- Hardware: NVIDIA A100 80GB
- Training Time: ~1-2 hours
- Framework: PyTorch + CSM codebase
Limitations
- Optimized for Swahili; other languages may have degraded quality
- Single-speaker fine-tuning means voice cloning quality may vary
- Long sentences (50+ words) may have reduced quality; split them into shorter segments (see the sketch after this list)
- Does not generate text; use with a separate LLM for conversational applications
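To work around the long-sentence limitation, text can be split on sentence boundaries and synthesized piece by piece. The sketch below reuses `processor`, `model`, and `device` from the Usage examples; `split_sentences` and `long_text` are illustrative names, not part of this repository.

```python
import re

def split_sentences(text, max_words=30):
    """Break text on sentence punctuation, then cap each chunk at roughly
    `max_words` words, per the text input guidelines above."""
    chunks = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = sentence.split()
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i:i + max_words]))
    return chunks

long_text = "Habari za asubuhi. Karibu sana nyumbani kwetu. Asante sana kwa msaada wako."
for i, chunk in enumerate(split_sentences(long_text)):
    inputs = processor(f"[0]{chunk}", add_special_tokens=True).to(device)
    audio = model.generate(**inputs, output_audio=True)
    processor.save_audio(audio, f"segment_{i}.wav")
```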
Sample Sentences for Testing
```python
test_sentences = [
    "Habari za asubuhi. Karibu sana nyumbani kwetu.",
    "Watoto wanapenda kucheza mpira wa miguu shuleni.",
    "Jua linawaka sana leo, tutaenda kuogelea baharini.",
    "Napenda kupika pilau, maharage, mchicha na maandazi.",
    "Asante sana kwa msaada wako. Tutaonana kesho.",
]
```
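With `processor`, `model`, and `device` loaded as in the Usage section, a short loop renders each test sentence to its own file (the filename pattern is just an example):

```python
for i, sentence in enumerate(test_sentences):
    inputs = processor(f"[0]{sentence}", add_special_tokens=True).to(device)
    audio = model.generate(**inputs, output_audio=True)
    processor.save_audio(audio, f"test_{i}.wav")
```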
Misuse and Abuse
This project provides a high-quality speech generation model for research and educational purposes. While we encourage responsible and ethical use, we explicitly prohibit the following:
- Impersonation or Fraud: Do not use this model to generate speech that mimics real individuals without their explicit consent.
- Misinformation or Deception: Do not use this model to create deceptive or misleading content.
- Illegal or Harmful Activities: Do not use this model for any illegal, harmful, or malicious purposes.
By using this model, you agree to comply with all applicable laws and ethical guidelines.
Citation
```bibtex
@misc{nadhari-swa-csm-1b,
  author = {Nadhari AI Lab},
  title = {Swahili CSM-1B: Fine-tuned Text-to-Speech Model for Swahili},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Nadhari/swa-csm-1b}
}

@misc{sesame-csm-1b,
  author = {Sesame},
  title = {CSM-1B: Conversational Speech Model},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/sesame/csm-1b}
}
```
License
Apache 2.0
Acknowledgments
- Sesame for the original CSM-1B model
- Mozilla Common Voice for the Swahili dataset
- HuggingFace for model hosting
Contact
- Organization: Nadhari AI Lab
- Focus: AI development for Sub-Saharan Africa