Nadhari/swa-csm-1b

A fine-tuned version of Sesame's CSM-1B for Swahili text-to-speech with voice cloning capabilities.

Model Description

This model was fine-tuned on Mozilla Common Voice Swahili data (single female speaker, approximately 6,000 samples) to learn Swahili phonology and pronunciation patterns. It supports:

  • Text-to-Speech: Generate natural Swahili speech from text
  • Voice Cloning: Clone any voice using a short reference audio clip
  • Zero-shot Generation: Generate speech without voice cloning context

Model Details

| Property | Value |
|---|---|
| Base Model | sesame/csm-1b |
| Parameters | ~1B |
| Audio Sample Rate | 24,000 Hz |
| Audio Codec | Mimi (32 codebooks) |
| Text Tokenizer | Llama-3 |
| Training Data | Mozilla Common Voice Swahili |
| Fine-tuning | Single female speaker (~6,000 samples) |

Requirements

CSM is available natively in Hugging Face Transformers as of version 4.52.1.

pip install "transformers>=4.52.1" torch torchaudio

Usage

Generate a sentence

import torch
from transformers import CsmForConditionalGeneration, AutoProcessor

model_id = "Nadhari/swa-csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# prepare the inputs
text = "[0]Habari za asubuhi. Karibu sana."  # [0] for speaker id 0
inputs = processor(text, add_special_tokens=True).to(device)

# another equivalent way to prepare the inputs
conversation = [
    {"role": "0", "content": [{"type": "text", "text": "Habari za asubuhi. Karibu sana."}]},
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

# infer the model
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "output.wav")

CSM sounds best when provided with context (Voice Cloning)

import torch
import torchaudio
from transformers import CsmForConditionalGeneration, AutoProcessor

model_id = "Nadhari/swa-csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# load and prepare reference audio (must be 24kHz)
ref_audio, sr = torchaudio.load("reference_voice.wav")
ref_audio = torchaudio.functional.resample(ref_audio.squeeze(0), sr, 24000)

# trim to 10 seconds max for optimal results
max_samples = 10 * 24000
if ref_audio.shape[0] > max_samples:
    ref_audio = ref_audio[:max_samples]

# prepare conversation with context
conversation = [
    # 1. context (reference audio with transcript)
    {
        "role": "0",
        "content": [
            {"type": "text", "text": "Hii ni sauti yangu ya kawaida nikizungumza Kiswahili."},
            {"type": "audio", "audio": ref_audio.numpy()},
        ],
    },
    # 2. text prompt (what you want to generate)
    {
        "role": "0",
        "content": [
            {"type": "text", "text": "Napenda sana kupika chakula cha Kiswahili."},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

# infer the model
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "cloned_voice.wav")

Multi-segment Context

For better voice cloning quality, you can provide multiple reference segments:

import torch
import torchaudio
from transformers import CsmForConditionalGeneration, AutoProcessor

model_id = "Nadhari/swa-csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# load multiple reference clips
def load_audio(path):
    audio, sr = torchaudio.load(path)
    audio = torchaudio.functional.resample(audio.squeeze(0), sr, 24000)
    return audio.numpy()

# build conversation with multiple context segments
conversation = [
    {
        "role": "0",
        "content": [
            {"type": "text", "text": "Habari za asubuhi."},
            {"type": "audio", "audio": load_audio("clip1.wav")},
        ],
    },
    {
        "role": "0",
        "content": [
            {"type": "text", "text": "Karibu sana nyumbani kwetu."},
            {"type": "audio", "audio": load_audio("clip2.wav")},
        ],
    },
    # text to generate
    {
        "role": "0",
        "content": [
            {"type": "text", "text": "Watoto wanapenda kucheza mpira wa miguu shuleni."},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "output.wav")

Generation Parameters

| Parameter | Default | Description |
|---|---|---|
| do_sample | True | Enable sampling for more natural speech |
| temperature | 0.9 | Sampling temperature (recommended range 0.7-1.2); lower values are more consistent |
| depth_decoder_temperature | 0.9 | Temperature for the audio depth decoder |

Example with Custom Parameters

# conservative generation (more consistent)
audio = model.generate(
    **inputs,
    output_audio=True,
    do_sample=True,
    temperature=0.7,
    depth_decoder_temperature=0.7,
)

# expressive generation (more variation)
audio = model.generate(
    **inputs,
    output_audio=True,
    do_sample=True,
    temperature=1.1,
    depth_decoder_temperature=1.1,
)

Best Practices

Reference Audio Guidelines

  • Duration: 5-10 seconds optimal (minimum 3s, maximum 15s)
  • Quality: Clean audio without background noise
  • Sample Rate: Resample reference audio to 24 kHz before passing it to the processor (see the preparation sketch after this list)
  • Transcript: Must match the reference audio exactly
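
A minimal sketch of preparing a reference clip along these guidelines; prepare_reference is a hypothetical helper, and the file name and duration bounds are illustrative:

import torchaudio

def prepare_reference(path, target_sr=24_000, min_s=3, max_s=15):
    """Load a reference clip, convert it to mono 24 kHz, and sanity-check its duration."""
    audio, sr = torchaudio.load(path)
    audio = audio.mean(dim=0)  # mix down to mono if the file is stereo
    audio = torchaudio.functional.resample(audio, sr, target_sr)
    duration = audio.shape[0] / target_sr
    if not (min_s <= duration <= max_s):
        raise ValueError(f"Clip is {duration:.1f}s; aim for {min_s}-{max_s}s (5-10s is ideal)")
    return audio.numpy()

ref_audio = prepare_reference("reference_voice.wav")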

Text Input Guidelines

  • Use proper punctuation for natural prosody
  • Write numbers as words: "21" -> "ishirini na moja"
  • Expand abbreviations: "Dr." -> "Daktari" (see the normalization sketch after this list)
  • Optimal length: 5-30 words per generation
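
A rough illustration of this kind of normalization; the replacement table below is a hypothetical stub, not a complete Swahili text frontend:

# illustrative only: a tiny lookup-based normalizer
REPLACEMENTS = {
    "Dr.": "Daktari",
    "21": "ishirini na moja",
}

def normalize_text(text):
    """Expand a few known abbreviations and digits into words before synthesis."""
    for raw, spoken in REPLACEMENTS.items():
        text = text.replace(raw, spoken)
    return text

print(normalize_text("Dr. Amina ana miaka 21."))
# -> Daktari Amina ana miaka ishirini na moja.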

Memory Requirements

  • Model size: ~4GB VRAM
  • Generation: +1-2GB depending on context length
  • Recommended: GPU with 8 GB+ VRAM (a reduced-precision loading sketch follows this list)
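
If VRAM is tight, the model can be loaded in reduced precision. A sketch, assuming a GPU with bfloat16 support (e.g. Ampere or newer):

import torch
from transformers import CsmForConditionalGeneration, AutoProcessor

model_id = "Nadhari/swa-csm-1b"

processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=torch.bfloat16,  # roughly halves weight memory compared to float32
)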

Training Details

Dataset

  • Source: Mozilla Common Voice Swahili
  • Speaker: Single female speaker with highest recording count
  • Samples: ~6,000 audio-text pairs
  • Split: 90% train / 10% validation

Training Configuration

| Parameter | Value |
|---|---|
| Batch Size | 4 |
| Gradient Accumulation | 4 |
| Learning Rate | 3e-5 |
| Weight Decay | 0.002 |
| Epochs | 3 |
| Optimizer | AdamW |
| Scheduler | Cosine Annealing |
| Precision | bfloat16 |
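
For reference, the configuration above could be expressed with a Hugging Face TrainingArguments-style setup. This is only a sketch of the hyperparameters, not the actual training script, and the output directory is illustrative:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="swa-csm-1b-finetune",   # illustrative path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=3e-5,
    weight_decay=0.002,
    num_train_epochs=3,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    bf16=True,
)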

Training Infrastructure

  • Hardware: NVIDIA A100 80GB
  • Training Time: ~1-2 hours
  • Framework: PyTorch + CSM codebase

Limitations

  • Optimized for Swahili; other languages may have degraded quality
  • Single-speaker fine-tuning means voice cloning quality may vary
  • Long sentences (50+ words) may have reduced quality; split them into shorter segments (see the splitting sketch after this list)
  • Does not generate text; use with a separate LLM for conversational applications
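
One naive way to split long input, as suggested in the list above; the regex rule is illustrative, and each chunk is then generated exactly as in the usage examples:

import re

def split_sentences(text):
    """Naively split text on sentence-final punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

long_text = (
    "Habari za asubuhi. Karibu sana nyumbani kwetu. "
    "Watoto wanapenda kucheza mpira wa miguu shuleni."
)
for chunk in split_sentences(long_text):
    print(chunk)  # synthesize each chunk separately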

Sample Sentences for Testing

test_sentences = [
    "Habari za asubuhi. Karibu sana nyumbani kwetu.",
    "Watoto wanapenda kucheza mpira wa miguu shuleni.",
    "Jua linawaka sana leo, tutaenda kuogelea baharini.",
    "Napenda kupika pilau, maharage, mchicha na maandazi.",
    "Asante sana kwa msaada wako. Tutaonana kesho.",
]
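
For example, each test sentence can be synthesized in a loop. This assumes the processor, model, and device from the zero-shot example in the Usage section; the output file names are illustrative:

for i, sentence in enumerate(test_sentences):
    inputs = processor(f"[0]{sentence}", add_special_tokens=True).to(device)
    audio = model.generate(**inputs, output_audio=True)
    processor.save_audio(audio, f"test_{i}.wav")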

Misuse and Abuse

This project provides a high-quality speech generation model for research and educational purposes. While we encourage responsible and ethical use, we explicitly prohibit the following:

  • Impersonation or Fraud: Do not use this model to generate speech that mimics real individuals without their explicit consent.
  • Misinformation or Deception: Do not use this model to create deceptive or misleading content.
  • Illegal or Harmful Activities: Do not use this model for any illegal, harmful, or malicious purposes.

By using this model, you agree to comply with all applicable laws and ethical guidelines.


Citation

@misc{nadhari-swa-csm-1b,
  author = {Nadhari AI Lab},
  title = {Swahili CSM-1B: Fine-tuned Text-to-Speech Model for Swahili},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Nadhari/swa-csm-1b}
}

@misc{sesame-csm-1b,
  author = {Sesame},
  title = {CSM-1B: Conversational Speech Model},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/sesame/csm-1b}
}

License

Apache 2.0


Acknowledgments

This model builds on Sesame's CSM-1B base model and the Mozilla Common Voice Swahili dataset.

Contact

  • Organization: Nadhari AI Lab
  • Focus: AI development for Sub-Saharan Africa