Spark-TTS Arabic

نموذج تحويل النص إلى كلام باللغة العربية (Arabic text-to-speech model)

An Arabic text-to-speech model fine-tuned on roughly 300 hours of clean Arabic audio. It delivers consistent, high-quality speech synthesis for fully diacritized Modern Standard Arabic text.

Model Details

Training Data: ~300 hours of clean Arabic audio
Language: Modern Standard Arabic (MSA)
Sample Rate: 24 kHz

Usage

Quick Start

See the Colab notebook, or try the model in the Hugging Face Space: Arabic Spark TTS Space.

from transformers import AutoProcessor, AutoModel
import soundfile as sf
import torch

# Load model
model_id = "IbrahimSalah/Arabic-TTS-Spark"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval().to(device)

# Prepare inputs
inputs = processor(
    text="YOUR_TEXT_WITH_TASHKEEL",
    prompt_speech_path="path/to/reference.wav",
    prompt_text="REFERENCE_TEXT_WITH_TASHKEEL",
    return_tensors="pt"
).to(device)

# Generate
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=8000, temperature=0.8)

# Decode
output = processor.decode(generated_ids=output_ids)
sf.write("output.wav", output["audio"], output["sampling_rate"])

Key Features

  • High-quality Arabic speech synthesis with natural prosody
  • Efficient voice cloning from reference audio
  • Advanced text chunking for long-form content
  • Built-in audio post-processing (normalization, silence removal, crossfading)
  • Works best with moderate text lengths
  • Adjustable generation parameters (temperature, top_k, top_p)
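The post-processing steps listed above can be illustrated with a short sketch. The card does not show the model's own post-processing code, so this is only one plausible approach, assuming mono floating-point audio held in NumPy arrays at the 24 kHz output rate. `crossfade_concat` and `peak_normalize` are hypothetical helpers, and the linear fade shape and 0.95 peak target are assumed choices.

```python
import numpy as np

SAMPLE_RATE = 24_000  # the model's stated output sample rate


def crossfade_concat(a: np.ndarray, b: np.ndarray,
                     duration: float = 0.08,
                     sample_rate: int = SAMPLE_RATE) -> np.ndarray:
    """Join two mono clips, blending the end of `a` into the start of `b`."""
    # Overlap length in samples, capped by the shorter clip.
    n = min(round(duration * sample_rate), len(a), len(b))
    if n == 0:
        return np.concatenate([a, b])
    fade_out = np.linspace(1.0, 0.0, n)  # gain applied to the tail of `a`
    fade_in = 1.0 - fade_out             # gain applied to the head of `b`
    overlap = a[-n:] * fade_out + b[:n] * fade_in
    return np.concatenate([a[:-n], overlap, b[n:]])


def peak_normalize(audio: np.ndarray, peak: float = 0.95) -> np.ndarray:
    """Scale so the largest absolute sample equals `peak`; silence is unchanged."""
    max_abs = float(np.max(np.abs(audio))) if audio.size else 0.0
    return audio if max_abs == 0.0 else audio * (peak / max_abs)
```

Because the two fade gains sum to one at every sample, a constant signal passes through the seam unchanged, and the joined clip is shorter than the two inputs by the overlap length.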

Input Requirements

Critical: Text must include full Arabic diacritization (tashkeel). The model was trained exclusively on fully diacritized text and performs poorly on undiacritized input.

Example of correct input:

إِنَّ الْعِلْمَ نُورٌ يُقْذَفُ فِي الْقَلْبِ
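Since undiacritized input degrades the output, it can help to screen text before synthesis. The sketch below is not part of the model's API: `diacritization_ratio` and `looks_diacritized` are hypothetical helpers, and the 0.5 threshold is an assumption for illustration. It measures the fraction of Arabic letters that are immediately followed by a tashkeel mark.

```python
import unicodedata

# Tashkeel marks: fathatan through sukun (U+064B to U+0652) plus dagger alif (U+0670).
TASHKEEL = {chr(c) for c in range(0x064B, 0x0653)} | {"\u0670"}


def is_arabic_letter(ch: str) -> bool:
    """True for letters in the Arabic block (category Lo), excluding marks and digits."""
    return "\u0600" <= ch <= "\u06FF" and unicodedata.category(ch) == "Lo"


def diacritization_ratio(text: str) -> float:
    """Fraction of Arabic letters immediately followed by a tashkeel mark."""
    letters = 0
    marked = 0
    for i, ch in enumerate(text):
        if not is_arabic_letter(ch):
            continue
        letters += 1
        if i + 1 < len(text) and text[i + 1] in TASHKEEL:
            marked += 1
    return marked / letters if letters else 0.0


def looks_diacritized(text: str, threshold: float = 0.5) -> bool:
    """Heuristic gate; the threshold is an assumed value, not a documented one."""
    return diacritization_ratio(text) >= threshold
```

Even fully diacritized text scores below 1.0, because long vowels and the alif of the definite article usually carry no mark, which is why the threshold is set well under 1.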

Generation Parameters

# `tts` is an already-constructed inference object; its setup is not shown in this section.
tts.generate_long_text(
    text=your_text,
    prompt_audio_path="reference.wav",
    prompt_transcript="reference_text",
    output_path="output.wav",
    max_chunk_length=300,        # Characters per chunk
    crossfade_duration=0.08,     # Crossfade duration in seconds
    normalize_audio_flag=True,
    remove_silence_flag=True,
    temperature=0.8,             # Generation randomness
    top_p=0.95,                  # Nucleus sampling
    top_k=50                     # Top-k sampling
)
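The `max_chunk_length` parameter implies that long text is split before synthesis. The card does not show how `generate_long_text` does this, so the following is only a sketch of one plausible approach: split at sentence-ending punctuation, then greedily pack whole sentences into chunks no longer than the limit. `split_into_chunks` is a hypothetical helper, not a function of the model.

```python
import re

# Split on whitespace that follows sentence-ending punctuation
# (Latin . ! ? plus Arabic question mark U+061F and Arabic semicolon U+061B).
SENTENCE_END = re.compile(r"(?<=[.!?\u061F\u061B])\s+")


def split_into_chunks(text: str, max_chunk_length: int = 300) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chunk_length characters.

    A single sentence longer than the limit is kept whole rather than cut mid-word.
    """
    sentences = [s.strip() for s in SENTENCE_END.split(text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chunk_length:
            # Adding this sentence would exceed the limit: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Note that `len` counts each tashkeel mark as a character, so fully diacritized text reaches a 300-character limit sooner than its letter count alone would suggest.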

Sample Output

Text: "إِنَّ الدَّوْلَةَ لَهَا أَعْمَارٌ طَبِيعِيَّةٌ كَمَا لِلْأَشْخَاصِ. وَأَنَّهَا تَنْتَقِلُ فِي أَطْوَارٍ مُخْتَلِفَةٍ، فَيَكُونُ الْجِيلُ الْأَوَّلُ مِنْ أَهْلِ الدَّوْلَةِ، قَدْ حَافَظُوا عَلَى الْخُشُونَةِ الْبَدَوِيَّةِ، وَالتَّوَحُّشِ، وَالشَّظَفِ، وَالْبَأْسِ، وَالِاشْتِرَاكِ فِي الْمَجْدِ. فَتَكُونُ حُدُودُهُمْ مَرْهُوبَةً، وَجَوَانِبُهُمْ مُعَزَّزَةً. ثُمَّ يَأْتِي الْجِيلُ الثَّانِي، فَيَتَحَوَّلُ حَالُهُمْ بِالْمُلْكِ وَالتَّرَفِ مِنَ الْبَدَاوَةِ إِلَى الْحَضَارَةِ، وَمِنَ الْخُشُونَةِ إِلَى التَّرَفِ. فَيَنْكَسِرُ سَوْرَةُ الْعَصَبِيَّةِ قَلِيلًا. ثُمَّ يَأْتِي الْجِيلُ الثَّالِثُ، فَيَكُونُونَ قَدْ نَسُوا عَهْدَ الْبَدَاوَةِ وَالْخُشُونَةِ، وَيَنْغَمِسُونَ فِي النَّعِيمِ وَالتَّرَفِ، وَيَصِيرُونَ عِيَالًا عَلَى الدَّوْلَةِ. فَيَسْقُطُونَ فِي الْهَرَمِ وَالزَّوَالِ، وَيَحْتَاجُونَ إِلَى مَنْ يُدَافِعُ عَنْهُمْ، فَتَبْدَأُ الدَّوْلَةُ فِي الِانْقِرَاضِ."

Reference audio

Further Fine-tuning

The model can be further fine-tuned for:

  • Non-diacritized text (requires additional training)
  • Specific voice characteristics
  • Domain-specific vocabulary
  • Dialectal variations

Fine-tuning infrastructure: Spark-TTS Fine-tune

License

This model is released under a Non-Commercial License.

  • You may use this model for research, educational, and personal non-commercial purposes.
  • Commercial use is strictly prohibited without explicit permission.
  • If you wish to use this model for commercial purposes, please contact the model author.

Acknowledgments

Limitations

  • Requires fully diacritized Arabic text as input
  • Optimized for Modern Standard Arabic (MSA), not dialectal Arabic
  • Performance may vary with very long texts without proper chunking
  • Voice cloning quality depends on reference audio quality and length
  • Generation speed scales with text length