Spark-TTS Arabic

نموذج تحويل النص إلى كلام باللغة العربية

Arabic text-to-speech model fine-tuned on approximately 300 hours of clean Arabic audio. It produces consistent, high-quality speech for Modern Standard Arabic (MSA) and expects fully diacritized input text.

Model Details

Training Data: ~300 hours of clean Arabic audio
Language: Modern Standard Arabic (MSA)
Sample Rate: 24kHz

Usage

Quick Start

See the Colab notebook, or try the model in the Hugging Face Space: Arabic Spark TTS Space.

from transformers import AutoProcessor, AutoModel
import soundfile as sf
import torch

# Load model
model_id = "IbrahimSalah/Arabic-TTS-Spark"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval().to(device)

# Prepare inputs
inputs = processor(
    text="YOUR_TEXT_WITH_TASHKEEL",
    prompt_speech_path="path/to/reference.wav",
    prompt_text="REFERENCE_TEXT_WITH_TASHKEEL",
    return_tensors="pt"
).to(device)

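# Generate. Note: in transformers, temperature only takes effect when sampling
# is enabled (do_sample=True), unless the model's generation config already enables it.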
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=8000, temperature=0.8)

# Decode
output = processor.decode(generated_ids=output_ids)
sf.write("output.wav", output["audio"], output["sampling_rate"])

Key Features

High-quality Arabic speech synthesis with natural prosody
Efficient voice cloning from reference audio
Advanced text chunking for long-form content
Built-in audio post-processing (normalization, silence removal, crossfading)
Works best with moderate text lengths
Adjustable generation parameters (temperature, top_k, top_p)
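
To make the last item concrete, the sketch below shows how temperature, top-k, and top-p (nucleus) filtering are commonly combined when picking the next token. It is a generic illustration in plain Python under stated assumptions, not this model's actual decoding code; the function name `filter_probs` and the toy logits are hypothetical.

```python
import math

def filter_probs(logits, temperature=0.8, top_k=50, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p filtering.

    Hypothetical helper for illustration only; real decoders work on tensors.
    """
    # Temperature scaling: values below 1.0 sharpen the distribution,
    # values above 1.0 flatten it.
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable tokens, then renormalize.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_total = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_total
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so they sum to 1.
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}

if __name__ == "__main__":
    # Toy logits for a four-token vocabulary.
    print(filter_probs([2.0, 1.0, 0.0, -1.0], temperature=1.0, top_k=3, top_p=0.9))
```

Lower values of any of the three parameters make the output more deterministic; higher values add variety at the cost of occasional artifacts.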

Input Requirements

Critical: Text must include full Arabic diacritization (tashkeel). The model was trained exclusively on fully diacritized text and will not perform well on non-diacritized input.

Example of correct input:

إِنَّ الْعِلْمَ نُورٌ يُقْذَفُ فِي الْقَلْبِ
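
Before sending text to the model, it can help to check that it actually carries tashkeel. The following is a rough heuristic sketch, not part of the model's tooling: the helper names and the 0.5 threshold are assumptions, and the check only counts Unicode diacritic marks relative to Arabic letters. It cannot tell whether the diacritics are correct.

```python
# Example sentence from this section, fully diacritized.
EXAMPLE = "إِنَّ الْعِلْمَ نُورٌ يُقْذَفُ فِي الْقَلْبِ"

def diacritic_ratio(text: str) -> float:
    """Number of tashkeel marks divided by number of Arabic letters."""
    # Arabic letters: U+0621..U+063A and U+0641..U+064A (tatweel U+0640 excluded).
    letters = sum(
        1 for ch in text
        if "\u0621" <= ch <= "\u063A" or "\u0641" <= ch <= "\u064A"
    )
    # Tashkeel: U+064B (fathatan) .. U+0652 (sukun), plus superscript alef U+0670.
    marks = sum(1 for ch in text if "\u064B" <= ch <= "\u0652" or ch == "\u0670")
    return marks / letters if letters else 0.0

def looks_diacritized(text: str, min_ratio: float = 0.5) -> bool:
    """Heuristic: min_ratio is an assumed threshold, tune it on your own data."""
    return diacritic_ratio(text) >= min_ratio

if __name__ == "__main__":
    print(looks_diacritized(EXAMPLE))          # diacritized sentence
    print(looks_diacritized("إن العلم نور"))    # same words without tashkeel
```

Text that fails the check would need to pass through a diacritization step before synthesis.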

Generation Parameters

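# `tts` is assumed to be an already-initialized inference wrapper that exposes
# generate_long_text; its setup is not shown in this section.
tts.generate_long_text(
    text=your_text,                      # Fully diacritized input text
    prompt_audio_path="reference.wav",   # Reference audio for voice cloning
    prompt_transcript="reference_text",  # Transcript of the reference audio
    output_path="output.wav",
    max_chunk_length=300,        # Characters per chunk
    crossfade_duration=0.08,     # Crossfade duration in seconds
    normalize_audio_flag=True,   # Apply audio normalization
    remove_silence_flag=True,    # Apply silence removal
    temperature=0.8,             # Generation randomness
    top_p=0.95,                  # Nucleus sampling
    top_k=50                     # Top-k sampling
)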
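
To illustrate what `crossfade_duration` controls: at the 24 kHz sample rate, 0.08 seconds corresponds to 1,920 samples of overlap between consecutive chunks. The sketch below joins two chunks with a linear crossfade in plain Python. The linear fade shape and the `crossfade` function itself are assumptions for illustration; the fade curve actually used by the model's post-processing is not specified here, and real code would typically operate on NumPy arrays.

```python
def crossfade(a, b, sample_rate=24000, crossfade_duration=0.08):
    """Join two mono sample sequences, blending the end of `a` into the start of `b`.

    Hypothetical helper: uses a linear fade, which is an assumption.
    """
    # Overlap length in samples, capped by the length of either chunk.
    n = min(round(sample_rate * crossfade_duration), len(a), len(b))
    if n == 0:
        return list(a) + list(b)

    head = list(a[:-n])   # part of `a` before the overlap
    tail = list(b[n:])    # part of `b` after the overlap
    overlap = []
    for i in range(n):
        # Weight of `b` rises linearly, staying strictly between 0 and 1.
        w = (i + 1) / (n + 1)
        overlap.append(a[len(a) - n + i] * (1.0 - w) + b[i] * w)
    return head + overlap + tail

if __name__ == "__main__":
    first = [1.0] * 3000
    second = [0.0] * 3000
    joined = crossfade(first, second)
    # Output is shorter than the plain concatenation by the overlap length.
    print(len(first) + len(second) - len(joined))
```

Longer crossfades hide chunk boundaries better but consume more of each chunk's audio, so short values such as the 0.08 s shown above are a common compromise.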

Sample Output

Text: "إِنَّ الدَّوْلَةَ لَهَا أَعْمَارٌ طَبِيعِيَّةٌ كَمَا لِلْأَشْخَاصِ. وَأَنَّهَا تَنْتَقِلُ فِي أَطْوَارٍ مُخْتَلِفَةٍ، فَيَكُونُ الْجِيلُ الْأَوَّلُ مِنْ أَهْلِ الدَّوْلَةِ، قَدْ حَافَظُوا عَلَى الْخُشُونَةِ الْبَدَوِيَّةِ، وَالتَّوَحُّشِ، وَالشَّظَفِ، وَالْبَأْسِ، وَالِاشْتِرَاكِ فِي الْمَجْدِ. فَتَكُونُ حُدُودُهُمْ مَرْهُوبَةً، وَجَوَانِبُهُمْ مُعَزَّزَةً. ثُمَّ يَأْتِي الْجِيلُ الثَّانِي، فَيَتَحَوَّلُ حَالُهُمْ بِالْمُلْكِ وَالتَّرَفِ مِنَ الْبَدَاوَةِ إِلَى الْحَضَارَةِ، وَمِنَ الْخُشُونَةِ إِلَى التَّرَفِ. فَيَنْكَسِرُ سَوْرَةُ الْعَصَبِيَّةِ قَلِيلًا. ثُمَّ يَأْتِي الْجِيلُ الثَّالِثُ، فَيَكُونُونَ قَدْ نَسُوا عَهْدَ الْبَدَاوَةِ وَالْخُشُونَةِ، وَيَنْغَمِسُونَ فِي النَّعِيمِ وَالتَّرَفِ، وَيَصِيرُونَ عِيَالًا عَلَى الدَّوْلَةِ. فَيَسْقُطُونَ فِي الْهَرَمِ وَالزَّوَالِ، وَيَحْتَاجُونَ إِلَى مَنْ يُدَافِعُ عَنْهُمْ، فَتَبْدَأُ الدَّوْلَةُ فِي الِانْقِرَاضِ."

Reference audio

Further Fine-tuning

The model can be further fine-tuned for:

Non-diacritized text (requires additional training)
Specific voice characteristics
Domain-specific vocabulary
Dialectal variations

Fine-tuning infrastructure: Spark-TTS Fine-tune

License

This model is released under a Non-Commercial License.

You may use this model for research, educational, and personal non-commercial purposes.
Commercial use is strictly prohibited without explicit permission.
If you wish to use this model for commercial purposes, please contact the model author.

Acknowledgments

Base model: Spark-TTS by tuan12378

Limitations

Requires fully diacritized Arabic text as input
Optimized for Modern Standard Arabic (MSA), not dialectal Arabic
Performance may vary with very long texts without proper chunking
Voice cloning quality depends on reference audio quality and length
Generation speed scales with text length
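
The chunking referred to in the long-text limitation can be approximated as follows. This is a minimal sketch under stated assumptions, not the model's own chunker: the function name `chunk_text` is hypothetical, the default of 300 characters mirrors the `max_chunk_length` value shown under Generation Parameters, and the strategy (split at sentence punctuation, then pack greedily) is one simple choice among many.

```python
import re

def chunk_text(text, max_chunk_length=300):
    """Split text into chunks of at most max_chunk_length characters where possible.

    Hypothetical helper. A single word longer than the limit is kept whole,
    so such a chunk can exceed max_chunk_length.
    """
    # Split after sentence-ending punctuation (Latin and Arabic marks).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?؟؛])\s+", text) if s.strip()]

    chunks, current = [], ""
    for sentence in sentences:
        pieces = [sentence]
        # Break an over-long sentence into word groups that fit the limit.
        if len(sentence) > max_chunk_length:
            pieces, buf = [], ""
            for word in sentence.split():
                candidate = f"{buf} {word}".strip()
                if len(candidate) > max_chunk_length and buf:
                    pieces.append(buf)
                    buf = word
                else:
                    buf = candidate
            if buf:
                pieces.append(buf)

        # Greedily pack pieces into the current chunk.
        for piece in pieces:
            candidate = f"{current} {piece}".strip()
            if len(candidate) > max_chunk_length and current:
                chunks.append(current)
                current = piece
            else:
                current = candidate

    if current:
        chunks.append(current)
    return chunks

if __name__ == "__main__":
    for chunk in chunk_text("aaa bbb. ccc ddd. eee fff.", max_chunk_length=20):
        print(repr(chunk))
```

Splitting at sentence boundaries keeps each chunk prosodically complete, which tends to matter more for speech than hitting the character limit exactly.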

Model tree for IbrahimSalah/Arabic-TTS-Spark

Base model: SparkAudio/Spark-TTS-0.5B