Spark-TTS Arabic
A text-to-speech model for the Arabic language
An Arabic text-to-speech model fine-tuned on approximately 300 hours of clean Arabic audio. It synthesizes consistent, high-quality speech for Modern Standard Arabic (MSA) and expects fully diacritized input text.
Model Details
Training Data: ~300 hours of clean Arabic audio
Language: Modern Standard Arabic (MSA)
Sample Rate: 24kHz
Usage
Quick Start
See the Colab notebook for a quick start, or try the model in the Hugging Face Space: Arabic Spark TTS Space.
from transformers import AutoProcessor, AutoModel
import soundfile as sf
import torch
# Load model
model_id = "IbrahimSalah/Arabic-TTS-Spark"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval().to(device)
# Prepare inputs
inputs = processor(
    text="YOUR_TEXT_WITH_TASHKEEL",
    prompt_speech_path="path/to/reference.wav",
    prompt_text="REFERENCE_TEXT_WITH_TASHKEEL",
    return_tensors="pt",
).to(device)
# Generate
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=8000, temperature=0.8)
# Decode
output = processor.decode(generated_ids=output_ids)
sf.write("output.wav", output["audio"], output["sampling_rate"])
Key Features
- High-quality Arabic speech synthesis with natural prosody
- Efficient voice cloning from reference audio
- Advanced text chunking for long-form content
- Built-in audio post-processing (normalization, silence removal, crossfading)
- Works best with moderate text lengths
- Adjustable generation parameters (temperature, top_k, top_p)
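The model card does not show how long-form text is chunked, so the following is only a minimal sketch of one plausible approach: greedily packing whole sentences into chunks below a character limit. The function name `chunk_text`, the sentence-boundary regex, and the whitespace fallback for oversized sentences are all assumptions made for illustration, not the model's actual implementation.

```python
import re

# Split after sentence-ending punctuation, including the Arabic question mark.
# (Assumed boundary rule; the real pipeline may use a different one.)
_SENTENCE_END = re.compile(r"(?<=[.!?؟])\s+")


def chunk_text(text: str, max_chunk_length: int = 300) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chunk_length characters."""
    chunks: list[str] = []
    current = ""
    for sentence in _SENTENCE_END.split(text.strip()):
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chunk_length:
            current = candidate
            continue
        # The sentence does not fit: flush what we have so far.
        if current:
            chunks.append(current)
        current = ""
        # Re-pack the sentence word by word, in case it is longer than the limit.
        # A single word longer than the limit is kept whole.
        for word in sentence.split():
            candidate = f"{current} {word}" if current else word
            if len(candidate) <= max_chunk_length or not current:
                current = candidate
            else:
                chunks.append(current)
                current = word
    if current:
        chunks.append(current)
    return chunks
```

Splitting at sentence boundaries keeps each chunk prosodically complete, which tends to make the joins between synthesized segments less audible than cutting at a fixed character offset.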
Input Requirements
Critical: Text must include full Arabic diacritization (tashkeel). The model is trained exclusively on fully diacritized text and will not perform well on non-diacritized input.
Example of correct input:
إِنَّ الْعِلْمَ نُورٌ يُقْذَفُ فِي الْقَلْبِ
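Because undiacritized input degrades output quality, it can help to check text before sending it to the model. The sketch below estimates how much of the text carries tashkeel by counting Arabic letters that are immediately followed by a diacritic in the Unicode range U+064B to U+0652. The helper names and the 0.6 threshold are assumptions chosen for illustration; even fully diacritized text leaves some letters unmarked (for example, long-vowel alef), so the threshold should be tuned on real data.

```python
# Arabic tashkeel marks: fathatan (U+064B) through sukun (U+0652).
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0653)}


def is_arabic_letter(ch: str) -> bool:
    """True for Arabic base letters (U+0621 to U+064A), excluding tatweel (U+0640)."""
    return "\u0621" <= ch <= "\u064A" and ch != "\u0640"


def tashkeel_ratio(text: str) -> float:
    """Fraction of Arabic letters directly followed by at least one tashkeel mark."""
    letters = 0
    marked = 0
    for i, ch in enumerate(text):
        if not is_arabic_letter(ch):
            continue
        letters += 1
        if i + 1 < len(text) and text[i + 1] in TASHKEEL:
            marked += 1
    return marked / letters if letters else 0.0


def looks_diacritized(text: str, threshold: float = 0.6) -> bool:
    """Heuristic check; the default threshold is an assumed value, not a documented one."""
    return tashkeel_ratio(text) >= threshold
```

A caller could run `looks_diacritized(text)` first and route text that fails the check through a separate diacritization step before synthesis.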
Generation Parameters
tts.generate_long_text(
    text=your_text,
    prompt_audio_path="reference.wav",
    prompt_transcript="reference_text",
    output_path="output.wav",
    max_chunk_length=300,        # Characters per chunk
    crossfade_duration=0.08,     # Crossfade duration in seconds
    normalize_audio_flag=True,
    remove_silence_flag=True,
    temperature=0.8,             # Generation randomness
    top_p=0.95,                  # Nucleus sampling
    top_k=50,                    # Top-k sampling
)
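To make the `crossfade_duration` parameter concrete: at the model's 24 kHz sample rate, 0.08 seconds corresponds to 1920 overlapping samples. The sketch below joins two mono clips with a linear crossfade over that overlap. It is an illustration only, written with plain Python lists so it runs without dependencies; the function name `crossfade` and the linear fade curve are assumptions, and a real pipeline would more likely operate on NumPy arrays and might use a different curve.

```python
def crossfade(a: list[float], b: list[float], sample_rate: int = 24000,
              crossfade_duration: float = 0.08) -> list[float]:
    """Join two mono clips, blending the end of `a` into the start of `b`."""
    # Number of overlapping samples, capped by the length of the shorter clip.
    n = min(round(sample_rate * crossfade_duration), len(a), len(b))
    if n == 0:
        return a + b
    head, tail = a[:-n], a[-n:]
    mixed = []
    for i in range(n):
        w = (i + 1) / (n + 1)  # weight of clip b rises linearly toward 1
        mixed.append(tail[i] * (1.0 - w) + b[i] * w)
    return head + mixed + b[n:]
```

The joined clip is shorter than the two inputs combined by exactly the overlap length, so longer crossfades smooth the joins at the cost of slightly compressing the pauses between chunks.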
Sample Output
Text: "إِنَّ الدَّوْلَةَ لَهَا أَعْمَارٌ طَبِيعِيَّةٌ كَمَا لِلْأَشْخَاصِ. وَأَنَّهَا تَنْتَقِلُ فِي أَطْوَارٍ مُخْتَلِفَةٍ، فَيَكُونُ الْجِيلُ الْأَوَّلُ مِنْ أَهْلِ الدَّوْلَةِ، قَدْ حَافَظُوا عَلَى الْخُشُونَةِ الْبَدَوِيَّةِ، وَالتَّوَحُّشِ، وَالشَّظَفِ، وَالْبَأْسِ، وَالِاشْتِرَاكِ فِي الْمَجْدِ. فَتَكُونُ حُدُودُهُمْ مَرْهُوبَةً، وَجَوَانِبُهُمْ مُعَزَّزَةً. ثُمَّ يَأْتِي الْجِيلُ الثَّانِي، فَيَتَحَوَّلُ حَالُهُمْ بِالْمُلْكِ وَالتَّرَفِ مِنَ الْبَدَاوَةِ إِلَى الْحَضَارَةِ، وَمِنَ الْخُشُونَةِ إِلَى التَّرَفِ. فَيَنْكَسِرُ سَوْرَةُ الْعَصَبِيَّةِ قَلِيلًا. ثُمَّ يَأْتِي الْجِيلُ الثَّالِثُ، فَيَكُونُونَ قَدْ نَسُوا عَهْدَ الْبَدَاوَةِ وَالْخُشُونَةِ، وَيَنْغَمِسُونَ فِي النَّعِيمِ وَالتَّرَفِ، وَيَصِيرُونَ عِيَالًا عَلَى الدَّوْلَةِ. فَيَسْقُطُونَ فِي الْهَرَمِ وَالزَّوَالِ، وَيَحْتَاجُونَ إِلَى مَنْ يُدَافِعُ عَنْهُمْ، فَتَبْدَأُ الدَّوْلَةُ فِي الِانْقِرَاضِ."
Reference audio
Further Fine-tuning
The model can be further fine-tuned for:
- Non-diacritized text (requires additional training)
- Specific voice characteristics
- Domain-specific vocabulary
- Dialectal variations
Fine-tuning infrastructure: Spark-TTS Fine-tune
License
This model is released under a Non-Commercial License.
- You may use this model for research, educational, and personal non-commercial purposes.
- Commercial use is strictly prohibited without explicit permission.
- If you wish to use this model for commercial purposes, please contact the model author.
Acknowledgments
- Base model: Spark-TTS by tuan12378
Limitations
- Requires fully diacritized Arabic text as input
- Optimized for Modern Standard Arabic (MSA), not dialectal Arabic
- Performance may vary with very long texts without proper chunking
- Voice cloning quality depends on reference audio quality and length
- Generation speed scales with text length
Model tree for IbrahimSalah/Arabic-TTS-Spark
Base model
SparkAudio/Spark-TTS-0.5B