---
library_name: transformers
tags: []
---
# DeepAr
## Model Description
DeepAr is a state-of-the-art Arabic Automatic Speech Recognition (ASR) model based on the whisper-turbo-v3 architecture. It is our latest and most advanced version, trained on the complete [CUAIStudents/Ar-ASR](https://huggingface.co/datasets/CUAIStudents/Ar-ASR) dataset.
**Key Features:**
- **High-fidelity transcription**: Transcribes exactly what is pronounced, maintaining authenticity of speech patterns
- **Speech improvement tool**: Designed to help users identify and correct speech patterns
- **Superior performance**: Outperforms many existing Arabic ASR models based on Whisper and its variants
- **Arabic with Tashkil**: Provides accurate diacritization for comprehensive Arabic text output
## What Makes DeepAr Different
Unlike traditional ASR models that normalize speech to standard text, DeepAr transcribes **exactly what is pronounced**. This unique approach makes it particularly valuable for:
- **Speech therapy and improvement**: Identifies pronunciation patterns and deviations
- **Language learning**: Helps learners understand their actual pronunciation vs. intended speech
- **Linguistic research**: Captures authentic speech patterns for analysis
- **Pronunciation assessment**: Provides detailed feedback on spoken Arabic
## Model Details
- **Base Architecture**: whisper-turbo-v3
- **Language**: Arabic (with Tashkil/diacritics)
- **Task**: High-fidelity Automatic Speech Recognition
- **Training Data**: Complete [CUAIStudents/Ar-ASR](https://huggingface.co/datasets/CUAIStudents/Ar-ASR) dataset
- **Model Type**: Production-ready, latest version
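Because the output carries Tashkil, downstream code often needs to separate the diacritic marks from the base letters, for example to compare against undiacritized reference text. The following is a minimal sketch using only the standard library. The helper names `strip_tashkil` and `count_tashkil` are hypothetical, and the chosen Unicode range (U+064B to U+0652 plus U+0670) is an assumption about which marks matter, not something the model defines.

```python
import re

# Assumed set of Arabic diacritics: fathatan through sukun (U+064B-U+0652)
# plus the superscript alef (U+0670). Adjust the range to your needs.
TASHKIL = re.compile(r"[\u064B-\u0652\u0670]")


def strip_tashkil(text: str) -> str:
    """Return the text with the diacritic marks above removed."""
    return TASHKIL.sub("", text)


def count_tashkil(text: str) -> int:
    """Count how many diacritic marks the text contains."""
    return len(TASHKIL.findall(text))


# Hypothetical, fully diacritized transcription
sample = "كَتَبَ الوَلَدُ"
print(strip_tashkil(sample))
print(count_tashkil(sample))
```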
## Performance
DeepAr demonstrates superior performance compared to many Arabic ASR models built on Whisper and its variants, particularly excelling in:
- Pronunciation accuracy detection
- Diacritic prediction
- Handling of Arabic speech variations
- Authentic speech pattern recognition
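To check these claims on your own recordings, one common metric is the character error rate (CER): the Levenshtein distance between a reference transcript and the model output, divided by the reference length. The sketch below is a plain-Python illustration, not the evaluation procedure used for this model. The function names are hypothetical. Since each diacritic is its own code point, a wrong or missing diacritic counts as one character error.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance with unit cost for insert, delete, and substitute."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of the hypothesis against a non-empty reference."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Averaging `cer` over a held-out set, with and without diacritics stripped, gives a rough view of how much of the error comes from diacritization alone.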
## Intended Use
This model is ideal for:
- Speech therapy and pronunciation correction applications
- Arabic language learning platforms
- Linguistic research and analysis
- Educational tools for speech improvement
- Applications requiring authentic speech transcription
- Quality assessment of spoken Arabic
## Usage
### Installation
```bash
pip install transformers torch torchaudio
```
### Quick Start
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import torchaudio

# Load model and processor
processor = WhisperProcessor.from_pretrained("CUAIStudents/DeepAr")
model = WhisperForConditionalGeneration.from_pretrained("CUAIStudents/DeepAr")

# Load audio; torchaudio returns a (channels, samples) tensor
audio_path = "path_to_your_arabic_audio.wav"
waveform, sample_rate = torchaudio.load(audio_path)

# Downmix multi-channel audio to mono
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# Resample to 16 kHz if necessary
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(sample_rate, 16000)
    waveform = resampler(waveform)

# Process audio into model input features
input_features = processor(
    waveform.squeeze(0).numpy(), sampling_rate=16000, return_tensors="pt"
).input_features

# Generate transcription
with torch.no_grad():
    predicted_ids = model.generate(input_features, language="ar")

# Decode transcription (exactly as pronounced)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(f"Pronounced as: {transcription}")
```
### Speech Analysis Example
```python
# Reuses `processor` and `model` from the Quick Start block.
def analyze_pronunciation(audio_path, target_text=None):
    """
    Transcribe a recording as pronounced and print it next to the
    target text, if one is provided.
    """
    waveform, sample_rate = torchaudio.load(audio_path)
    # Downmix multi-channel audio to mono
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)
    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(sample_rate, 16000)
        waveform = resampler(waveform)

    input_features = processor(
        waveform.squeeze(0).numpy(), sampling_rate=16000, return_tensors="pt"
    ).input_features

    with torch.no_grad():
        predicted_ids = model.generate(input_features, language="ar")
    actual_pronunciation = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

    print(f"Actual pronunciation: {actual_pronunciation}")
    if target_text:
        print(f"Target text: {target_text}")
        print("Analysis: Compare the differences for speech improvement")
    return actual_pronunciation

# Example usage
pronunciation = analyze_pronunciation("student_reading.wav", "النص المطلوب قراءته")
```
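The comparison step in the example above is left to the reader. One way to make it concrete is a word-level diff between the target text and the transcription, sketched here with the standard library's `difflib`. The function name `word_differences` and the sample strings are hypothetical. Whitespace tokenization is an assumption that may be too coarse for some texts.

```python
from difflib import SequenceMatcher


def word_differences(target_text: str, transcription: str):
    """Return (operation, target_words, spoken_words) for every mismatch.

    The operation is one of 'replace', 'delete', or 'insert', as reported
    by difflib when turning the target into the transcription.
    """
    target_words = target_text.split()
    spoken_words = transcription.split()
    matcher = SequenceMatcher(a=target_words, b=spoken_words, autojunk=False)
    return [
        (tag, target_words[i1:i2], spoken_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical target sentence and transcription
for tag, expected, spoken in word_differences(
    "ذهب الولد إلى المدرسة", "ذهب الولد الى المدرسة"
):
    print(tag, expected, spoken)
```

Running the same diff on characters instead of words (pass the strings directly rather than the split lists) would surface individual diacritic differences.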
### Batch Processing for Speech Assessment
```python
# Reuses `processor` and `model` from the Quick Start block.
def assess_multiple_recordings(audio_files, target_texts=None):
    """
    Transcribe several recordings and pair each result with its
    target text, if target texts are provided.
    """
    if target_texts is not None and len(target_texts) != len(audio_files):
        raise ValueError("target_texts must have one entry per audio file")

    results = []
    for i, audio_file in enumerate(audio_files):
        waveform, sample_rate = torchaudio.load(audio_file)
        # Downmix multi-channel audio to mono
        if waveform.shape[0] > 1:
            waveform = waveform.mean(dim=0, keepdim=True)
        if sample_rate != 16000:
            resampler = torchaudio.transforms.Resample(sample_rate, 16000)
            waveform = resampler(waveform)

        input_features = processor(
            waveform.squeeze(0).numpy(), sampling_rate=16000, return_tensors="pt"
        ).input_features

        with torch.no_grad():
            predicted_ids = model.generate(input_features, language="ar")
        pronunciation = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

        result = {
            'file': audio_file,
            'pronunciation': pronunciation,
            'target': target_texts[i] if target_texts else None
        }
        results.append(result)
        print(f"File {i+1}: {pronunciation}")
    return results

# Example usage
audio_files = ["recording1.wav", "recording2.wav", "recording3.wav"]
target_texts = ["النص الأول", "النص الثاني", "النص الثالث"]
assessment_results = assess_multiple_recordings(audio_files, target_texts)
```
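Once a batch has been transcribed, the result dictionaries can be ranked so that the recordings furthest from their target text come first. The sketch below assumes the `file`, `pronunciation`, and `target` keys produced above. The function name `rank_by_similarity` and the use of `difflib`'s similarity ratio as a score are illustrative choices, not part of the model.

```python
from difflib import SequenceMatcher


def rank_by_similarity(results):
    """Attach a 0..1 character-level similarity score and sort worst-first.

    Entries without a target text are skipped, since there is nothing
    to compare against.
    """
    scored = []
    for result in results:
        target = result.get("target")
        if target is None:
            continue
        score = SequenceMatcher(
            a=target, b=result["pronunciation"], autojunk=False
        ).ratio()
        scored.append({**result, "similarity": round(score, 3)})
    return sorted(scored, key=lambda r: r["similarity"])
```

Calling `rank_by_similarity(assessment_results)` returns new dictionaries and leaves the original list unchanged.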
## Training Data
This model was trained on the complete [CUAIStudents/Ar-ASR](https://huggingface.co/datasets/CUAIStudents/Ar-ASR) dataset, utilizing the full scope of available Arabic speech data with corresponding high-quality transcriptions including diacritics.
## Model Advantages
- **Authentic transcription**: Captures exactly what is spoken, not what should be spoken
- **High accuracy**: Superior performance compared to similar Whisper-based Arabic models
- **Comprehensive training**: Utilizes the complete dataset for optimal coverage
- **Practical applications**: Specifically designed for speech improvement and assessment
- **Diacritic accuracy**: Excellent performance in Arabic diacritization
## Limitations
- **MSA focus**: Optimized primarily for Modern Standard Arabic (MSA) rather than dialectal variations
## License
This model is released under the MIT License.
```
MIT License

Copyright (c) 2024 CUAIStudents

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
```