---
library_name: transformers
tags: []
---

# DeepAr

## Model Description

DeepAr is an Arabic Automatic Speech Recognition (ASR) model based on the whisper-turbo-v3 architecture. It is our latest version, trained on the complete [CUAIStudents/Ar-ASR](https://huggingface.co/datasets/CUAIStudents/Ar-ASR) dataset.

**Key Features:**

- **High-fidelity transcription**: Transcribes exactly what is pronounced, preserving the speaker's actual speech patterns
- **Speech improvement tool**: Designed to help users identify and correct speech patterns
- **Superior performance**: Outperforms many existing Arabic ASR models based on Whisper and its variants
- **Arabic with Tashkil**: Outputs fully diacritized Arabic text

## What Makes DeepAr Different

Unlike traditional ASR models, which normalize speech to standard text, DeepAr transcribes **exactly what is pronounced**. This makes it particularly valuable for:

- **Speech therapy and improvement**: Identifies pronunciation patterns and deviations
- **Language learning**: Helps learners compare their actual pronunciation with the intended speech
- **Linguistic research**: Captures authentic speech patterns for analysis
- **Pronunciation assessment**: Provides detailed feedback on spoken Arabic

## Model Details

- **Base Architecture**: whisper-turbo-v3
- **Language**: Arabic (with Tashkil/diacritics)
- **Task**: High-fidelity Automatic Speech Recognition
- **Training Data**: Complete [CUAIStudents/Ar-ASR](https://huggingface.co/datasets/CUAIStudents/Ar-ASR) dataset
- **Model Type**: Production-ready, latest version

## Performance

DeepAr outperforms many Arabic ASR models built on Whisper and its variants, particularly in:

- Pronunciation accuracy detection
- Diacritic prediction
- Handling of Arabic speech variations
- Authentic speech pattern recognition

## Intended Use

This model is suited to:

- Speech therapy and pronunciation correction applications
- Arabic language learning platforms
- Linguistic research and analysis
- Educational tools for speech improvement
- Applications requiring authentic speech transcription
- Quality assessment of spoken Arabic

## Usage

### Installation

```bash
pip install transformers torch torchaudio
```

### Quick Start

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import torchaudio

# Load model and processor
processor = WhisperProcessor.from_pretrained("CUAIStudents/DeepAr")
model = WhisperForConditionalGeneration.from_pretrained("CUAIStudents/DeepAr")

# Load audio; torchaudio returns a tensor of shape (channels, samples)
audio_path = "path_to_your_arabic_audio.wav"
waveform, sample_rate = torchaudio.load(audio_path)

# Resample to 16 kHz if necessary
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(sample_rate, 16000)
    waveform = resampler(waveform)

# Downmix to mono so that stereo files also yield a 1-D array
waveform = waveform.mean(dim=0)

# Extract input features (the feature extractor pads or truncates to 30 seconds)
input_features = processor(
    waveform.numpy(), sampling_rate=16000, return_tensors="pt"
).input_features

# Generate transcription
with torch.no_grad():
    predicted_ids = model.generate(input_features, language="ar")

# Decode transcription (exactly as pronounced)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(f"Pronounced as: {transcription}")
```

### Speech Analysis Example

```python
# Reuses `processor` and `model` loaded in Quick Start
def analyze_pronunciation(audio_path, target_text=None):
    """
    Analyze pronunciation and compare with target text if provided
    """
    waveform, sample_rate = torchaudio.load(audio_path)

    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(sample_rate, 16000)
        waveform = resampler(waveform)

    # Downmix to mono so that stereo files also yield a 1-D array
    waveform = waveform.mean(dim=0)

    input_features = processor(
        waveform.numpy(), sampling_rate=16000, return_tensors="pt"
    ).input_features

    with torch.no_grad():
        predicted_ids = model.generate(input_features, language="ar")

    actual_pronunciation = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

    print(f"Actual pronunciation: {actual_pronunciation}")
    if target_text:
        print(f"Target text: {target_text}")
        print("Analysis: Compare the differences for speech improvement")

    return actual_pronunciation

# Example usage
pronunciation = analyze_pronunciation("student_reading.wav", "النص المطلوب قراءته")
```

### Batch Processing for Speech Assessment

```python
# Reuses `processor` and `model` loaded in Quick Start
def assess_multiple_recordings(audio_files, target_texts=None):
    """
    Process multiple recordings for comprehensive speech assessment
    """
    # Guard against an IndexError when the two lists are misaligned
    if target_texts is not None and len(target_texts) != len(audio_files):
        raise ValueError("target_texts must have one entry per audio file")

    results = []

    for i, audio_file in enumerate(audio_files):
        waveform, sample_rate = torchaudio.load(audio_file)

        if sample_rate != 16000:
            resampler = torchaudio.transforms.Resample(sample_rate, 16000)
            waveform = resampler(waveform)

        # Downmix to mono so that stereo files also yield a 1-D array
        waveform = waveform.mean(dim=0)

        input_features = processor(
            waveform.numpy(), sampling_rate=16000, return_tensors="pt"
        ).input_features

        with torch.no_grad():
            predicted_ids = model.generate(input_features, language="ar")

        pronunciation = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

        result = {
            'file': audio_file,
            'pronunciation': pronunciation,
            'target': target_texts[i] if target_texts else None
        }
        results.append(result)

        print(f"File {i+1}: {pronunciation}")

    return results

# Example usage
audio_files = ["recording1.wav", "recording2.wav", "recording3.wav"]
target_texts = ["النص الأول", "النص الثاني", "النص الثالث"]
assessment_results = assess_multiple_recordings(audio_files, target_texts)
```

## Training Data

This model was trained on the complete [CUAIStudents/Ar-ASR](https://huggingface.co/datasets/CUAIStudents/Ar-ASR) dataset, which pairs Arabic speech with high-quality transcriptions that include diacritics.

## Model Advantages

- **Authentic transcription**: Captures exactly what is spoken, not what should be spoken
- **High accuracy**: Superior performance compared to similar Whisper-based Arabic models
- **Comprehensive training**: Uses the complete dataset for broad coverage
- **Practical applications**: Specifically designed for speech improvement and assessment
- **Diacritic accuracy**: Strong performance in Arabic diacritization

## Limitations

- **MSA focus**: Optimized primarily for Modern Standard Arabic (MSA) rather than dialectal variations

## License

This model is released under the MIT License.

```
MIT License

Copyright (c) 2024 CUAIStudents

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
```
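
## Comparing a Transcription with the Target Text

The Speech Analysis Example prints the transcription and the target text and leaves the comparison to the reader. One possible way to automate that step is a character-level diff, sketched below with Python's standard `difflib` module. The helper names `compare_with_target` and `strip_tashkil` are hypothetical and are not part of DeepAr or `transformers`; treating every Unicode combining mark as a diacritic is a simplifying assumption of this sketch.

```python
import difflib
import unicodedata


def strip_tashkil(text: str) -> str:
    """Drop Unicode combining marks, which include the Arabic diacritics."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


def compare_with_target(transcription: str, target_text: str):
    """Return (operation, target_part, spoken_part) for every mismatch.

    `operation` is one of "replace", "delete", or "insert", as reported
    by difflib.SequenceMatcher when turning the target into the transcription.
    """
    matcher = difflib.SequenceMatcher(None, target_text, transcription, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append((tag, target_text[i1:i2], transcription[j1:j2]))
    return differences


# Example usage with placeholder strings (not real model output)
target = "\u0643\u064e\u062a\u064e\u0628\u064e"  # target text with fatha on each letter
spoken = "\u0643\u064e\u062a\u0650\u0628\u064e"  # same word, kasra on the second letter

for operation, expected, actual in compare_with_target(spoken, target):
    print(f"{operation}: expected {expected!r}, heard {actual!r}")

# Comparing without diacritics shows whether only the Tashkil differs
same_letters = strip_tashkil(spoken) == strip_tashkil(target)
print(f"Same consonant skeleton: {same_letters}")
```

Diffing the diacritized strings surfaces vowel-level deviations, while the stripped comparison indicates whether the underlying letters match. A production assessment tool would likely need a more careful normalization than this sketch applies.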