Update README.md
README.md CHANGED
@@ -1,6 +1,13 @@
 ---
 base_model: openai/whisper-large-v3-turbo
 library_name: peft
+license: mit
+language:
+- ar
+metrics:
+- cer
+- wer
+pipeline_tag: automatic-speech-recognition
 ---
 # 🗣️ Whisper Large v3 Turbo – Moroccan Darija (LoRA Fine-tuned)
 
@@ -84,15 +91,12 @@ print(output["text"])
 - **Optimizer:** AdamW
 - **Seed:** 42
 - **Training Time:** ~4.1 hours on 1 × H100 80 GB
-- **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `out_proj`, `fc1`, `fc2`
+- **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `out_proj`, `fc1`, `fc2`
 - **Rank (`r`):** 16
-- **Alpha (`lora_alpha`):**
+- **Alpha (`lora_alpha`):** 32
 
 ### 🎧 Dataset
 - **Data Source:** Private Moroccan Darija speech corpus (to be released soon).
-- **Segmentation:** All audio split into ≤30 s chunks.
-
-> The full dataset used to train this model will be open-sourced soon.
 
 ---
 
@@ -107,5 +111,4 @@ print(output["text"])
 Evaluation was performed on a held-out subset from the same data distribution.
 The model achieves a low CER but a relatively higher WER compared to other languages. This difference is mainly due to the absence of a standardized writing system for Darija in Morocco — many words can be spelled in several valid ways. This variability also reflects a limitation of the dataset used for fine-tuning and highlights the need to establish a consistent orthographic standard for Darija before large-scale data collection efforts.
 
----
-
+---
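For context, the LoRA hyperparameters listed in the hunks above (rank 16, alpha 32, attention and MLP target modules) map onto a PEFT `LoraConfig` roughly as follows. This is a sketch only: `lora_dropout` is an assumption, since the diff does not state it.

```python
from peft import LoraConfig

# Sketch of the adapter config implied by the listed hyperparameters.
lora_config = LoraConfig(
    r=16,                        # LoRA rank, as listed in the README
    lora_alpha=32,               # scaling factor, as filled in by this commit
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj", "fc1", "fc2"],
    lora_dropout=0.05,           # assumed value; not given in the README
)
```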
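The CER/WER gap described in the evaluation section can be reproduced with a toy example: one accepted spelling variant changes a whole word (raising WER) while touching only a character or two (barely moving CER). The romanized strings below are hypothetical stand-ins, not items from the corpus.

```python
# Illustration: why Darija spelling variants inflate WER far more than CER.

def edit_distance(a, b):
    """Levenshtein distance between two sequences (words or characters)."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                        # deletion
                        dp[j - 1] + 1,                    # insertion
                        prev + (a[i - 1] != b[j - 1]))    # substitution
            prev = cur
    return dp[n]

def wer(ref, hyp):
    """Word error rate: word-level edits / reference word count."""
    r = ref.split()
    return edit_distance(r, hyp.split()) / len(r)

def cer(ref, hyp):
    """Character error rate: character-level edits / reference length."""
    return edit_distance(ref, hyp) / len(ref)

ref = "salam labas kidayr"    # hypothetical romanized reference
hyp = "selam labas kidayer"   # same speech, two variant spellings
print(f"WER = {wer(ref, hyp):.2f}")   # → WER = 0.67 (2 of 3 words differ)
print(f"CER = {cer(ref, hyp):.2f}")   # → CER = 0.11 (2 edits over 18 chars)
```

Scoring against a single reference spelling penalizes every variant as a full word error, which is exactly the WER inflation the model card attributes to the lack of a Darija orthographic standard.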