anaszil committed · Commit 9240591 · verified · 1 Parent(s): c9eb957

Update README.md

  1. README.md +10 -7
README.md CHANGED
@@ -1,6 +1,13 @@
1
  ---
2
  base_model: openai/whisper-large-v3-turbo
3
  library_name: peft
4
  ---
5
  # 🗣️ Whisper Large v3 Turbo – Moroccan Darija (LoRA Fine-tuned)
6
 
@@ -84,15 +91,12 @@ print(output["text"])
84
  - **Optimizer:** AdamW
85
  - **Seed:** 42
86
  - **Training Time:** ~4.1 hours on 1 × H100 80 GB
87
- - **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `out_proj`, `fc1`, `fc2` |
88
  - **Rank (`r`):** 16
89
- - **Alpha (`lora_alpha`):** | 32 |
90
 
91
  ### 🎧 Dataset
92
  - **Data Source:** Private Moroccan Darija speech corpus (to be released soon).
93
- - **Segmentation:** All audio split into ≤30 s chunks.
94
-
95
- > The full dataset used to train this model will be open-sourced soon.
96
 
97
  ---
98
 
@@ -107,5 +111,4 @@ print(output["text"])
107
  Evaluation was performed on a held-out subset from the same data distribution.
108
  The model achieves a low CER but a relatively higher WER compared to other languages. This difference is mainly due to the absence of a standardized writing system for Darija in Morocco — many words can be spelled in several valid ways. This variability also reflects a limitation of the dataset used for fine-tuning and highlights the need to establish a consistent orthographic standard for Darija before large-scale data collection efforts.
109
 
110
- ---
111
-
 
1
  ---
2
  base_model: openai/whisper-large-v3-turbo
3
  library_name: peft
4
+ license: mit
5
+ language:
6
+ - ar
7
+ metrics:
8
+ - cer
9
+ - wer
10
+ pipeline_tag: automatic-speech-recognition
11
  ---
12
  # 🗣️ Whisper Large v3 Turbo – Moroccan Darija (LoRA Fine-tuned)
13
 
 
91
  - **Optimizer:** AdamW
92
  - **Seed:** 42
93
  - **Training Time:** ~4.1 hours on 1 × H100 80 GB
94
+ - **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `out_proj`, `fc1`, `fc2`
95
  - **Rank (`r`):** 16
96
+ - **Alpha (`lora_alpha`):** 32
97
 
98
  ### 🎧 Dataset
99
  - **Data Source:** Private Moroccan Darija speech corpus (to be released soon).
 
 
 
100
 
101
  ---
102
 
 
111
  Evaluation was performed on a held-out subset from the same data distribution.
112
  The model achieves a low CER but a relatively higher WER compared to other languages. This difference is mainly due to the absence of a standardized writing system for Darija in Morocco — many words can be spelled in several valid ways. This variability also reflects a limitation of the dataset used for fine-tuning and highlights the need to establish a consistent orthographic standard for Darija before large-scale data collection efforts.
113
 
114
+ ---
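
Note on the evaluation paragraph kept as context in this diff: the claim that spelling variation inflates WER much more than CER can be checked with a small sketch. The code below is an illustration only, not the evaluation script used for this model; the helper names `edit_distance`, `wer`, and `cer` are hypothetical, and the romanized word pair is an illustrative spelling variant, not a sample from the dataset.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Illustrative pair (assumption): the same word written with two spellings.
reference = "kifach nta"
hypothesis = "kifash nta"

print(f"WER = {wer(reference, hypothesis):.2f}")
print(f"CER = {cer(reference, hypothesis):.2f}")
```

A single differing character counts as one full word error out of two words, but only one character error out of ten, which is the gap between CER and WER that the README attributes to non-standardized Darija spelling.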