Update README and model file

Files changed (7)

4gram.zip +2 -2
README.md +38 -21
example.mp3 +0 -0
example.wav +0 -0
example2.mp3 +0 -0
hyperparams.yaml +1 -1
model.ckpt +1 -1

4gram.zip CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:b44b6f17af4baf24dcd93f2d411d1664e32b1c3cdd2a1f07458c53fd02e6f487
-size 2773083070

 version https://git-lfs.github.com/spec/v1
+oid sha256:e6e5c67796f2399c116073286a0870f141b4ddf1b6a75723c139c77d21114d55
+size 2481196955
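
Reviewer note: the two blocks above are Git LFS pointer files, so the diff only shows that the object hash and size changed, not the archive contents. A minimal sketch of checking a downloaded file against a pointer follows. The helper names `parse_lfs_pointer` and `file_matches_pointer` are hypothetical and are not part of this repository or of any Git LFS tooling.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid line has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def file_matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the size and digest named by the pointer."""
    hasher = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so multi-gigabyte archives are not loaded into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


# New-side pointer for 4gram.zip, copied from the diff above.
new_pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:e6e5c67796f2399c116073286a0870f141b4ddf1b6a75723c139c77d21114d55\n"
    "size 2481196955\n"
)
print(new_pointer["size"])  # 2481196955
```

Assuming the archive has been downloaded to a local path of your choosing, `file_matches_pointer(local_path, new_pointer)` should return `True` for the updated file and `False` for the previous 2773083070-byte version.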

README.md CHANGED

@@ -13,12 +13,10 @@ tags:
 - Transformer
 license: cc-by-nc-4.0
 widget:
-- example_title: VLSP ASR 2020 test T1
-  src: https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/raw/main/audio-test/t1_0001-00010.wav
-- example_title: VLSP ASR 2020 test T1
-  src: https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/raw/main/audio-test/t1_utt000000042.wav
-- example_title: VLSP ASR 2020 test T2
-  src: https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/raw/main/audio-test/t2_0000006682.wav
 model-index:
 - name: Wav2vec2 Base Vietnamese 270h
   results:
@@ -33,6 +31,28 @@ model-index:
        - name: Test WER
          type: wer
          value: 9.66
   - task:
       name: Speech Recognition
       type: automatic-speech-recognition
@@ -43,7 +63,7 @@ model-index:
     metrics:
        - name: Test WER
          type: wer
-         value: 4.04
 ---
 # Wav2Vec2-Base-Vietnamese-270h
 Fine-tuned Wav2Vec2 model on Vietnamese Speech Recognition task using about 270h labelled data combined from multiple datasets including [Common Voice](https://huggingface.co/datasets/common_voice), [VIVOS](https://huggingface.co/datasets/vivos), [VLSP2020](https://vlsp.org.vn/vlsp2020/eval/asr). The model was fine-tuned using SpeechBrain toolkit with a custom tokenizer. For a better experience, we encourage you to learn more about [SpeechBrain](https://speechbrain.github.io/).
@@ -51,19 +71,15 @@ When using this model, make sure that your speech input is sampled at 16kHz.
 Please refer to [huggingface blog](https://huggingface.co/blog/fine-tune-wav2vec2-english) or [speechbrain](https://github.com/speechbrain/speechbrain/tree/develop/recipes/CommonVoice/ASR/CTC) on how to fine-tune Wav2Vec2 model on a specific language.
 ### Benchmark WER result:
-| | [VIVOS](https://huggingface.co/datasets/vivos) | [COMMON VOICE VI](https://huggingface.co/datasets/common_voice) |
-|---|---|---|
-|without LM| 8.41 | 17.82 |
-|with 4-grams LM| 4.04 | 9.66 |
 The language model was trained using [OSCAR](https://huggingface.co/datasets/oscar-corpus/OSCAR-2109) dataset on about 32GB of crawled text.
 ### Install SpeechBrain
-To use this model, you should install speechbrain from source. This is not required for speechbrain version > 0.5.10
-```bash
-pip install git+https://github.com/speechbrain/speechbrain.git@develop
-```
 ### Usage
 The model can be used directly (without a language model) as follows:
@@ -71,14 +87,15 @@ The model can be used directly (without a language model) as follows:
 from speechbrain.pretrained import EncoderASR
 model = EncoderASR.from_hparams(source="dragonSwing/wav2vec2-base-vn-270h", savedir="pretrained_models/asr-wav2vec2-vi")
-model.transcribe_file('dragonSwing/wav2vec2-base-vn-270h/example.wav')
 ```
 ### Inference on GPU
 To perform inference on the GPU, add  `run_opts={"device":"cuda"}`  when calling the `from_hparams` method.
 ### Evaluation
-The model can be evaluated as follows on the Vietnamese test data of Common Voice.
 ```python
 import torch
 import torchaudio
@@ -86,7 +103,7 @@ from datasets import load_dataset, load_metric, Audio
 from transformers import Wav2Vec2FeatureExtractor
 from speechbrain.pretrained import EncoderASR
 import re
-test_dataset = load_dataset("common_voice", "vi", split="test")
 test_dataset = test_dataset.cast_column("audio", Audio(sampling_rate=16_000))
 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 wer = load_metric("wer")
@@ -116,10 +133,10 @@ def evaluate(batch):
   batch["pred_strings"] = pred_str
   return batch
-result = test_dataset.map(evaluate, batched=True, batch_size=4)
 print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["target_text"])))
 ```
-**Test Result**: 17.817680%
 #### Citation
 ```

 - Transformer
 license: cc-by-nc-4.0
 widget:
+- example_title: Example 1
+  src: https://huggingface.co/dragonSwing/wav2vec2-base-vn-270h/raw/main/example.mp3
+- example_title: Example 2
+  src: https://huggingface.co/dragonSwing/wav2vec2-base-vn-270h/raw/main/example2.mp3
 model-index:
 - name: Wav2vec2 Base Vietnamese 270h
   results:
        - name: Test WER
          type: wer
          value: 9.66
+  - task:
+      name: Speech Recognition
+      type: automatic-speech-recognition
+    dataset:
+      name: Common Voice 7.0
+      type: mozilla-foundation/common_voice_7_0
+      args: vi
+    metrics:
+       - name: Test WER
+         type: wer
+         value: 5.57
+  - task:
+      name: Speech Recognition
+      type: automatic-speech-recognition
+    dataset:
+      name: Common Voice 8.0
+      type: mozilla-foundation/common_voice_8_0
+      args: vi
+    metrics:
+       - name: Test WER
+         type: wer
+         value: 5.76
   - task:
       name: Speech Recognition
       type: automatic-speech-recognition
     metrics:
        - name: Test WER
          type: wer
+         value: 3.70
 ---
 # Wav2Vec2-Base-Vietnamese-270h
 Fine-tuned Wav2Vec2 model on Vietnamese Speech Recognition task using about 270h labelled data combined from multiple datasets including [Common Voice](https://huggingface.co/datasets/common_voice), [VIVOS](https://huggingface.co/datasets/vivos), [VLSP2020](https://vlsp.org.vn/vlsp2020/eval/asr). The model was fine-tuned using SpeechBrain toolkit with a custom tokenizer. For a better experience, we encourage you to learn more about [SpeechBrain](https://speechbrain.github.io/).
 Please refer to [huggingface blog](https://huggingface.co/blog/fine-tune-wav2vec2-english) or [speechbrain](https://github.com/speechbrain/speechbrain/tree/develop/recipes/CommonVoice/ASR/CTC) on how to fine-tune Wav2Vec2 model on a specific language.
 ### Benchmark WER result:
+| | [VIVOS](https://huggingface.co/datasets/vivos) | [COMMON VOICE 7.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_7_0) | [COMMON VOICE 8.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0) |
+|---|---|---|---|
+|without LM| 8.23 | 12.15 | 12.15 |
+|with 4-grams LM| 3.70 | 5.57 | 5.76 |
 The language model was trained using [OSCAR](https://huggingface.co/datasets/oscar-corpus/OSCAR-2109) dataset on about 32GB of crawled text.
 ### Install SpeechBrain
+To use this model, you should install speechbrain > 0.5.10
 ### Usage
 The model can be used directly (without a language model) as follows:
 from speechbrain.pretrained import EncoderASR
 model = EncoderASR.from_hparams(source="dragonSwing/wav2vec2-base-vn-270h", savedir="pretrained_models/asr-wav2vec2-vi")
+model.transcribe_file('dragonSwing/wav2vec2-base-vn-270h/example.mp3')
+# Output: được hồ chí minh coi là một động lực lớn của sự phát triển đất nước
 ```
 ### Inference on GPU
 To perform inference on the GPU, add  `run_opts={"device":"cuda"}`  when calling the `from_hparams` method.
 ### Evaluation
+The model can be evaluated as follows on the Vietnamese test data of Common Voice 8.0.
 ```python
 import torch
 import torchaudio
 from transformers import Wav2Vec2FeatureExtractor
 from speechbrain.pretrained import EncoderASR
 import re
+test_dataset = load_dataset("mozilla-foundation/common_voice_8_0", "vi", split="test", use_auth_token=True)
 test_dataset = test_dataset.cast_column("audio", Audio(sampling_rate=16_000))
 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 wer = load_metric("wer")
   batch["pred_strings"] = pred_str
   return batch
+result = test_dataset.map(evaluate, batched=True, batch_size=1)
 print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["target_text"])))
 ```
+**Test Result**: 12.155553%
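
Reviewer note: the figure above is a word error rate (WER), which the README computes through `load_metric("wer")`. For readers unfamiliar with the metric, here is a minimal pure-Python sketch of a corpus-level WER: total word-level edit distance divided by total reference words. The function name `word_error_rate` is hypothetical. This is an illustration of the metric, not the code path used to produce the reported number.

```python
def word_error_rate(references, predictions):
    """Corpus-level WER: summed word edit distance over summed reference length."""
    total_edits = 0
    total_words = 0
    for reference, prediction in zip(references, predictions):
        ref_words = reference.split()
        hyp_words = prediction.split()
        # Levenshtein distance over words, keeping one row of the table at a time.
        previous = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            current = [i]
            for j, hyp_word in enumerate(hyp_words, start=1):
                current.append(min(
                    previous[j] + 1,                           # deletion
                    current[j - 1] + 1,                        # insertion
                    previous[j - 1] + (ref_word != hyp_word),  # substitution or match
                ))
            previous = current
        total_edits += previous[-1]
        total_words += len(ref_words)
    return total_edits / total_words


# Made-up strings for illustration: one substitution and one deletion
# against a four-word reference.
print(100 * word_error_rate(["a b c d"], ["a x c"]))  # 50.0
```

Multiplying by 100, as the evaluation script does, gives the percentage form used in the benchmark table.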
 #### Citation
 ```

example.mp3 ADDED

Binary file (11.8 kB)

example.wav DELETED

Binary file (49.6 kB)

example2.mp3 ADDED

Binary file (10.5 kB)

hyperparams.yaml CHANGED

@@ -1,5 +1,5 @@
 # ################################
-# Model: wav2vec2 + DNN + CTC/Attention
 # Augmentation: SpecAugment
 # Authors: Le Do Thanh Binh 2021
 # ################################

 # ################################
+# Model: wav2vec2 + CTC
 # Augmentation: SpecAugment
 # Authors: Le Do Thanh Binh 2021
 # ################################

model.ckpt CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e315a64b704fff992630eccd824c2780ec79c346b2c64518ee9b7845af03a65c
 size 379749523

 version https://git-lfs.github.com/spec/v1
+oid sha256:8f28211bbcf163899adc748d90c1b40b481a6c785b1e71785f90e7e2a95c8e78
 size 379749523