Push model using huggingface_hub.
- .gitattributes +1 -0
- README.md +143 -0
- soloba-ctc-0.6b-v2.nemo +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+soloba-ctc-0.6b-v2.nemo filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,143 @@
---
language:
- bm
library_name: nemo
datasets:
- RobotsMali/afvoices
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- FastConformer
- Conformer
- pytorch
- Bambara
- NeMo
license: cc-by-4.0
base_model: RobotsMali/soloba-ctc-0.6b-v0
model-index:
- name: soloba-ctc-0.6b-v2
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: African Next Voices
      type: RobotsMali/afvoices
      split: test
      args:
        language: bm
    metrics:
    - name: Test WER
      type: wer
      value: 30.85416567085703
    - name: Test CER
      type: cer
      value: 14.448940587985465
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Nyana Eval
      type: RobotsMali/nyana-eval
      split: test
      args:
        language: bm
    metrics:
    - name: Test WER
      type: wer
      value: XX.XXX
    - name: Test CER
      type: cer
      value: YY.YYY
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
---

# Soloba-CTC-600M Series

<style>
img {
  display: inline;
}
</style>

[](#model-architecture)
| [](#model-architecture)
| [](#datasets)

`soloba-ctc-0.6b-v2` is a fine-tuned version of [`RobotsMali/soloba-ctc-0.6b-v0`](https://huggingface.co/RobotsMali/soloba-ctc-0.6b-v0) on the African Next Voices (ANV) dataset. The model does not consistently produce capitalization or punctuation, and it cannot produce acoustic event tags like those found in the ANV dataset in its transcriptions. It was fine-tuned using **NVIDIA NeMo**.

## **🚨 Important Note**

This model, along with its associated resources, is part of an **ongoing research effort**; improvements and refinements are expected in future versions. Users should be aware that:

- **The model may not generalize well across all speaking conditions and dialects.**
- **Community feedback is welcome, and contributions are encouraged to refine the model further.**

## NVIDIA NeMo: Training

To fine-tune or experiment with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you've installed the latest version of PyTorch.

```bash
pip install "nemo-toolkit[asr]"
```

## How to Use This Model

Note that this model has been released primarily for research purposes.

### Load Model with NeMo

```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="RobotsMali/soloba-ctc-0.6b-v2")
```

### Transcribe Audio

```python
asr_model.eval()

# Assuming you have a test audio file named sample_audio.wav
asr_model.transcribe(['sample_audio.wav'])
```

### Input

This model accepts **mono-channel audio (WAV files)** as input and resamples it to a *16 kHz sample rate* before performing the forward pass.

### Output

This model returns transcribed speech as a `Hypothesis` object with a `text` attribute containing the transcription string for a given speech sample (nemo>=2.3), as sketched below.
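
A minimal sketch of reading the transcription back out, reusing the `sample_audio.wav` file assumed in the example above:

```python
# transcribe() returns a list of Hypothesis objects, one per input file
hypotheses = asr_model.transcribe(['sample_audio.wav'])

# The transcription string lives in the .text attribute (nemo>=2.3)
print(hypotheses[0].text)
```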

## Model Architecture

This model uses a FastConformer encoder and a convolutional decoder trained with CTC loss. FastConformer is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. You can find more details on FastConformer here: [Fast-Conformer Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer).
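
You can inspect these architectural details on the restored model itself; a small sketch, assuming the standard NeMo config layout for Conformer-family encoders (the `subsampling` and `subsampling_factor` field names come from that layout and were not verified against this checkpoint):

```python
from omegaconf import OmegaConf

# Dump the full model configuration, including encoder and decoder sections
print(OmegaConf.to_yaml(asr_model.cfg))

# The encoder section should report the downsampling scheme, e.g. an 8x factor
print(asr_model.cfg.encoder.subsampling, asr_model.cfg.encoder.subsampling_factor)
```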

## Training

The NeMo toolkit was used to fine-tune this model for **165,247 steps** from the `RobotsMali/soloba-ctc-0.6b-v0` checkpoint. The fine-tuning code and configurations can be found at [RobotsMali-AI/bambara-asr](https://github.com/RobotsMali-AI/bambara-asr/).
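
For orientation, a minimal fine-tuning sketch with NeMo and PyTorch Lightning follows. The manifest path, batch size, and trainer settings are illustrative assumptions, not the configuration used for this model; see the linked repository for the actual setup:

```python
import lightning.pytorch as pl
import nemo.collections.asr as nemo_asr

# Start from the base checkpoint, as this model did
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="RobotsMali/soloba-ctc-0.6b-v0")

# Point the model at a NeMo-style JSONL manifest (hypothetical path)
asr_model.setup_training_data(train_data_config={
    'manifest_filepath': 'train_manifest.json',
    'sample_rate': 16000,
    'batch_size': 16,
    'shuffle': True,
})

trainer = pl.Trainer(devices=1, accelerator='gpu', max_steps=1000)
trainer.fit(asr_model)
```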

The tokenizer for this model was trained on the text transcripts of the train set of RobotsMali/afvoices using this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
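
A typical invocation of that script looks like the following; the vocabulary size and tokenizer type here are illustrative guesses, not the values used for this model:

```bash
python process_asr_text_tokenizer.py \
    --manifest=train_manifest.json \
    --data_root=tokenizer_out/ \
    --vocab_size=1024 \
    --tokenizer=spe \
    --spe_type=bpe
```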

## Dataset

This model was fine-tuned on a 100-hour pre-completion subset of the [African Next Voices](https://huggingface.co/datasets/RobotsMali/afvoices) dataset. You can reconstitute that subset with these [manifest files](https://github.com/RobotsMali-AI/bambara-asr/afvoices/pre-manifests).
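
Those manifests follow NeMo's standard JSONL layout, one utterance per line; the field values below are made up for illustration:

```json
{"audio_filepath": "audios/utt_0001.wav", "duration": 4.2, "text": "i ni ce"}
```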

## Performance

We report the Word Error Rate (WER) and Character Error Rate (CER) for this model:

| Benchmark | Decoding | WER (%) ↓ | CER (%) ↓ |
|---------------|----------|-----------------|-----------------|
| African Next Voices (afvoices) | CTC | 30.85 | 14.44 |
| Nyana Eval | CTC | XX.XX | YY.YY |
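
To evaluate the model on your own Bambara data, WER and CER can be computed with the `jiwer` package (one common choice; NeMo also ships its own metrics):

```python
import jiwer

reference = "a ground-truth transcription"  # hypothetical reference text
hypothesis = asr_model.transcribe(['sample_audio.wav'])[0].text

print("WER:", jiwer.wer(reference, hypothesis))
print("CER:", jiwer.cer(reference, hypothesis))
```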

## License

This model is released under the **CC-BY-4.0** license. By using this model, you agree to the terms of the license.

---

Feel free to open a discussion on Hugging Face or [file an issue](https://github.com/RobotsMali-AI/bambara-asr/issues) on GitHub for help or contributions.
soloba-ctc-0.6b-v2.nemo ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3a14ea78886f76ce7384a3e1ccc6b096f2c9b99c94fa2837f10415671305c8da
+size 2434017280