# wav2vec 2.0

wav2vec 2.0 learns speech representations on unlabeled data as described in [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (Baevski et al., 2020)](https://arxiv.org/abs/2006.11477).

We also learned speech representations in multiple languages, as described in [Unsupervised Cross-lingual Representation Learning for Speech Recognition (Conneau et al., 2020)](https://arxiv.org/abs/2006.13979).

We also combined wav2vec 2.0 with self-training in [Self-training and Pre-training are Complementary for Speech Recognition (Xu et al., 2020)](https://arxiv.org/abs/2010.11430).

We combined speech data from multiple domains in [Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training (Hsu et al., 2021)](https://arxiv.org/abs/2104.01027).
## Pre-trained models

Model | Finetuning split | Dataset | Download
---|---|---|---
Wav2Vec 2.0 Base | No finetuning | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small.pt)
Wav2Vec 2.0 Base | 10 minutes | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small_10m.pt)
Wav2Vec 2.0 Base | 100 hours | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small_100h.pt)
Wav2Vec 2.0 Base | 960 hours | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small_960h.pt)
Wav2Vec 2.0 Large | No finetuning | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/libri960_big.pt)
Wav2Vec 2.0 Large | 10 minutes | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_big_10m.pt)
Wav2Vec 2.0 Large | 100 hours | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_big_100h.pt)
Wav2Vec 2.0 Large | 960 hours | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_big_960h.pt)
Wav2Vec 2.0 Large (LV-60)* | No finetuning | [Libri-Light](https://github.com/facebookresearch/libri-light) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_new.pt)
Wav2Vec 2.0 Large (LV-60)* | 10 minutes | [Libri-Light](https://github.com/facebookresearch/libri-light) + [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_10m_new.pt)
Wav2Vec 2.0 Large (LV-60)* | 100 hours | [Libri-Light](https://github.com/facebookresearch/libri-light) + [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_100h_new.pt)
Wav2Vec 2.0 Large (LV-60)* | 960 hours | [Libri-Light](https://github.com/facebookresearch/libri-light) + [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec2_vox_960h_new.pt)
Wav2Vec 2.0 Large (LV-60) + Self Training * | 10 minutes | [Libri-Light](https://github.com/facebookresearch/libri-light) + [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_10m_pl.pt)
Wav2Vec 2.0 Large (LV-60) + Self Training * | 100 hours | [Libri-Light](https://github.com/facebookresearch/libri-light) + [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_100h_pl.pt)
Wav2Vec 2.0 Large (LV-60) + Self Training * | 960 hours | [Libri-Light](https://github.com/facebookresearch/libri-light) + [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_960h_pl.pt)
Wav2Vec 2.0 Large (LV-60 + CV + SWBD + FSH) ** | No finetuning | [Libri-Light](https://github.com/facebookresearch/libri-light) + [CommonVoice](https://commonvoice.mozilla.org/en/languages) + [Switchboard](https://catalog.ldc.upenn.edu/LDC97S62) + [Fisher](https://catalog.ldc.upenn.edu/LDC2004T19) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/w2v_large_lv_fsh_swbd_cv.pt)
Wav2Vec 2.0 Large (LV-60 + CV + SWBD + FSH) ** | 960 hours Librispeech | [Libri-Light](https://github.com/facebookresearch/libri-light) + [CommonVoice](https://commonvoice.mozilla.org/en/languages) + [Switchboard](https://catalog.ldc.upenn.edu/LDC97S62) + [Fisher](https://catalog.ldc.upenn.edu/LDC2004T19) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/w2v_large_lv_fsh_swbd_cv_ftls960.pt)
Wav2Vec 2.0 Large (LV-60 + CV + SWBD + FSH) ** | 300 hours Switchboard | [Libri-Light](https://github.com/facebookresearch/libri-light) + [CommonVoice](https://commonvoice.mozilla.org/en/languages) + [Switchboard](https://catalog.ldc.upenn.edu/LDC97S62) + [Fisher](https://catalog.ldc.upenn.edu/LDC2004T19) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/w2v_large_lv_fsh_swbd_cv_ftsb300.pt)

\* updated (Oct. 24, 2020)\
\*\* updated (Jul. 8, 2021)
We also release multilingual pre-trained wav2vec 2.0 (XLSR) models:

Model | Architecture | Hours | Languages | Datasets | Download
---|---|---|---|---|---
XLSR-53 | Large | 56k | 53 | MLS, CommonVoice, BABEL | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/xlsr_53_56k.pt)

The XLSR model uses the following datasets for multilingual pretraining:

* **[MLS: Multilingual LibriSpeech](https://indico2.conference4me.psnc.pl/event/35/contributions/3585/attachments/1060/1101/Wed-2-6-10.pdf)** (8 languages, 50.7k hours): *Dutch, English, French, German, Italian, Polish, Portuguese, Spanish*
* **[CommonVoice](https://commonvoice.mozilla.org/en/languages)** (36 languages, 3.6k hours): *Arabic, Basque, Breton, Chinese (CN), Chinese (HK), Chinese (TW), Chuvash, Dhivehi, Dutch, English, Esperanto, Estonian, French, German, Hakh-Chin, Indonesian, Interlingua, Irish, Italian, Japanese, Kabyle, Kinyarwanda, Kyrgyz, Latvian, Mongolian, Persian, Portuguese, Russian, Sakha, Slovenian, Spanish, Swedish, Tamil, Tatar, Turkish, Welsh* (see also [finetuning splits](https://dl.fbaipublicfiles.com/cpc_audio/common_voices_splits.tar.gz) from [this paper](https://arxiv.org/abs/2002.02848)).
* **[Babel](https://catalog.ldc.upenn.edu/byyear)** (17 languages, 1.7k hours): *Assamese, Bengali, Cantonese, Cebuano, Georgian, Haitian, Kazakh, Kurmanji, Lao, Pashto, Swahili, Tagalog, Tamil, Tok, Turkish, Vietnamese, Zulu*
## Training a new model with the CLI tools

Given a directory containing wav files to be used for pretraining (we recommend splitting each file into separate files 10 to 30 seconds in length):

### Prepare training data manifest:

First, install the `soundfile` library:

```shell script
pip install soundfile
```

Next, run:

```shell script
$ python examples/wav2vec/wav2vec_manifest.py /path/to/waves --dest /manifest/path --ext $ext --valid-percent $valid
```

`$ext` should be set to flac, wav, or whatever format your dataset happens to use that soundfile can read.

`$valid` should be set to some reasonable percentage (like 0.01) of training data to use for validation.
To use a pre-defined validation set (like dev-other from librispeech), set it to 0 and then overwrite valid.tsv with a
separately pre-processed manifest file.
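For orientation, the resulting train.tsv/valid.tsv manifests should look roughly like the sketch below: the first line is the root audio directory, followed by one tab-separated relative path and sample count per utterance (the file names and counts here are made up for illustration):

```
/path/to/waves
speaker1/utt1.flac	95984
speaker1/utt2.flac	160042
speaker2/utt1.flac	121600
```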
### Train a wav2vec 2.0 base model:

This configuration was used for the base model trained on the Librispeech dataset in the wav2vec 2.0 paper.

Note that the input is expected to be single channel, sampled at 16 kHz.

```shell script
$ fairseq-hydra-train \
    task.data=/path/to/data \
    --config-dir /path/to/fairseq-py/examples/wav2vec/config/pretraining \
    --config-name wav2vec2_base_librispeech
```

Note: you can simulate 64 GPUs by using k GPUs and adding command line parameters (before `--config-dir`)
`distributed_training.distributed_world_size=k` `+optimization.update_freq='[x]'` where x = 64/k.
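As a concrete illustration of the note above (the GPU count is an assumption about your hardware, not part of the original recipe), with k = 8 GPUs you would set the update frequency to 64/8 = 8:

```shell script
$ fairseq-hydra-train \
    task.data=/path/to/data \
    distributed_training.distributed_world_size=8 \
    +optimization.update_freq='[8]' \
    --config-dir /path/to/fairseq-py/examples/wav2vec/config/pretraining \
    --config-name wav2vec2_base_librispeech
```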
### Train a wav2vec 2.0 large model:

This configuration was used for the large model trained on the Libri-light dataset in the wav2vec 2.0 paper.

```shell script
$ fairseq-hydra-train \
    task.data=/path/to/data \
    --config-dir /path/to/fairseq-py/examples/wav2vec/config/pretraining \
    --config-name wav2vec2_large_librivox
```

Note: you can simulate 128 GPUs by using k GPUs and adding command line parameters (before `--config-dir`)
`distributed_training.distributed_world_size=k` `+optimization.update_freq='[x]'` where x = 128/k.
### Fine-tune a pre-trained model with CTC:

Fine-tuning a model requires parallel audio and label files, as well as a vocabulary file in fairseq format.
A letter vocabulary can be downloaded [here](https://dl.fbaipublicfiles.com/fairseq/wav2vec/dict.ltr.txt).
An example [script](libri_labels.py) that generates labels for the Librispeech dataset from the tsv file produced by wav2vec_manifest.py can be used as follows:

```shell script
split=train
$ python libri_labels.py /path/to/tsv --output-dir /output/dir --output-name $split
```
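For reference, the script writes word-level (`$split.wrd`) and letter-level (`$split.ltr`) label files. A `.ltr` line typically spells out each word letter by letter with `|` marking word boundaries, roughly like the made-up example below:

```
A | M A N | S A I D | T O | T H E | U N I V E R S E |
```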
Fine-tuning on 100h of Librispeech with letter targets:

```shell script
$ fairseq-hydra-train \
    distributed_training.distributed_port=$PORT \
    task.data=/path/to/data \
    model.w2v_path=/path/to/model.pt \
    --config-dir /path/to/fairseq-py/examples/wav2vec/config/finetuning \
    --config-name base_100h
```

There are other config files in the config/finetuning directory that can be used to fine-tune on other splits.
You can specify the right config via the `--config-name` parameter.

Note: you can simulate 24 GPUs by using k GPUs and adding command line parameters (before `--config-dir`)
`distributed_training.distributed_world_size=k` `+optimization.update_freq='[x]'` where x = 24/k.

Decoding with a language model during training requires flashlight [python bindings](https://github.com/facebookresearch/flashlight/tree/master/bindings/python) (previously called [wav2letter](https://github.com/facebookresearch/wav2letter)).
If you want to use a language model, add `+criterion.wer_args='[/path/to/kenlm, /path/to/lexicon, 2, -1]'` to the command line.
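Putting the note above together with the fine-tuning command, a run with a KenLM language model enabled might look like the sketch below (the kenlm and lexicon paths are placeholders):

```shell script
$ fairseq-hydra-train \
    distributed_training.distributed_port=$PORT \
    task.data=/path/to/data \
    model.w2v_path=/path/to/model.pt \
    +criterion.wer_args='[/path/to/kenlm, /path/to/lexicon, 2, -1]' \
    --config-dir /path/to/fairseq-py/examples/wav2vec/config/finetuning \
    --config-name base_100h
```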
### Evaluating a CTC model:

Evaluating a CTC model with a language model requires [flashlight python bindings](https://github.com/facebookresearch/flashlight/tree/master/bindings/python) (previously called [wav2letter](https://github.com/facebookresearch/wav2letter)) to be installed.

The Fairseq transformer language model used in the wav2vec 2.0 paper can be obtained from the [wav2letter model repository](https://github.com/facebookresearch/wav2letter/tree/master/recipes/sota/2019).
Be sure to upper-case the language model vocab after downloading it.

The letter dictionary for pre-trained models can be found [here](https://dl.fbaipublicfiles.com/fairseq/wav2vec/dict.ltr.txt).

Next, run the evaluation command:

```shell script
subset=dev_other
python examples/speech_recognition/infer.py /checkpoint/abaevski/data/speech/libri/10h/wav2vec/raw --task audio_finetuning \
--nbest 1 --path /path/to/model --gen-subset $subset --results-path /path/to/save/results/for/sclite --w2l-decoder kenlm \
--lm-model /path/to/kenlm.bin --lm-weight 2 --word-score -1 --sil-weight 0 --criterion ctc --labels ltr --max-tokens 4000000 \
--post-process letter
```

To get raw numbers, use `--w2l-decoder viterbi` and omit the lexicon. To use the transformer language model, use `--w2l-decoder fairseqlm`.
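For example, a language-model-free (Viterbi) evaluation of the same setup could look roughly like the sketch below; the data path is a placeholder for wherever your fine-tuning manifests and labels live:

```shell script
subset=dev_other
python examples/speech_recognition/infer.py /path/to/finetuning/data --task audio_finetuning \
--nbest 1 --path /path/to/model --gen-subset $subset --results-path /path/to/save/results \
--w2l-decoder viterbi --criterion ctc --labels ltr --max-tokens 4000000 --post-process letter
```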
## Use wav2vec 2.0 with 🤗Transformers:

Wav2Vec2 is also available in the [🤗Transformers library](https://github.com/huggingface/transformers) since version 4.4.

Pretrained models can be found on the [hub](https://huggingface.co/models?filter=wav2vec2)
and documentation can be found [here](https://huggingface.co/transformers/master/model_doc/wav2vec2.html).

Usage example:

```python
# !pip install transformers
# !pip install datasets
import soundfile as sf
import torch
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# load pretrained model
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

librispeech_samples_ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# load audio
audio_input, sample_rate = sf.read(librispeech_samples_ds[0]["file"])

# pad input values and return pt tensor
input_values = processor(audio_input, sampling_rate=sample_rate, return_tensors="pt").input_values

# INFERENCE

# retrieve logits & take argmax
logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)

# transcribe
transcription = processor.decode(predicted_ids[0])

# FINE-TUNE

target_transcription = "A MAN SAID TO THE UNIVERSE I EXIST"

# encode labels
with processor.as_target_processor():
    labels = processor(target_transcription, return_tensors="pt").input_ids

# compute loss by passing labels
loss = model(input_values, labels=labels).loss
loss.backward()
```
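If you need to transcribe several clips at once, a minimal batched variant is sketched below; it is illustrative only (not from the original example), reuses `processor`, `model`, `sf`, and `librispeech_samples_ds` from the block above, and assumes 16 kHz mono audio:

```python
# read two clips and let the processor pad them to a common length
audio_batch = [sf.read(librispeech_samples_ds[i]["file"])[0] for i in range(2)]
inputs = processor(audio_batch, sampling_rate=16000, return_tensors="pt", padding=True)

# batched inference
with torch.no_grad():
    logits = model(inputs.input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
transcriptions = processor.batch_decode(predicted_ids)
print(transcriptions)
```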
# wav2vec

Example to train a wav2vec model as described in [wav2vec: Unsupervised Pre-training for Speech Recognition (Schneider et al., 2019)](https://arxiv.org/abs/1904.05862).

## Pre-trained models

Description | Dataset | Download
---|---|---
Wav2Vec large | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_large.pt)

#### Example usage:

```python
import torch
import fairseq

cp_path = '/path/to/wav2vec.pt'
model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([cp_path])
model = model[0]
model.eval()

# dummy single-channel 16 kHz input
wav_input_16khz = torch.randn(1, 10000)

# z: latent features from the convolutional feature extractor
z = model.feature_extractor(wav_input_16khz)
# c: context representations from the convolutional aggregator
c = model.feature_aggregator(z)
```
## Training a new model with the CLI tools

Given a directory containing wav files to be used for pretraining (we recommend splitting each file into separate files 10 to 30 seconds in length):

### Prepare training data manifest:

```
$ python examples/wav2vec/wav2vec_manifest.py /path/to/waves --dest /manifest/path --ext wav
```

### Train a wav2vec model:

```
$ python train.py /manifest/path --save-dir /model/path --num-workers 6 --fp16 --max-update 400000 --save-interval 1 --no-epoch-checkpoints \
--arch wav2vec --task audio_pretraining --min-lr 1e-06 --stop-min-lr 1e-09 --optimizer adam --lr 0.005 --lr-scheduler cosine \
--conv-feature-layers [(512, 10, 5), (512, 8, 4), (512, 4, 2), (512, 4, 2), (512, 4, 2), (512, 1, 1), (512, 1, 1)] \
--conv-aggregator-layers [(512, 2, 1), (512, 3, 1), (512, 4, 1), (512, 5, 1), (512, 6, 1), (512, 7, 1), (512, 8, 1), (512, 9, 1), (512, 10, 1), (512, 11, 1), (512, 12, 1), (512, 13, 1)] \
--skip-connections-agg --residual-scale 0.5 --log-compression --warmup-updates 500 --warmup-init-lr 1e-07 --criterion wav2vec --num-negatives 10 \
--max-sample-size 150000 --max-tokens 1500000 --skip-invalid-size-inputs-valid-test
```
### Run wav2vec2 pre-training on Google Cloud TPUs:

Wav2Vec2 is now supported on TPUs! Currently only pre-training is supported.

#### Using hydra on a v3-8:

```
$ OMP_NUM_THREADS=1 fairseq-hydra-train \
  task.data=/manifest/path \
  --config-dir /PATH/TO/FAIRSEQ/examples/wav2vec/config/pretraining \
  --config-name wav2vec2_large_librivox_tpu.yaml
```

#### Using command line arguments on a v3-8:

Note: running with command line arguments currently has a [known problem](https://github.com/pytorch/fairseq/issues/3741).

```
$ OMP_NUM_THREADS=1 python train.py /manifest/path --save-dir /model/path --num-workers 6 --fp16 --max-update 400000 --save-interval 1 --no-epoch-checkpoints \
--arch wav2vec2 --task audio_pretraining --min-lr 1e-06 --stop-min-lr 1e-09 --optimizer adam --lr 0.005 --lr-scheduler cosine \
--conv-feature-layers [(512, 10, 5), (512, 8, 4), (512, 4, 2), (512, 4, 2), (512, 4, 2), (512, 1, 1), (512, 1, 1)] \
--conv-aggregator-layers [(512, 2, 1), (512, 3, 1), (512, 4, 1), (512, 5, 1), (512, 6, 1), (512, 7, 1), (512, 8, 1), (512, 9, 1), (512, 10, 1), (512, 11, 1), (512, 12, 1), (512, 13, 1)] \
--skip-connections-agg --residual-scale 0.5 --log-compression --warmup-updates 500 --warmup-init-lr 1e-07 --criterion wav2vec --num-negatives 10 \
--max-sample-size 150000 --max-tokens 1500000 --skip-invalid-size-inputs-valid-test \
--tpu --distributed-world-size 8 --num-batch-buckets 3 --enable-padding \
--encoder-layerdrop 0 --mask-channel-prob 0.1
```
#### Using hydra on a pod slice (v3-N with N > 8):

```
$ OMP_NUM_THREADS=1 fairseq-hydra-train \
  task.data=/manifest/path \
  --config-dir /PATH/TO/FAIRSEQ/examples/wav2vec/config/pretraining \
  --config-name wav2vec2_large_librivox_tpu-pod.yaml  # edit distributed-world-size accordingly
```

#### Using command line arguments on a pod slice (v3-N with N > 8):

Note: running with command line arguments currently has a [known problem](https://github.com/pytorch/fairseq/issues/3741).

```
$ python -m torch_xla.distributed.xla_dist \
  --tpu ${TPUNAME} --conda-env=torch-xla-${TORCH_XLA_VERSION} --env OMP_NUM_THREADS=1 \
  -- \
python train.py /manifest/path --save-dir /model/path --num-workers 6 --fp16 --max-update 400000 --save-interval 1 --no-epoch-checkpoints \
--arch wav2vec2 --task audio_pretraining --min-lr 1e-06 --stop-min-lr 1e-09 --optimizer adam --lr 0.005 --lr-scheduler cosine \
--conv-feature-layers [(512, 10, 5), (512, 8, 4), (512, 4, 2), (512, 4, 2), (512, 4, 2), (512, 1, 1), (512, 1, 1)] \
--conv-aggregator-layers [(512, 2, 1), (512, 3, 1), (512, 4, 1), (512, 5, 1), (512, 6, 1), (512, 7, 1), (512, 8, 1), (512, 9, 1), (512, 10, 1), (512, 11, 1), (512, 12, 1), (512, 13, 1)] \
--skip-connections-agg --residual-scale 0.5 --log-compression --warmup-updates 500 --warmup-init-lr 1e-07 --criterion wav2vec --num-negatives 10 \
--max-sample-size 150000 --max-tokens 1500000 --skip-invalid-size-inputs-valid-test \
--tpu --distributed-world-size ${WORLD_SIZE} --num-batch-buckets 3 --enable-padding \
--encoder-layerdrop 0 --mask-channel-prob 0.1
```
### Extract embeddings from the downstream task data:

```
$ PYTHONPATH=/path/to/fairseq python examples/wav2vec/wav2vec_featurize.py --input /path/to/task/waves --output /path/to/output \
--model /model/path/checkpoint_best.pt --split train valid test
```
# vq-wav2vec

Example to train a vq-wav2vec model as described in [vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations (Baevski et al., 2019)](https://arxiv.org/abs/1910.05453).

These models are also used in [Effectiveness of self-supervised pre-training for speech recognition (Baevski et al., 2019)](https://arxiv.org/abs/1911.03912).

## Pre-trained models

Description | Dataset | Download
---|---|---
vq-wav2vec Gumbel | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/vq-wav2vec.pt)
vq-wav2vec K-means | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/vq-wav2vec_kmeans.pt)
RoBERTa on K-means codes | [Librispeech](http://www.openslr.org/12) | [download](https://dl.fbaipublicfiles.com/fairseq/wav2vec/bert_kmeans.tar)
#### Example usage:

```python
import torch
import fairseq

cp_path = '/path/to/vq-wav2vec.pt'
model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([cp_path])
model = model[0]
model.eval()

# dummy single-channel 16 kHz input
wav_input_16khz = torch.randn(1, 10000)

# continuous latent features from the convolutional encoder
z = model.feature_extractor(wav_input_16khz)
# discretize: idxs holds the codebook index per timestep and group
_, idxs = model.vector_quantizer.forward_idx(z)
print(idxs.shape)  # output: torch.Size([1, 60, 2]), 60 timesteps with 2 indexes corresponding to 2 groups in the model
```
## Training a new model with the CLI tools

Given a directory containing wav files to be used for pretraining (we recommend splitting each file into separate files 10 to 30 seconds in length):

### Prepare training data manifest:

```
$ python examples/wav2vec/wav2vec_manifest.py /path/to/waves --dest /manifest/path --ext wav
```

### Train a gumbel vq-wav2vec model:

```
$ python train.py /manifest/path --save-dir /model/path --num-workers 6 --fp16 --max-update 400000 \
--save-interval 1 --no-epoch-checkpoints --arch wav2vec --task audio_pretraining --min-lr 1e-06 --stop-min-lr 1e-09 \
--optimizer adam --lr 1e-05 --lr-scheduler cosine \
--conv-feature-layers [(512, 10, 5), (512, 8, 4), (512, 4, 2), (512, 4, 2), (512, 4, 2), (512, 1, 1), (512, 1, 1), (512, 1, 1)] \
--conv-aggregator-layers [(512, 2, 1), (512, 3, 1), (512, 4, 1), (512, 5, 1), (512, 6, 1), (512, 7, 1), (512, 8, 1), (512, 9, 1), (512, 10, 1), (512, 11, 1), (512, 12, 1), (512, 13, 1)] \
--activation gelu --offset auto --skip-connections-agg --residual-scale 0.5 \
--log-keys ["prob_perplexity","code_perplexity","temp"] --vq-type gumbel --vq-groups 2 --vq-depth 2 \
--combine-groups --vq-vars 320 --vq-temp (2,0.5,0.999995) --prediction-steps 12 --warmup-updates 1000 \
--warmup-init-lr 1e-07 --criterion wav2vec --num-negatives 10 --max-sample-size 150000 \
--max-tokens 300000 --cross-sample-negatives 0 --update-freq 1 --seed 2 --skip-invalid-size-inputs-valid-test
```

For k-means training, set `--vq-type` to "kmeans" and add the `--loss-weights [1]` argument. Pre-trained models were trained on 16 GPUs.

### Tokenize audio data (e.g. for BERT training):

```
$ PYTHONPATH=/path/to/fairseq python examples/wav2vec/vq-wav2vec_featurize.py --data-dir /manifest/path --output-dir /path/to/output \
--checkpoint /model/path/checkpoint_best.pt --split train valid test --extension tsv
```