shreeharsha and reach-vb committed
Commit aea7090 · verified · 0 parents

Duplicate from kyutai/moshiko-pytorch-bf16


Co-authored-by: Vaibhav Srivastav <reach-vb@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,108 @@
+ ---
+ # For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
+ # Doc / guide: https://huggingface.co/docs/hub/model-cards
+ license: cc-by-4.0
+ language:
+ - en
+ library_name: moshi
+ ---
+
+ # Model Card for Moshi
+
+ Moshi is a speech-text foundation model and full-duplex spoken dialogue framework.
+
+ ## Model Details
+
+ This repository contains the PyTorch version of the model, with weights in bf16 precision.
+
+ ### Model Description
+
+ Moshi is a speech-text foundation model that casts spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while separately modeling its own speech and that of the user as parallel streams. This allows for the removal of explicit speaker turns and the modeling of arbitrary conversational dynamics.
+ Moshi also predicts time-aligned text tokens as a prefix to audio tokens. This "Inner Monologue" method significantly improves the linguistic quality of generated speech and provides streaming speech recognition and text-to-speech. As a result, Moshi is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms (200ms in practice).
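+
+ As a rough, illustrative accounting of the 160ms figure (a sketch, assuming the 12.5Hz Mimi frame rate; the exact breakdown is given in the paper):
+
+ ```python
+ # Hypothetical back-of-the-envelope latency accounting, for illustration only.
+ frame_rate_hz = 12.5               # Mimi token frames per second
+ frame_ms = 1000 / frame_rate_hz    # 80 ms of audio per frame
+ theoretical_ms = 2 * frame_ms      # one frame of buffering + one generation step (assumed split)
+ print(theoretical_ms)              # 160.0 ms, matching the stated theoretical latency
+ ```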
+
+ - **Developed by:** Kyutai
+ - **Model type:** Multimodal speech-text foundation model
+ - **Language(s) (NLP):** English
+ - **License:** CC-BY 4.0
+
+ ### Model Sources
+
+ - **Repository:** [repo](https://github.com/kyutai-labs/moshi)
+ - **Paper:** [paper](http://kyutai.org/Moshi.pdf)
+ - **Demo:** [demo](https://moshi.chat/)
+
+ ## Uses
+
+ ### Direct Use
+
+ The model can be used as a conversational agent for casual conversations, basic facts and advice (e.g. recipes, trivia), roleplay, etc. However, the model has limited abilities for complex tasks, cannot access tools, and instead focuses on natural, low-latency interactions.
+
+ ### Downstream Use
+
+ Some components of the model can be used independently or repurposed relatively easily.
+ For instance, the Mimi codec is a state-of-the-art neural audio codec that combines semantic and acoustic information into audio tokens running at 12.5Hz with a bitrate of 1.1kbps, which makes it particularly well suited for training speech language models or text-to-speech systems. As for the main Moshi architecture, other downstream use cases would require some fine-tuning / domain adaptation.
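+
+ The quoted bitrate follows directly from Mimi's token layout (a sketch, assuming the 8 codebooks with 2048 entries each described in the paper):
+
+ ```python
+ # Mimi bitrate from its token layout.
+ codebooks = 8                 # residual quantizer levels used by Moshi
+ bits_per_codebook = 11        # log2(2048 codebook entries)
+ frame_rate_hz = 12.5          # token frames per second
+ bitrate_bps = codebooks * bits_per_codebook * frame_rate_hz
+ print(bitrate_bps)            # 1100.0 bits/s, i.e. the 1.1kbps quoted above
+ ```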
+
+ ### Out-of-Scope Use
+
+ The model is not intended to be used to impersonate other people, nor for any other malicious use of any kind.
+ This model is for research only, and we do not recommend using it to provide advice or to perform any professional duty.
+
+ ## Bias, Risks, and Limitations
+
+ The model has been trained with a few safeguards to try to limit potential toxic uses; however, our toxicity analysis shows that its textual generations fall in the middle of the range of existing models. It has some bias towards certain domains and topics that are over-represented in the training data. Its capabilities are relatively limited so far, and it is trained to produce only one voice to avoid impersonation. Still, more time and broader usage will be needed to establish its sociotechnical limitations.
+
+ ## How to Get Started with the Model
+
+ See the main [README](https://github.com/kyutai-labs/moshi) file.
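+
+ As a quick start, the checkpoint files in this repository can also be fetched programmatically. A minimal sketch using `huggingface_hub` (the repo id below assumes the upstream kyutai/moshiko-pytorch-bf16 repository that this one duplicates; inference entry points are documented in the moshi README):
+
+ ```python
+ # Sketch: download the three files added in this commit.
+ from huggingface_hub import hf_hub_download
+
+ repo_id = "kyutai/moshiko-pytorch-bf16"  # assumption: upstream repo id
+ weights = hf_hub_download(repo_id, "model.safetensors")  # bf16 Moshi weights (~15.4 GB)
+ audio_codec = hf_hub_download(repo_id, "tokenizer-e351c8d8-checkpoint125.safetensors")  # Mimi codec (assumed from file name)
+ text_tokenizer = hf_hub_download(repo_id, "tokenizer_spm_32k_3.model")  # SentencePiece text tokenizer
+ ```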
+
+ ## Training Details
+
+ ### Training Data
+
+ - Textual data: The underlying Helium model is trained on a mix of data; more precisely:
+
+   - 12.5% is high-quality data from the following curated sources: [Wikipedia](https://dumps.wikimedia.org/), Wikibooks, Wikisource, Wikinews, [StackExchange](https://archive.org/details/stackexchange) and the collection of scientific articles [peS2o](https://github.com/allenai/peS2o). For Wikipedia, we use five different dumps from 2017, 2018, 2019, 2021 and 2022.
+   - 87.5% is filtered web data from CommonCrawl, using the following crawls: 2018-30, 2019-04, 2019-30, 2020-05, 2020-34, 2021-04, 2021-31, 2022-05, 2022-33, 2023-40.
+
+ - Audio data:
+
+   - **Unsupervised audio dataset:** used for pre-training, this is a collection of 7 million hours of readily available audio content, consisting mostly of English speech. This training set is transcribed with [Whisper](https://github.com/openai/whisper) (large-v3 model).
+   - **The Fisher dataset:** used to enable multi-stream modeling. It consists of 2000 hours of phone conversations at 8kHz from Fisher, which we upsample to 24kHz using [AudioSR](https://audioldm.github.io/audiosr/).
+   - **Supervised multi-stream dataset:** a dataset of 170 hours of natural and scripted conversations between multiple pairs of participants, collected by Kyutai. This dataset is used to train the TTS system that creates synthetic data.
+   - **Synthetic data:** 20,000 hours of synthetic data generated by our TTS system, simulating a dialogue between Moshi and a user.
+
+ ### Training procedure and hyper-parameters
+
+ The different stages of the training procedure are detailed in the paper, along with the hyper-parameters.
+
+ ### Compute Infrastructure
+
+ The training was performed on 127 DGX nodes provided by Scaleway, for a total of 1016 Nvidia H100 GPUs (8 per node).
+
+ ## Citation
+
+ ```
+ @techreport{kyutai2024moshi,
+     author = {Alexandre D\'efossez and Laurent Mazar\'e and Manu Orsini and Am\'elie Royer and Patrick P\'erez and Herv\'e J\'egou and Edouard Grave and Neil Zeghidour},
+     title = {Moshi: a speech-text foundation model for real-time dialogue},
+     institution = {Kyutai},
+     year = {2024},
+     month = {September},
+     url = {http://kyutai.org/Moshi.pdf},
+ }
+ ```
+
+ ## Model Card Authors
+
+ Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, Neil Zeghidour
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5b835c664f3830bf453808cbca9bfbcc9de332c328cc01cbffdfbaba2a8838a7
+ size 15375500136
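
These three-line stubs are Git LFS pointers: the repository tracks only a hash and size, while the actual blob lives in LFS storage (per the .gitattributes rules above). A minimal sketch for checking a downloaded file against the pointer, using only the standard library (the expected hash is the one recorded in this commit):

```python
# Sketch: verify a downloaded LFS file against the sha256 in its pointer.
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file so multi-GB weights need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

expected = "5b835c664f3830bf453808cbca9bfbcc9de332c328cc01cbffdfbaba2a8838a7"
assert sha256_of("model.safetensors") == expected, "checksum mismatch"
```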
tokenizer-e351c8d8-checkpoint125.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:09b782f0629851a271227fb9d36db65c041790365f11bbe5d3d59369cf863f50
+ size 384644900
tokenizer_spm_32k_3.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:78d4336533ddc26f9acf7250d7fb83492152196c6ea4212c841df76933f18d2d
+ size 552778
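
Once downloaded, the SentencePiece model above can be loaded directly (a sketch, assuming the `sentencepiece` package; the "32k" in the file name suggests a 32,000-token vocabulary):

```python
# Sketch: load the text tokenizer shipped in this repo.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer_spm_32k_3.model")
print(sp.vocab_size())                           # expected ~32k, per the file name
print(sp.encode("Hello, Moshi!", out_type=str))  # subword pieces
```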