Add details from Suno
README.md
CHANGED
|
@@ -33,3 +33,42 @@ huggingface-cli download --local-dir-use-symlinks False --local-dir weights/ mlx
|
|
| 33 |
|
| 34 |
# Run example (large model)
|
| 35 |
python model.py --text="Hello world!" --path weights/ --model large
|
| 36 |
+
```
|
| 37 |
+
The rest of this model card was copied from [the original Bark repository](https://huggingface.co/suno/bark).
|
| 38 |
+
|
| 39 |
+
## Model Details
|
| 40 |
+
|
| 41 |
+
The following is additional information about the models released here.
|
| 42 |
+
|
| 43 |
+
Bark is a series of three transformer models that turn text into audio.
|
| 44 |
+
|
| 45 |
+
### Text to semantic tokens
|
| 46 |
+
- Input: text, tokenized with the [BERT tokenizer from Hugging Face](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer)
|
| 47 |
+
- Output: semantic tokens that encode the audio to be generated
|
| 48 |
+
|
| 49 |
+
### Semantic to coarse tokens
|
| 50 |
+
- Input: semantic tokens
|
| 51 |
+
- Output: tokens from the first two codebooks of the [EnCodec codec](https://github.com/facebookresearch/encodec) from Facebook
|
| 52 |
+
|
| 53 |
+
### Coarse to fine tokens
|
| 54 |
+
- Input: the first two codebooks from EnCodec
|
| 55 |
+
- Output: 8 codebooks from EnCodec
|
| 56 |
+
|
| 57 |
+
### Architecture
|
| 58 |
+
| Model | Parameters | Attention | Output Vocab size |
|
| 59 |
+
|:-------------------------:|:----------:|------------|:-----------------:|
|
| 60 |
+
| Text to semantic tokens | 80/300 M | Causal | 10,000 |
|
| 61 |
+
| Semantic to coarse tokens | 80/300 M | Causal | 2x 1,024 |
|
| 62 |
+
| Coarse to fine tokens | 80/300 M | Non-causal | 6x 1,024 |
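
The token flow through the three stages can be sketched with placeholder functions. This is a minimal illustration of shapes and vocabulary sizes only, not the Bark implementation: the function names, the random token generation, and the one-token-per-character and one-frame-per-token rates are all hypothetical. It also assumes, reading the table above, that the fine model keeps the two coarse codebooks and predicts the remaining six.

```python
import random

SEMANTIC_VOCAB = 10_000  # output vocab of the text-to-semantic model (from the table)
CODEBOOK_SIZE = 1_024    # entries per EnCodec codebook (from the table)
N_COARSE = 2             # codebooks produced by the semantic-to-coarse model
N_FINE = 8               # codebooks available after the coarse-to-fine model


def text_to_semantic(text, rng):
    # Hypothetical stand-in: one random semantic token per character.
    return [rng.randrange(SEMANTIC_VOCAB) for _ in text]


def semantic_to_coarse(semantic, rng):
    # Hypothetical stand-in: one frame per semantic token, two codebooks.
    return [
        [rng.randrange(CODEBOOK_SIZE) for _ in semantic]
        for _ in range(N_COARSE)
    ]


def coarse_to_fine(coarse, rng):
    # Assumption: the two coarse codebooks are kept and six more are added.
    n_frames = len(coarse[0])
    extra = [
        [rng.randrange(CODEBOOK_SIZE) for _ in range(n_frames)]
        for _ in range(N_FINE - N_COARSE)
    ]
    return coarse + extra


rng = random.Random(0)
semantic = text_to_semantic("Hello world!", rng)
coarse = semantic_to_coarse(semantic, rng)
fine = coarse_to_fine(coarse, rng)
print(len(semantic), len(coarse), len(fine))
```

In the real pipeline each placeholder is a transformer, and the eight codebooks are passed to the EnCodec decoder to produce a waveform.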
|
| 63 |
+
|
| 64 |
+
|
| 65 |
+
### Release date
|
| 66 |
+
April 2023
|
| 67 |
+
|
| 68 |
+
## Broader Implications
|
| 69 |
+
We anticipate that this model's text-to-audio capabilities can be used to improve accessibility tools in a variety of languages.
|
| 70 |
+
|
| 71 |
+
While we hope that this release will enable users to express their creativity and build applications that are a force
|
| 72 |
+
for good, we acknowledge that any text-to-audio model has the potential for dual use. While it is not straightforward
|
| 73 |
+
to clone the voices of known people with Bark, it can still be used for nefarious purposes. To further reduce the chances of unintended use of Bark,
|
| 74 |
+
we also release a simple classifier to detect Bark-generated audio with high accuracy (see notebooks section of the main repository).
|