readme: sync with upstream hmByT5 repo
README.md
CHANGED
@@ -7,4 +7,41 @@ language:
- fi
- sv
- nl
---

# hmByT5

Upcoming Historic Multilingual ByT5 Model. It covers the following languages:

* English (British Library Corpus - Books)
* German (Europeana Newspaper)
* French (Europeana Newspaper)
* Finnish (Europeana Newspaper)
* Swedish (Europeana Newspaper)
* Dutch (Delpher Corpus)

# Pretraining

We pretrain hmByT5 on a v3-32 TPU Pod. Details about the training can be found
[here](https://github.com/stefan-it/hmByT5/tree/main/hmbyt5).

# Evaluation on Downstream Tasks (NER)

We use Flair to fine-tune hmByT5 on HIPE-2022 data. Details about the fine-tuning can be found
[here](https://github.com/stefan-it/hmByT5/tree/main/bench).
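
For orientation, a minimal Flair fine-tuning sketch is shown below. It is not the exact configuration from the bench scripts linked above: the HIPE-2022 dataset and language identifiers (`ajmc`, `en`), the checkpoint id and all hyperparameters are assumptions for illustration.

```python
# Minimal Flair fine-tuning sketch (illustrative; not the exact bench configuration).
from flair.datasets import NER_HIPE_2022
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# English part of the AjMC corpus from HIPE-2022 (identifiers are assumptions).
corpus = NER_HIPE_2022(dataset_name="ajmc", language="en")
label_dict = corpus.make_label_dictionary(label_type="ner")

# Use the hmByT5 encoder as transformer embeddings (checkpoint id from the logbook below).
embeddings = TransformerWordEmbeddings(
    model="stefan-it/byt5-small-historic-multilingual",
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
)

# Plain linear tag head on top of the fine-tuned embeddings (no CRF, no RNN).
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=label_dict,
    tag_type="ner",
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)

# Fine-tune with a small learning rate; hyperparameters are placeholders.
trainer = ModelTrainer(tagger, corpus)
trainer.fine_tune(
    "resources/taggers/hmbyt5-ajmc-en",
    learning_rate=5e-5,
    mini_batch_size=16,
    max_epochs=10,
)
```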

# **New**: Logbook

* 07.04.2022: Pretraining for 200k steps on the English corpus finished without crashes! TensorBoard logs can be found
  on the [Model Hub](https://huggingface.co/stefan-it/byt5-small-historic-multilingual/tensorboard). We
  also uploaded all checkpoints (we checkpoint every 25k steps)
  [here](https://huggingface.co/stefan-it/byt5-small-historic-multilingual); see the loading sketch after this list.
  Fine-tuning experiments (NER) on the English part of the AjMC corpus from HIPE-2022 are running.
* 03.04.2022: We start ByT5 pretraining with the official ByT5 implementation on a v3-32 TPU Pod, kindly provided by
  the [TPU Research Cloud](https://sites.research.google/trc/about/) (TRC) program. The plan is to pretrain on the
  English corpus for 200k steps and use the original ByT5 Small model as the init checkpoint.
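
Assuming the uploaded checkpoints are available in a Transformers-compatible format, they should load like any other ByT5 model. A minimal sketch using the standard classes (the repo id is taken from the logbook above; the example sentence and generation settings are illustrative only):

```python
# Minimal loading sketch with Hugging Face Transformers (standard ByT5/T5 classes).
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "stefan-it/byt5-small-historic-multilingual"  # repo id from the logbook above

# ByT5 works directly on UTF-8 bytes, so the tokenizer needs no learned vocabulary.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# Encode a sentence and let the (pretraining-only) model generate a continuation.
inputs = tokenizer("The British Library corpus contains historic books.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```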

# Acknowledgements

Research supported with Cloud TPUs from Google's [TPU Research Cloud](https://sites.research.google/trc/about/) (TRC).
Many thanks for providing access to the TPUs ❤️