docs: introduce section about tokenizer fixes
README.md

# ModernBERT

## Table of Contents

0. [Tokenizer Fix](#tokenizer-fix)
1. [Model Summary](#model-summary)
2. [Usage](#Usage)
3. [Evaluation](#Evaluation)
6. [License](#license)
7. [Citation](#citation)

## Tokenizer Fix

This repository is a fork of the original [ModernBERT Large](https://huggingface.co/answerdotai/ModernBERT-large).

Due to suboptimal performance on token classification tasks, as reported [here](https://github.com/AnswerDotAI/ModernBERT/issues/149), the following fixes to the tokenizer are applied:

* `add_prefix_space` is set to `True`
* `tokenizer_class` is set to `RobertaTokenizerFast`
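
Concretely, the two fixes above correspond to entries in the repository's `tokenizer_config.json`. The fragment below is a sketch showing only the two affected fields, not the complete file:

```json
{
  "add_prefix_space": true,
  "tokenizer_class": "RobertaTokenizerFast"
}
```

With `add_prefix_space` enabled, the first word of a sequence is tokenized as if it were preceded by a space, so word-initial tokens are encoded consistently regardless of their position — which matters when aligning labels to tokens in token classification.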
More experiments on token classification tasks can be found in my [ModernBERT NER repo](https://github.com/stefan-it/modern-bert-ner).
## Model Summary
ModernBERT is a modernized bidirectional encoder-only Transformer model (BERT-style) pre-trained on 2 trillion tokens of English and code data with a native context length of up to 8,192 tokens. ModernBERT leverages recent architectural improvements such as: