stefan-it committed
Commit 125ed8f · verified · 1 Parent(s): d36a72c

docs: introduce section about tokenizer fixes

Files changed (1): README.md +12 -0
README.md CHANGED

```diff
@@ -15,6 +15,7 @@ inference: false
 # ModernBERT
 
 ## Table of Contents
+0. [Tokenizer Fix](#tokenizer-fix)
 1. [Model Summary](#model-summary)
 2. [Usage](#Usage)
 3. [Evaluation](#Evaluation)
@@ -23,6 +24,17 @@ inference: false
 6. [License](#license)
 7. [Citation](#citation)
 
+## Tokenizer Fix
+
+This repository is a fork of the original [ModernBERT Large](https://huggingface.co/answerdotai/ModernBERT-large).
+
+Due to suboptimal performance on token classification tasks - as reported [here](https://github.com/AnswerDotAI/ModernBERT/issues/149) - the following tokenizer fixes are applied:
+
+* `add_prefix_space` is set to `True`
+* `tokenizer_class` is set to `RobertaTokenizerFast`
+
+More experiments on token classification tasks can be found in my [ModernBERT NER repo](https://github.com/stefan-it/modern-bert-ner).
+
 ## Model Summary
 
 ModernBERT is a modernized bidirectional encoder-only Transformer model (BERT-style) pre-trained on 2 trillion tokens of English and code data with a native context length of up to 8,192 tokens. ModernBERT leverages recent architectural improvements such as:
```
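The two tokenizer fixes above amount to a small edit of the fork's `tokenizer_config.json`. A minimal sketch of that edit (the surrounding config fields here are illustrative assumptions, not copied from the actual file):

```python
import json

# Sketch of this fork's tokenizer fix: the two fields changed in
# tokenizer_config.json. Other fields of the real config are omitted,
# and the original tokenizer_class shown here is an assumption.
cfg = {
    "tokenizer_class": "PreTrainedTokenizerFast",  # assumed original value
    "model_max_length": 8192,
}

# Apply the two fixes described in the commit:
cfg["add_prefix_space"] = True
cfg["tokenizer_class"] = "RobertaTokenizerFast"

print(json.dumps(cfg, indent=2))
```

With `tokenizer_class` set to `RobertaTokenizerFast`, `AutoTokenizer` instantiates a tokenizer class that honors `add_prefix_space=True`, so non-initial words in pre-tokenized (word-level) input keep their leading-space marker, which is what token classification pipelines expect.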