stefan-it committed
Commit 125ed8f · verified · 1 Parent(s): d36a72c

docs: introduce section about tokenizer fixes

Files changed (1): README.md +12 -0
README.md CHANGED

```diff
@@ -15,6 +15,7 @@ inference: false
 # ModernBERT
 
 ## Table of Contents
+0. [Tokenizer Fix](#tokenizer-fix)
 1. [Model Summary](#model-summary)
 2. [Usage](#Usage)
 3. [Evaluation](#Evaluation)
@@ -23,6 +24,17 @@ inference: false
 6. [License](#license)
 7. [Citation](#citation)
 
+## Tokenizer Fix
+
+This repository is a fork of the original [ModernBERT Large](https://huggingface.co/answerdotai/ModernBERT-large).
+
+Due to suboptimal performance on token classification tasks - as reported [here](https://github.com/AnswerDotAI/ModernBERT/issues/149) - the following tokenizer fixes are applied:
+
+* `add_prefix_space` is set to `True`
+* `tokenizer_class` is set to `RobertaTokenizerFast`
+
+More experiments on token classification tasks can be found in my [ModernBERT NER repo](https://github.com/stefan-it/modern-bert-ner).
+
 ## Model Summary
 
 ModernBERT is a modernized bidirectional encoder-only Transformer model (BERT-style) pre-trained on 2 trillion tokens of English and code data with a native context length of up to 8,192 tokens. ModernBERT leverages recent architectural improvements such as:
```
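The two tokenizer fixes above amount to a small edit of the fork's `tokenizer_config.json`. A minimal sketch of that edit (the surrounding config fields here are illustrative assumptions, not copied from the actual file):

```python
import json

# Sketch of this fork's tokenizer fix: the two fields changed in
# tokenizer_config.json. Other fields of the real config are omitted,
# and the original tokenizer_class shown here is an assumption.
cfg = {
    "tokenizer_class": "PreTrainedTokenizerFast",  # assumed original value
    "model_max_length": 8192,
}

# Apply the two fixes described in the commit:
cfg["add_prefix_space"] = True
cfg["tokenizer_class"] = "RobertaTokenizerFast"

print(json.dumps(cfg, indent=2))
```

With `tokenizer_class` set to `RobertaTokenizerFast`, `AutoTokenizer` instantiates a tokenizer class that honors `add_prefix_space=True`, so non-initial words in pre-tokenized (word-level) input keep their leading-space marker, which is what token classification pipelines expect.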