Update README.md
```python
entities = ner_pipeline(text)

for entity in entities:
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.3f})")
```
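The example above assumes a `ner_pipeline` object created earlier in the card. A minimal construction sketch, where the checkpoint identifier is a placeholder rather than this model's confirmed repo id:

```python
# Sketch: building the token-classification pipeline used in the examples.
# The model id below is a placeholder; substitute the actual checkpoint.
from transformers import pipeline

ner_pipeline = pipeline(
    "token-classification",
    model="your-username/your-ner-checkpoint",  # placeholder repo id
    aggregation_strategy="simple",  # merge B-/I- subword predictions into entity spans
)
```

With `aggregation_strategy="simple"`, each returned dictionary carries the `entity_group`, `word`, `start`, `end`, and `score` keys used throughout the examples.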

## Supported Entity Types

The model predicts the following entity types using BIO tagging:

- **PER** (Person): Names of people
- **ORG** (Organization): Companies, institutions, and other organizations
- **LOC** (Location): Geographic locations and places
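Under the BIO scheme, each token receives either `O` or a `B-`/`I-` prefixed label marking the beginning or continuation of an entity span. A small hand-labeled illustration (not actual model output), reusing a sentence from the batch example further down:

```python
# Hand-labeled BIO illustration; labels follow the entity types above and are
# not produced by the model.
tokens = ["Angela", "Merkel", "var", "förbundskansler", "i", "Tyskland", "."]
labels = ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"]

for token, label in zip(tokens, labels):
    print(f"{token:16s}{label}")
```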

## Training Data

The model was trained on a combination of the following datasets:

- **eriktks/conll2003**: 20,682 examples
- **NbAiLab/norne_bokmaal-7**: 20,044 examples
- …
- **MultiCoNER/multiconer_v2_Swedish (SV)**: 248,409 examples
- **MultiCoNER/multiconer_v2_English (EN)**: 267,629 examples
- **MultiCoNER/multiconer_v2_German (DE)**: 30,442 examples

## Dataset Statistics

- Total examples: 943,804
- Average sequence length: 11.8 tokens

Label distribution:

- …
- I-ORG: 276,449 (2.5%)
- I-LOC: 144,536 (1.3%)
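Statistics like these can be recomputed per corpus with the `datasets` library. A sketch for one of the corpora listed above; the `tokens`/`ner_tags` column names follow CoNLL-2003 and are an assumption for the other datasets:

```python
# Sketch: recomputing example count, average sequence length, and label
# distribution for a single training corpus.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("eriktks/conll2003", split="train")
label_names = ds.features["ner_tags"].feature.names

lengths = [len(tokens) for tokens in ds["tokens"]]
counts = Counter(label_names[t] for tags in ds["ner_tags"] for t in tags)
total_tags = sum(counts.values())

print(f"Examples: {len(ds)}")
print(f"Average sequence length: {sum(lengths) / len(lengths):.1f} tokens")
for label, n in counts.most_common():
    print(f"{label}: {n} ({100 * n / total_tags:.1f}%)")
```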

## Training Details

### Training Hyperparameters

- Base model: jhu-clsp/mmBERT-base
- Training epochs: 30
- Learning rate: 2e-05
- Warmup steps: 5000
- Weight decay: 0.01

### Training Infrastructure

- Mixed precision: False
- Gradient accumulation: 1
- Early stopping: Enabled with patience=3
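These settings map naturally onto the Hugging Face `Trainer` API. The sketch below is an assumption about how such a run could be configured, not a statement of how this model was actually trained:

```python
# Sketch mapping the reported settings onto transformers TrainingArguments.
# Values come from the lists above; everything else is an assumption.
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="ner-mmbert",            # hypothetical output directory
    num_train_epochs=30,
    learning_rate=2e-5,
    warmup_steps=5000,
    weight_decay=0.01,
    fp16=False,                         # mixed precision disabled
    gradient_accumulation_steps=1,
    eval_strategy="epoch",              # assumption; needed for early stopping
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",  # assumption; the card does not specify
)

early_stopping = EarlyStoppingCallback(early_stopping_patience=3)
# The callback would be passed to Trainer(..., callbacks=[early_stopping]).
```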

## Usage Examples

### Basic NER Tagging

```python
text = "Olof Palme var Sveriges statsminister."
entities = ner_pipeline(text)
# Output: [{'entity_group': 'PER', 'word': 'Olof Palme', 'start': 0, 'end': 10, 'score': 0.999}]
```

### Batch Processing

```python
texts = [
    "Microsoft fue fundada por Bill Gates.",
    "Angela Merkel var förbundskansler i Tyskland.",
]

for text in texts:
    entities = ner_pipeline(text)
    print(f"Text: {text}")
    for entity in entities:
        print(f"  {entity['word']} -> {entity['entity_group']}")
```
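Looping per text works, but the pipeline also accepts a list of inputs (optionally with a `batch_size` argument), which is usually more efficient; a brief sketch:

```python
# Sketch: passing all texts to the pipeline at once instead of looping.
# batch_size is a standard pipeline argument; 8 is an arbitrary choice here.
all_entities = ner_pipeline(texts, batch_size=8)

for text, entities in zip(texts, all_entities):
    print(f"Text: {text}")
    for entity in entities:
        print(f"  {entity['word']} -> {entity['entity_group']}")
```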

## Limitations and Considerations

- **Domain**: Primarily trained on news and Wikipedia text; performance may vary on other domains
- **Subword handling**: The model uses subword tokenization; make sure subword predictions are aggregated into word-level entity spans (e.g., `aggregation_strategy="simple"` in the pipeline)