MCFred committed (verified)
Commit c7f46ec · Parent(s): d4f3f55

Update README.md

Files changed (1): README.md (+16, -10)

README.md CHANGED
@@ -72,15 +72,16 @@ entities = ner_pipeline(text)
 
 for entity in entities:
     print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.3f})")
+```
 
-Supported Entity Types
+## Supported Entity Types
 The model predicts the following entity types using BIO tagging:
 
 PER (Person): Names of people
 ORG (Organization): Companies, institutions, organizations
 LOC (Location): Geographic locations, places
 
-Training Data
+## Training Data
 The model was trained on a combination of the following datasets:
 - **eriktks/conll2003**: 20,682 examples
 - **NbAiLab/norne_bokmaal-7**: 20,044 examples
@@ -99,7 +100,8 @@ The model was trained on a combination of the following datasets:
 - **MultiCoNER/multiconer_v2_Swedish (SV)**: 248,409 examples
 - **MultiCoNER/multiconer_v2_English (EN)**: 267,629 examples
 - **MultiCoNER/multiconer_v2_German (DE)**: 30,442 examples
-Dataset Statistics
+
+## Dataset Statistics
 
 Total examples: 943,804
 Average sequence length: 11.8 tokens
@@ -113,8 +115,8 @@ Label distribution:
 - I-ORG: 276,449 (2.5%)
 - I-LOC: 144,536 (1.3%)
 
-Training Details
-Training Hyperparameters
+## Training Details
+### Training Hyperparameters
 
 Base model: jhu-clsp/mmBERT-base
 Training epochs: 30
@@ -123,21 +125,24 @@ Learning rate: 2e-05
 Warmup steps: 5000
 Weight decay: 0.01
 
-Training Infrastructure
+### Training Infrastructure
 
 Mixed precision: False
 Gradient accumulation: 1
 Early stopping: Enabled with patience=3
 
-Usage Examples
-Basic NER Tagging
+## Usage Examples
+### Basic NER Tagging
 
+```python
 text = "Olof Palme var Sveriges statsminister."
 entities = ner_pipeline(text)
 # Output: [{'entity_group': 'PER', 'word': 'Olof Palme', 'start': 0, 'end': 10, 'score': 0.999}]
+```
 
-Batch Processing
+### Batch Processing
 
+```python
 texts = [
     "Microsoft fue fundada por Bill Gates.",
     "Angela Merkel var förbundskansler i Tyskland.",
@@ -149,8 +154,9 @@ for text in texts:
     print(f"Text: {text}")
     for entity in entities:
         print(f" {entity['word']} -> {entity['entity_group']}")
+```
 
-Limitations and Considerations
+## Limitations and Considerations
 
 Domain: Primarily trained on news and Wikipedia text; performance may vary on other domains
 Subword handling: The model uses subword tokenization; ensure proper aggregation
 
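The updated README describes the entity types via BIO tagging, where each token is labeled B-X (beginning of an entity of type X), I-X (inside that entity), or O (outside any entity). A minimal sketch of how those labels aggregate into entity spans; the `bio_to_spans` helper is illustrative only, not part of the model's code:

```python
def bio_to_spans(tokens, tags):
    """Merge BIO tags into (entity_type, text) spans.

    Consecutive B-X / I-X labels of the same type X group
    adjacent tokens into a single entity mention.
    """
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_type:  # close any open entity before starting a new one
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)  # continuation of the open entity
        else:  # "O", or an I- tag that does not continue the open entity
            if current_type:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type:  # flush an entity that runs to the end of the sentence
        spans.append((current_type, " ".join(current_tokens)))
    return spans

tokens = ["Olof", "Palme", "var", "Sveriges", "statsminister", "."]
tags   = ["B-PER", "I-PER", "O", "B-LOC", "O", "O"]
print(bio_to_spans(tokens, tags))  # [('PER', 'Olof Palme'), ('LOC', 'Sveriges')]
```

The `transformers` pipeline performs this grouping internally, which is why its output contains whole mentions under `entity_group` rather than per-token B/I labels.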
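The "Subword handling" caveat refers to rejoining subword pieces into whole words before reporting entities. A toy sketch, assuming WordPiece-style `##` continuation markers; the tokenizer actually used by the base model may mark continuations differently:

```python
def merge_subwords(pieces):
    """Rejoin subword tokens whose continuations are marked with '##'.

    Assumes WordPiece-style continuation markers, for illustration only.
    """
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]  # continuation piece: glue onto the previous word
        else:
            words.append(piece)  # start of a new word
    return words

print(merge_subwords(["Olof", "Palme", "var", "stats", "##minister"]))
# ['Olof', 'Palme', 'var', 'statsminister']
```

In practice, passing `aggregation_strategy="simple"` (or `"first"`/`"max"`) when constructing the `transformers` token-classification pipeline handles this grouping for you, so entities are reported as whole words with character offsets.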