Update README.md
```python
entities = ner_pipeline(text)

for entity in entities:
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.3f})")
```
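The example above assumes a `ner_pipeline` object created earlier in the card. A minimal construction sketch, where the checkpoint identifier is a placeholder rather than this model's confirmed repo id:

```python
# Sketch: building the token-classification pipeline used in the examples.
# The model id below is a placeholder; substitute the actual checkpoint.
from transformers import pipeline

ner_pipeline = pipeline(
    "token-classification",
    model="your-username/your-ner-checkpoint",  # placeholder repo id
    aggregation_strategy="simple",  # merge B-/I- subword predictions into entity spans
)
```

With `aggregation_strategy="simple"`, each returned dictionary carries the `entity_group`, `word`, `start`, `end`, and `score` keys used throughout the examples.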

## Supported Entity Types

The model predicts the following entity types using BIO tagging:

- **PER** (Person): Names of people
- **ORG** (Organization): Companies, institutions, and other organizations
- **LOC** (Location): Geographic locations and places
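Under the BIO scheme, each token receives either `O` or a `B-`/`I-` prefixed label marking the beginning or continuation of an entity span. A small hand-labeled illustration (not actual model output), reusing a sentence from the batch example further down:

```python
# Hand-labeled BIO illustration; labels follow the entity types above and are
# not produced by the model.
tokens = ["Angela", "Merkel", "var", "förbundskansler", "i", "Tyskland", "."]
labels = ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"]

for token, label in zip(tokens, labels):
    print(f"{token:16s}{label}")
```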

## Training Data

The model was trained on a combination of the following datasets:

- **eriktks/conll2003**: 20,682 examples
- **NbAiLab/norne_bokmaal-7**: 20,044 examples
- …
- **MultiCoNER/multiconer_v2_Swedish (SV)**: 248,409 examples
- **MultiCoNER/multiconer_v2_English (EN)**: 267,629 examples
- **MultiCoNER/multiconer_v2_German (DE)**: 30,442 examples

## Dataset Statistics

- Total examples: 943,804
- Average sequence length: 11.8 tokens

Label distribution:

- …
- I-ORG: 276,449 (2.5%)
- I-LOC: 144,536 (1.3%)
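Statistics like these can be recomputed per corpus with the `datasets` library. A sketch for one of the corpora listed above; the `tokens`/`ner_tags` column names follow CoNLL-2003 and are an assumption for the other datasets:

```python
# Sketch: recomputing example count, average sequence length, and label
# distribution for a single training corpus.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("eriktks/conll2003", split="train")
label_names = ds.features["ner_tags"].feature.names

lengths = [len(tokens) for tokens in ds["tokens"]]
counts = Counter(label_names[t] for tags in ds["ner_tags"] for t in tags)
total_tags = sum(counts.values())

print(f"Examples: {len(ds)}")
print(f"Average sequence length: {sum(lengths) / len(lengths):.1f} tokens")
for label, n in counts.most_common():
    print(f"{label}: {n} ({100 * n / total_tags:.1f}%)")
```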

## Training Details

### Training Hyperparameters

- Base model: jhu-clsp/mmBERT-base
- Training epochs: 30
- Learning rate: 2e-05
- Warmup steps: 5000
- Weight decay: 0.01

### Training Infrastructure

- Mixed precision: False
- Gradient accumulation: 1
- Early stopping: Enabled with patience=3
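These settings map naturally onto the Hugging Face `Trainer` API. The sketch below is an assumption about how such a run could be configured, not a statement of how this model was actually trained:

```python
# Sketch mapping the reported settings onto transformers TrainingArguments.
# Values come from the lists above; everything else is an assumption.
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="ner-mmbert",            # hypothetical output directory
    num_train_epochs=30,
    learning_rate=2e-5,
    warmup_steps=5000,
    weight_decay=0.01,
    fp16=False,                         # mixed precision disabled
    gradient_accumulation_steps=1,
    eval_strategy="epoch",              # assumption; needed for early stopping
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",  # assumption; the card does not specify
)

early_stopping = EarlyStoppingCallback(early_stopping_patience=3)
# The callback would be passed to Trainer(..., callbacks=[early_stopping]).
```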

## Usage Examples

### Basic NER Tagging

```python
text = "Olof Palme var Sveriges statsminister."
entities = ner_pipeline(text)
# Output: [{'entity_group': 'PER', 'word': 'Olof Palme', 'start': 0, 'end': 10, 'score': 0.999}]
```

### Batch Processing

```python
texts = [
    "Microsoft fue fundada por Bill Gates.",
    "Angela Merkel var förbundskansler i Tyskland.",
]

for text in texts:
    entities = ner_pipeline(text)
    print(f"Text: {text}")
    for entity in entities:
        print(f"  {entity['word']} -> {entity['entity_group']}")
```
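Looping per text works, but the pipeline also accepts a list of inputs (optionally with a `batch_size` argument), which is usually more efficient; a brief sketch:

```python
# Sketch: passing all texts to the pipeline at once instead of looping.
# batch_size is a standard pipeline argument; 8 is an arbitrary choice here.
all_entities = ner_pipeline(texts, batch_size=8)

for text, entities in zip(texts, all_entities):
    print(f"Text: {text}")
    for entity in entities:
        print(f"  {entity['word']} -> {entity['entity_group']}")
```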

## Limitations and Considerations

- **Domain**: Primarily trained on news and Wikipedia text; performance may vary on other domains
- **Subword handling**: The model uses subword tokenization; make sure subword predictions are aggregated into word-level entity spans (e.g., `aggregation_strategy="simple"` in the pipeline)