# Hebrew Manuscript Joint Entity-Role Extraction
Fine-tuned DictaBERT model for joint Named Entity Recognition (NER) and Role Classification on Hebrew manuscript MARC records.
## Model Description
This model performs two tasks simultaneously:
- Named Entity Recognition: Identifies PERSON entities in Hebrew text
- Role Classification: Classifies identified persons into roles (AUTHOR, SCRIBE, PATRON)
By sharing representations between the two related tasks, the joint training approach outperforms pipeline architectures (see the baseline comparison under Evaluation).
### Key Features
- Multi-task learning: Joint optimization of NER and classification objectives
- Domain-adapted: Fine-tuned on historical Hebrew manuscripts
- Weak supervision: Trained using distant supervision from MARC catalog records
- Resource-efficient: Trained on consumer hardware (M1 Mac) in ~1 hour
## Intended Use
Extract person names and their roles from Hebrew manuscript catalog records, particularly MARC format bibliographic descriptions.
Primary applications:
- Digital humanities: Manuscript cataloging
- Library science: Automated metadata extraction
- Historical research: Person-role relationship extraction
- Linked Open Data (LOD): Converting MARC to RDF triples
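For the LOD use case, extracted person-role pairs map naturally onto RDF triples. Below is a minimal sketch using `rdflib`; the `http://example.org/` namespace, the predicate names, and the input pairs are illustrative assumptions, not a fixed vocabulary from this project:

```python
# Sketch: serialize extracted (person, role) pairs as RDF triples.
# Namespace, predicates, and inputs are illustrative assumptions --
# adapt them to your catalog's actual LOD vocabulary.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, FOAF

EX = Namespace("http://example.org/manuscript/")

def pairs_to_rdf(manuscript_id, person_role_pairs):
    g = Graph()
    ms = EX[manuscript_id]
    for i, (name, role) in enumerate(person_role_pairs):
        person = EX[f"{manuscript_id}/person/{i}"]
        g.add((person, RDF.type, FOAF.Person))
        g.add((person, FOAF.name, Literal(name, lang="he")))
        # The role becomes a predicate linking manuscript to person
        g.add((ms, EX[role.lower()], person))
    return g

# Hypothetical output from the joint model
pairs = [("אברהם בן דוד", "AUTHOR"), ("משה הסופר", "SCRIBE")]
print(pairs_to_rdf("ms_001", pairs).serialize(format="turtle"))
```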
## Training Details

### Training Data
- Source: Hebrew manuscript MARC records
- Training samples: 8,794 (after data augmentation with entity substitution; see the sketch after this list)
- Validation samples: 760
- Test samples: 799
- Annotation method: Distant supervision from structured MARC fields
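The exact augmentation recipe is not spelled out in this card; the sketch below shows one plausible form of entity substitution, replacing an annotated PERSON span with a name from a pool while keeping the BIO tags aligned. The name pool and helper function are illustrative assumptions:

```python
import random

# Sketch of entity-substitution augmentation (an assumption about the
# recipe): swap the first annotated PERSON span for another name and
# rebuild the BIO tags so they stay aligned with the new tokens.
NAME_POOL = [["יעקב", "בן", "משה"], ["אברהם", "בן", "דוד"]]  # illustrative

def substitute_entity(tokens, labels):
    starts = [i for i, l in enumerate(labels) if l == "B-PERSON"]
    if not starts:
        return tokens, labels
    start = starts[0]
    end = start + 1
    while end < len(labels) and labels[end] == "I-PERSON":
        end += 1
    new_name = random.choice(NAME_POOL)
    new_tokens = tokens[:start] + new_name + tokens[end:]
    new_labels = (labels[:start]
                  + ["B-PERSON"] + ["I-PERSON"] * (len(new_name) - 1)
                  + labels[end:])
    return new_tokens, new_labels
```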
### Training Procedure
- Base model: dicta-il/dictabert
- Architecture: Joint model with shared BERT encoder + task-specific heads
- Epochs: 5
- Batch size: 4 (with gradient accumulation)
- Learning rate: 2e-5
- Lambda (task balance): 0.5
- Optimizer: AdamW
- Training time: ~1 hour on Apple M1 Mac
- Framework: PyTorch + Transformers
### Multi-Task Loss

L_total = λ * L_NER + (1 - λ) * L_classification

where λ = 0.5 balances the two tasks equally.
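The released code for the heads is not included in this card, but the following minimal PyTorch sketch illustrates the shared-encoder-plus-two-heads setup and the combined loss above. The class name, linear heads, and the [CLS]-based role head are illustrative assumptions, not the actual implementation:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class JointNerRoleModel(nn.Module):
    """Sketch of a shared encoder with two task heads (assumed design:
    token-level NER head + [CLS]-based role classification head)."""

    def __init__(self, base="dicta-il/dictabert",
                 num_ner=3, num_roles=3, lam=0.5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base)
        hidden = self.encoder.config.hidden_size
        self.ner_head = nn.Linear(hidden, num_ner)     # O / B-PERSON / I-PERSON
        self.role_head = nn.Linear(hidden, num_roles)  # AUTHOR / SCRIBE / PATRON
        self.lam = lam
        self.ce = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, input_ids, attention_mask,
                ner_labels=None, role_label=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        ner_logits = self.ner_head(out.last_hidden_state)          # per token
        role_logits = self.role_head(out.last_hidden_state[:, 0])  # [CLS]
        loss = None
        if ner_labels is not None and role_label is not None:
            l_ner = self.ce(ner_logits.view(-1, ner_logits.size(-1)),
                            ner_labels.view(-1))
            l_cls = self.ce(role_logits, role_label)
            # L_total = λ * L_NER + (1 - λ) * L_classification
            loss = self.lam * l_ner + (1 - self.lam) * l_cls
        return loss, ner_logits, role_logits
```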
## Evaluation

### Validation Set Performance
| Task | Metric | Score |
|---|---|---|
| NER | Precision | 88.00% |
| NER | Recall | 91.00% |
| NER | F1 Score | 89.40% |
| Classification | Accuracy | 100.00% |
### Test Set Performance
| Task | Metric | Score |
|---|---|---|
| NER | Precision | 47.00% |
| NER | Recall | 81.00% |
| NER | F1 Score | 59.41% |
| Classification | Accuracy | 100.00% |
**Note:** The gap between validation and test F1 suggests potential overfitting to the validation distribution. Future work will address this with more diverse test data.
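The NER numbers above are entity-level scores. The actual evaluation script is not included in this card, but a minimal sketch of computing such scores with `seqeval` (an assumption about the setup) looks like this:

```python
# Sketch: entity-level precision/recall/F1 over BIO tag sequences with
# seqeval (install with `pip install seqeval`); labels are illustrative.
from seqeval.metrics import precision_score, recall_score, f1_score

y_true = [["B-PERSON", "I-PERSON", "O", "O"]]
y_pred = [["B-PERSON", "I-PERSON", "O", "B-PERSON"]]

print("P:", precision_score(y_true, y_pred))
print("R:", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
```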
### Comparison to Baseline
| Model | Validation F1 | Improvement |
|---|---|---|
| Baseline NER | 55.64% | - |
| + CRF Layer | 84.39% | +28.75 pp |
| Joint Model (this work) | 89.40% | +33.76 pp |
## Usage

### Installation

```bash
pip install transformers torch
```
### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_name = "alexgoldberg/hebrew-manuscript-joint-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Example Hebrew text from a manuscript catalog
# ("Written by R. Yaakov ben Moshe")
text = "נכתב על ידי ר' יעקב בן משה"

# Tokenize
inputs = tokenizer(text, return_tensors="pt")

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Decode predictions
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

# Print results, skipping special tokens
for token, label in zip(tokens, labels):
    if token not in ("[CLS]", "[SEP]", "[PAD]"):
        print(f"{token}: {label}")
```
### Advanced Usage: Extract Entities

```python
def extract_entities(text, model, tokenizer):
    """Extract PERSON entities from Hebrew text."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits, dim=-1)

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = [model.config.id2label[p.item()] for p in predictions[0]]

    entities = []
    current_entity = []
    for token, label in zip(tokens, labels):
        if label == "B-PERSON":
            if current_entity:
                entities.append(" ".join(current_entity))
            current_entity = [token]
        elif label == "I-PERSON" and current_entity:
            # WordPiece continuations ("##...") attach to the previous
            # token; whole tokens start a new word within the entity
            if token.startswith("##"):
                current_entity[-1] += token[2:]
            else:
                current_entity.append(token)
        else:
            if current_entity:
                entities.append(" ".join(current_entity))
            current_entity = []
    if current_entity:
        entities.append(" ".join(current_entity))
    return entities

# Example ("R. Avraham ben David wrote it and Moshe the scribe copied it")
text = "כתב ר' אברהם בן דוד והעתיק משה הסופר"
entities = extract_entities(text, model, tokenizer)
print("Found entities:", entities)
# Output: ['אברהם בן דוד', 'משה הסופר']
```
## Limitations
- Domain-specific: Optimized for Hebrew manuscript catalog records; performance may degrade on other text types
- Single entity type: Only identifies PERSON entities (not PLACE, DATE, WORK, etc.)
- Role coverage: Limited to AUTHOR, SCRIBE, PATRON roles
- Historical Hebrew: Best performance on historical/rabbinical Hebrew; may underperform on modern Hebrew
- Test set gap: Validation F1 (89.40%) significantly higher than test F1 (59.41%), indicating potential overfitting
## Ethical Considerations
- Bias: Training data derived from library catalogs may reflect historical biases in manuscript preservation
- Cultural sensitivity: Model handles religious and cultural content; users should apply appropriate domain expertise
- Accuracy: Not suitable for critical applications without human review
## Citation
If you use this model, please cite:
```bibtex
@misc{goldberg2025hebrewjoint,
  author    = {Goldberg, Alexander},
  title     = {Hebrew Manuscript Joint Entity-Role Extraction Model},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/alexgoldberg/hebrew-manuscript-joint-ner}
}
```
## Contact
- Author: Alexander Goldberg
- Institution: Technion - Israel Institute of Technology
- Email: alexgoldberg@cs.technion.ac.il
- Paper: [Link to paper when published]
## Acknowledgments
- Base model: DictaBERT by Dicta team
- Dataset: Hebrew manuscript MARC records from multiple libraries
- Framework: HuggingFace Transformers
## License
MIT License - See LICENSE file for details.
## Model Card Authors
Alexander Goldberg