# Hebrew Manuscript Joint Entity-Role Extraction
Fine-tuned DictaBERT model for joint Named Entity Recognition (NER) and Role Classification on Hebrew manuscript MARC records.
## Model Description
This model performs two tasks simultaneously:
- Named Entity Recognition: Identifies PERSON entities in Hebrew text
- Role Classification: Classifies identified persons into roles (AUTHOR, SCRIBE, PATRON)
By sharing representations between the two related tasks, the joint training approach outperforms pipeline architectures (see the baseline comparison under Evaluation).
### Key Features
- Multi-task learning: Joint optimization of NER and classification objectives
- Domain-adapted: Fine-tuned on historical Hebrew manuscripts
- Weak supervision: Trained using distant supervision from MARC catalog records
- Resource-efficient: Trained on consumer hardware (M1 Mac) in ~1 hour
## Intended Use
Extract person names and their roles from Hebrew manuscript catalog records, particularly MARC format bibliographic descriptions.
Primary applications:
- Digital humanities: Manuscript cataloging
- Library science: Automated metadata extraction
- Historical research: Person-role relationship extraction
- Linked Open Data (LOD): Converting MARC to RDF triples
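For the LOD use case, extracted person-role pairs map naturally onto RDF triples. Below is a minimal sketch using `rdflib`; the `http://example.org/` namespace, the predicate names, and the input pairs are illustrative assumptions, not a fixed vocabulary from this project:

```python
# Sketch: serialize extracted (person, role) pairs as RDF triples.
# Namespace, predicates, and inputs are illustrative assumptions --
# adapt them to your catalog's actual LOD vocabulary.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, FOAF

EX = Namespace("http://example.org/manuscript/")

def pairs_to_rdf(manuscript_id, person_role_pairs):
    g = Graph()
    ms = EX[manuscript_id]
    for i, (name, role) in enumerate(person_role_pairs):
        person = EX[f"{manuscript_id}/person/{i}"]
        g.add((person, RDF.type, FOAF.Person))
        g.add((person, FOAF.name, Literal(name, lang="he")))
        # The role becomes a predicate linking manuscript to person
        g.add((ms, EX[role.lower()], person))
    return g

# Hypothetical output from the joint model
pairs = [("אברהם בן דוד", "AUTHOR"), ("משה הסופר", "SCRIBE")]
print(pairs_to_rdf("ms_001", pairs).serialize(format="turtle"))
```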
## Training Details

### Training Data
- Source: Hebrew manuscript MARC records
- Training samples: 8,794 (after data augmentation with entity substitution; see the sketch after this list)
- Validation samples: 760
- Test samples: 799
- Annotation method: Distant supervision from structured MARC fields
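The exact augmentation recipe is not spelled out in this card; the sketch below shows one plausible form of entity substitution, replacing an annotated PERSON span with a name from a pool while keeping the BIO tags aligned. The name pool and helper function are illustrative assumptions:

```python
import random

# Sketch of entity-substitution augmentation (an assumption about the
# recipe): swap the first annotated PERSON span for another name and
# rebuild the BIO tags so they stay aligned with the new tokens.
NAME_POOL = [["יעקב", "בן", "משה"], ["אברהם", "בן", "דוד"]]  # illustrative

def substitute_entity(tokens, labels):
    starts = [i for i, l in enumerate(labels) if l == "B-PERSON"]
    if not starts:
        return tokens, labels
    start = starts[0]
    end = start + 1
    while end < len(labels) and labels[end] == "I-PERSON":
        end += 1
    new_name = random.choice(NAME_POOL)
    new_tokens = tokens[:start] + new_name + tokens[end:]
    new_labels = (labels[:start]
                  + ["B-PERSON"] + ["I-PERSON"] * (len(new_name) - 1)
                  + labels[end:])
    return new_tokens, new_labels
```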
### Training Procedure
- Base model: dicta-il/dictabert
- Architecture: Joint model with shared BERT encoder + task-specific heads
- Epochs: 5
- Batch size: 4 (with gradient accumulation)
- Learning rate: 2e-5
- Lambda (task balance): 0.5
- Optimizer: AdamW
- Training time: ~1 hour on Apple M1 Mac
- Framework: PyTorch + Transformers
### Multi-Task Loss

L_total = λ * L_NER + (1 - λ) * L_classification

where λ = 0.5 balances the two tasks equally.
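The released code for the heads is not included in this card, but the following minimal PyTorch sketch illustrates the shared-encoder-plus-two-heads setup and the combined loss above. The class name, linear heads, and the [CLS]-based role head are illustrative assumptions, not the actual implementation:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class JointNerRoleModel(nn.Module):
    """Sketch of a shared encoder with two task heads (assumed design:
    token-level NER head + [CLS]-based role classification head)."""

    def __init__(self, base="dicta-il/dictabert",
                 num_ner=3, num_roles=3, lam=0.5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base)
        hidden = self.encoder.config.hidden_size
        self.ner_head = nn.Linear(hidden, num_ner)     # O / B-PERSON / I-PERSON
        self.role_head = nn.Linear(hidden, num_roles)  # AUTHOR / SCRIBE / PATRON
        self.lam = lam
        self.ce = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, input_ids, attention_mask,
                ner_labels=None, role_label=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        ner_logits = self.ner_head(out.last_hidden_state)          # per token
        role_logits = self.role_head(out.last_hidden_state[:, 0])  # [CLS]
        loss = None
        if ner_labels is not None and role_label is not None:
            l_ner = self.ce(ner_logits.view(-1, ner_logits.size(-1)),
                            ner_labels.view(-1))
            l_cls = self.ce(role_logits, role_label)
            # L_total = λ * L_NER + (1 - λ) * L_classification
            loss = self.lam * l_ner + (1 - self.lam) * l_cls
        return loss, ner_logits, role_logits
```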
## Evaluation

### Validation Set Performance
| Task | Metric | Score |
|---|---|---|
| NER | Precision | 88.00% |
| NER | Recall | 91.00% |
| NER | F1 Score | 89.40% |
| Classification | Accuracy | 100.00% |
### Test Set Performance
| Task | Metric | Score |
|---|---|---|
| NER | Precision | 47.00% |
| NER | Recall | 81.00% |
| NER | F1 Score | 59.41% |
| Classification | Accuracy | 100.00% |
**Note:** The gap between validation and test F1 suggests potential overfitting to the validation distribution. Future work will address this with more diverse test data.
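The NER numbers above are entity-level scores. The actual evaluation script is not included in this card, but a minimal sketch of computing such scores with `seqeval` (an assumption about the setup) looks like this:

```python
# Sketch: entity-level precision/recall/F1 over BIO tag sequences with
# seqeval (install with `pip install seqeval`); labels are illustrative.
from seqeval.metrics import precision_score, recall_score, f1_score

y_true = [["B-PERSON", "I-PERSON", "O", "O"]]
y_pred = [["B-PERSON", "I-PERSON", "O", "B-PERSON"]]

print("P:", precision_score(y_true, y_pred))
print("R:", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
```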
### Comparison to Baseline
| Model | Validation F1 | Improvement |
|---|---|---|
| Baseline NER | 55.64% | - |
| + CRF Layer | 84.39% | +28.75 pp |
| Joint Model (this work) | 89.40% | +33.76 pp |
## Usage

### Installation

```bash
pip install transformers torch
```
### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_name = "alexgoldberg/hebrew-manuscript-joint-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Example Hebrew text from a manuscript catalog
# ("Written by R. Yaakov ben Moshe")
text = "נכתב על ידי ר' יעקב בן משה"

# Tokenize
inputs = tokenizer(text, return_tensors="pt")

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Decode predictions
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

# Print results, skipping special tokens
for token, label in zip(tokens, labels):
    if token not in ("[CLS]", "[SEP]", "[PAD]"):
        print(f"{token}: {label}")
```
### Advanced Usage: Extract Entities

```python
def extract_entities(text, model, tokenizer):
    """Extract PERSON entities from Hebrew text."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits, dim=-1)

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = [model.config.id2label[p.item()] for p in predictions[0]]

    entities = []
    current_entity = []
    for token, label in zip(tokens, labels):
        if label == "B-PERSON":
            if current_entity:
                entities.append(" ".join(current_entity))
            current_entity = [token]
        elif label == "I-PERSON" and current_entity:
            # WordPiece continuations ("##...") attach to the previous
            # token; whole tokens start a new word within the entity
            if token.startswith("##"):
                current_entity[-1] += token[2:]
            else:
                current_entity.append(token)
        else:
            if current_entity:
                entities.append(" ".join(current_entity))
            current_entity = []
    if current_entity:
        entities.append(" ".join(current_entity))
    return entities

# Example ("R. Avraham ben David wrote it and Moshe the scribe copied it")
text = "כתב ר' אברהם בן דוד והעתיק משה הסופר"
entities = extract_entities(text, model, tokenizer)
print("Found entities:", entities)
# Output: ['אברהם בן דוד', 'משה הסופר']
```
## Limitations
- Domain-specific: Optimized for Hebrew manuscript catalog records; performance may degrade on other text types
- Single entity type: Only identifies PERSON entities (not PLACE, DATE, WORK, etc.)
- Role coverage: Limited to AUTHOR, SCRIBE, PATRON roles
- Historical Hebrew: Best performance on historical/rabbinical Hebrew; may underperform on modern Hebrew
- Test set gap: Validation F1 (89.40%) significantly higher than test F1 (59.41%), indicating potential overfitting
## Ethical Considerations
- Bias: Training data derived from library catalogs may reflect historical biases in manuscript preservation
- Cultural sensitivity: Model handles religious and cultural content; users should apply appropriate domain expertise
- Accuracy: Not suitable for critical applications without human review
## Citation
If you use this model, please cite:
```bibtex
@misc{goldberg2025hebrewjoint,
  author    = {Goldberg, Alexander},
  title     = {Hebrew Manuscript Joint Entity-Role Extraction Model},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/alexgoldberg/hebrew-manuscript-joint-ner}
}
```
## Contact
- Author: Alexander Goldberg
- Institution: Technion - Israel Institute of Technology
- Email: alexgoldberg@cs.technion.ac.il
- Paper: [Link to paper when published]
## Acknowledgments
- Base model: DictaBERT by Dicta team
- Dataset: Hebrew manuscript MARC records from multiple libraries
- Framework: HuggingFace Transformers
## License
MIT License - See LICENSE file for details.
## Model Card Authors
Alexander Goldberg