# Invoice Field Extraction Model

## Overview
This is a DistilBERT-based token classification model fine-tuned for extracting key fields from invoices. The model performs Named Entity Recognition (NER) to identify and extract:
- Invoice metadata: Invoice number, date
- Customer information: Name, address
- Financial details: Total price, individual item prices
- Line items: Product/service names
The model reaches 100% precision and recall on the validation set. Scores this perfect should be read in light of the dataset: the invoices are templated and were annotated automatically with regex patterns (see Dataset and Limitations), so real-world accuracy will depend on how closely your documents resemble the training data.
## Model Details

### Model Architecture
- Base Model: DistilBERT (distilbert-base-uncased)
- Task: Token Classification (Named Entity Recognition)
- Model Size: ~265 MB (safetensors format)
- Architecture Parameters (verifiable from the saved config, as shown below):
  - Hidden dimension: 768
  - Number of layers: 6
  - Number of attention heads: 12
  - Maximum sequence length: 512 tokens
  - Vocabulary size: 30,522
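These values match the `distilbert-base-uncased` defaults and can be double-checked from the saved configuration (a quick sanity check; the attribute names below are the standard `DistilBertConfig` fields):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("path/to/final_model")

# Expected: 768, 6, 12, 512, 30522
print(config.dim, config.n_layers, config.n_heads,
      config.max_position_embeddings, config.vocab_size)
```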
### Training Configuration
- Framework: PyTorch with Hugging Face Transformers
- Optimizer: AdamW with linear warmup
- Training Epochs: 10
- Learning Rate: 2e-5
- Batch Size: 8
- Warmup Steps: 500
- Weight Decay: 0.01
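These hyperparameters map one-to-one onto Hugging Face `TrainingArguments`; a minimal sketch of an equivalent setup (the output path is illustrative, and the actual training script lives in the repository):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="models/invoice-extractor",  # illustrative path
    num_train_epochs=10,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    warmup_steps=500,   # linear warmup; AdamW is the Trainer default optimizer
    weight_decay=0.01,
)
```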
## Performance

### Overall Metrics (Epoch 10 - Final Model)
| Metric | Score |
|---|---|
| Precision | 100% |
| Recall | 100% |
| F1-Score | 1.00 |
| Exact Match Accuracy | 100% |
### Per-Entity Performance
| Entity Type | Precision | Recall | F1-Score |
|---|---|---|---|
| INVOICE_NUMBER | 100% | 100% | 1.00 |
| DATE | 100% | 100% | 1.00 |
| CUSTOMER_NAME | 100% | 100% | 1.00 |
| CUSTOMER_ADDRESS | 100% | 100% | 1.00 |
| ITEM | 100% | 100% | 1.00 |
| ITEM_PRICE | 100% | 100% | 1.00 |
| TOTAL | 100% | 100% | 1.00 |
### Training History
The model showed strong convergence:
- Epoch 1: F1 = 0.055 (initial learning)
- Epoch 7: F1 = 0.665 (rapid improvement phase)
- Epoch 8: F1 = 0.957 (near-optimal performance)
- Epoch 9: F1 = 0.998 (convergence)
- Epoch 10: F1 = 1.000 (final model)
## Label Schema
The model uses BIO (Begin-Inside-Outside) tagging with the following labels:
| Label ID | Label | Description |
|---|---|---|
| 0 | O | Outside - not part of any entity |
| 1 | B-TOTAL | Beginning of total price |
| 2 | I-TOTAL | Inside total price |
| 3 | B-ITEM | Beginning of item/product name |
| 4 | I-ITEM | Inside item/product name |
| 5 | B-ITEM_PRICE | Beginning of item price |
| 6 | I-ITEM_PRICE | Inside item price |
| 7 | B-CUSTOMER_NAME | Beginning of customer name |
| 8 | I-CUSTOMER_NAME | Inside customer name |
| 9 | B-CUSTOMER_ADDRESS | Beginning of customer address |
| 10 | I-CUSTOMER_ADDRESS | Inside customer address |
| 11 | B-DATE | Beginning of invoice date |
| 12 | I-DATE | Inside invoice date |
| 13 | B-INVOICE_NUMBER | Beginning of invoice number |
| 14 | I-INVOICE_NUMBER | Inside invoice number |
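As an illustration, a fragment of the example invoice from the usage section below would carry word-level tags like these (hand-written for clarity, not model output):

```python
# (word, BIO tag) pairs for part of the example invoice text
tagged = [
    ("Invoice", "O"),
    ("INV-2024-001", "B-INVOICE_NUMBER"),
    ("dated", "O"),
    ("January", "B-DATE"),
    ("15,", "I-DATE"),
    ("2024", "I-DATE"),
    ("for", "O"),
    ("John", "B-CUSTOMER_NAME"),
    ("Smith", "I-CUSTOMER_NAME"),
]
```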
## Dataset
- Source: Company Documents Dataset (Kaggle)
- Document Type: Invoice PDFs
- Annotation Method:
  - Automatic annotation using regex patterns (see the sketch after this list)
  - BIO format for token-level labels
  - Manual review and validation
- Train/Val/Test Split: 70/15/15
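A regex-based annotation pass of the kind described above might look like this (the patterns are illustrative placeholders, not the rules used to build the dataset):

```python
import re

# Illustrative patterns keyed by entity type; matches would then be
# converted to B-/I- tags on the overlapping tokens.
PATTERNS = {
    "INVOICE_NUMBER": re.compile(r"\bINV-\d{4}-\d{3}\b"),
    "TOTAL": re.compile(r"Total:\s*\$[\d,]+\.\d{2}"),
}

text = "Invoice INV-2024-001 ... Total: $1,029.98"
for entity, pattern in PATTERNS.items():
    for match in pattern.finditer(text):
        print(entity, match.group(), match.span())
```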
## How to Use

### Basic Inference
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_name = "path/to/final_model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Example invoice text
text = "Invoice INV-2024-001 dated January 15, 2024 for John Smith at 123 Main St. Items: Laptop $999.99, Mouse $29.99. Total: $1,029.98"

# Tokenize and predict
tokens = text.split()
encoding = tokenizer(
    tokens,
    is_split_into_words=True,
    max_length=512,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)
with torch.no_grad():
    outputs = model(**encoding)
predictions = torch.argmax(outputs.logits, dim=-1)

# Get label names
id2label = model.config.id2label
predicted_labels = [id2label[pred.item()] for pred in predictions[0]]
```
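Note that `predicted_labels` is indexed by subword position and still contains special-token and padding slots. One way to map the predictions back to the original whitespace-split words (an illustrative follow-up, not part of the shipped package) is via `word_ids()`, keeping the label of each word's first subword, the usual convention for BIO tags:

```python
# word_ids() returns None for [CLS], [SEP], and padding positions.
word_ids = encoding.word_ids(batch_index=0)

word_labels = {}
for position, word_id in enumerate(word_ids):
    if word_id is not None and word_id not in word_labels:
        word_labels[word_id] = predicted_labels[position]  # first subword wins

for word_id, word in enumerate(tokens):
    print(f"{word}\t{word_labels.get(word_id, 'O')}")
```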
### Using the Invoice Extractor Wrapper

For more convenient invoice extraction, use the `InvoiceExtractor` class from the `training` package:
```python
from pathlib import Path
from training.inference import InvoiceExtractor

# Initialize extractor
extractor = InvoiceExtractor(Path("models/final_model"))

# Extract from text (reusing the example invoice from above)
invoice_text = "Invoice INV-2024-001 dated January 15, 2024 for John Smith at 123 Main St. Items: Laptop $999.99, Mouse $29.99. Total: $1,029.98"
result = extractor.extract_from_text(invoice_text)
formatted_result = extractor.format_output(result)
print(formatted_result)

# Output:
# {
#   "invoice_number": "INV-2024-001",
#   "date": "January 15, 2024",
#   "customer": {
#     "name": "John Smith",
#     "address": "123 Main St"
#   },
#   "items": [
#     {"name": "Laptop", "price": "$999.99"},
#     {"name": "Mouse", "price": "$29.99"}
#   ],
#   "total": "$1,029.98"
# }
```
### Batch Processing

```python
texts = [
    "Invoice INV-001...",
    "Invoice INV-002...",
]
results = extractor.extract_batch(texts, batch_size=8)
```
## Hardware Requirements

### Minimum Requirements
- RAM: 4 GB
- Disk: 300 MB (model + dependencies)
- GPU: Optional (recommended for large batches)
### Recommended Configuration
- RAM: 8 GB+
- GPU: NVIDIA GPU with 2GB+ VRAM (CUDA) or Apple Silicon (MPS)
- Disk: 500 MB SSD
### Inference Speed
- CPU: ~100-200 ms per document
- GPU (CUDA): ~20-50 ms per document
- GPU (Apple Silicon MPS): ~30-80 ms per document
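These figures are indicative and will vary with hardware and sequence length. A quick way to measure latency yourself, reusing `model` and `encoding` from the basic-inference example (a sketch, not the script that produced the numbers above):

```python
import time
import torch

# Pick the best available device: CUDA, Apple Silicon MPS, or CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

model.to(device)
inputs = encoding.to(device)

with torch.no_grad():
    model(**inputs)  # warm-up pass
    if device.type == "cuda":
        torch.cuda.synchronize()  # finish queued GPU work before timing
    start = time.perf_counter()
    model(**inputs)
    if device.type == "cuda":
        torch.cuda.synchronize()
    elapsed_ms = (time.perf_counter() - start) * 1000

print(f"{elapsed_ms:.1f} ms per document")
```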
## Limitations
- Model trained on business invoices; may not generalize well to other document types
- Maximum sequence length is 512 tokens (roughly 350-400 English words after subword tokenization)
- Requires preprocessing to extract text from PDF/image format
- Assumes documents are in English
- Best performance with clearly formatted invoices
## Intended Use

### Primary Use Cases
- Automated invoice processing
- Document data extraction pipelines
- Business process automation
- Accounts payable automation
- Invoice digitization
### Out-of-Scope
- General document understanding
- Multilingual document processing
- Scanned/OCR documents (without OCR preprocessing)
- Real-time video/image processing
## Training Details

### Data Preprocessing
- PDF text extraction using PyMuPDF
- Regex-based pattern matching for entity identification
- BIO format conversion for token classification
- Train/validation/test splitting
- Tokenization using DistilBERT tokenizer
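The PDF step relies on PyMuPDF; a minimal sketch of pulling raw text out of an invoice (the file path is illustrative):

```python
import fitz  # PyMuPDF

# Concatenate the plain text of every page in the PDF.
with fitz.open("invoices/sample_invoice.pdf") as doc:
    text = "\n".join(page.get_text() for page in doc)
```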
### Optimization
- Mixed precision training (FP16) when available
- Gradient accumulation for larger effective batch sizes
- Dynamic padding for efficient memory usage
- Learning rate warmup for stable training
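Dynamic padding in a Transformers token-classification pipeline is typically handled by `DataCollatorForTokenClassification`, which pads each batch only to its longest member instead of the full 512 tokens; a minimal sketch:

```python
from transformers import AutoTokenizer, DataCollatorForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Passed to the Trainer as data_collator; label positions are padded
# with -100 so they are ignored by the loss.
collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
```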
### Hardware Used
- Developed on Apple Silicon (M3 Max with 64GB RAM)
- Training time: ~1-3 minutes per epoch
- Total training time: ~10-30 minutes for 10 epochs
### Evaluation Methodology
- Metric: Token-level precision, recall, and F1-score
- Validation Set Size: 15% of dataset
- Evaluation Frequency: Per epoch
- Best Model Selection: Highest F1-score on validation set
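Token-level scores of this kind can be reproduced with scikit-learn once predictions are flattened and padding positions removed (a toy sketch with hand-made labels, not necessarily the exact evaluation code used here):

```python
from sklearn.metrics import precision_recall_fscore_support

# Flat lists of token-level BIO tags, padding already stripped.
y_true = ["B-TOTAL", "I-TOTAL", "O", "B-DATE"]
y_pred = ["B-TOTAL", "I-TOTAL", "O", "B-DATE"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro", zero_division=0
)
print(precision, recall, f1)
```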
## License
This model is released under the MIT License. See LICENSE file for details.
## Citation
If you use this model in your research or application, please cite:
```bibtex
@misc{invoice_extractor_2024,
  title={Invoice Field Extraction with DistilBERT},
  year={2024}
}
```
For the base DistilBERT model, please also cite:
```bibtex
@article{sanh2019distilbert,
  title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
  author={Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
  journal={arXiv preprint arXiv:1910.01108},
  year={2019}
}
```
## Repository
For more information, training code, and examples, visit: Invoice Model Training Repository
## Contact & Support
For issues, questions, or contributions, please open an issue on the repository.
- Model Card Generated: October 2024
- Framework Version: PyTorch 2.x / Transformers 4.57.1
- Compatible With: Hugging Face Transformers, ONNX, TensorFlow 2.x (with conversion)