# Invoice Field Extraction Model

## Overview
This is a DistilBERT-based token classification model fine-tuned for extracting key fields from invoices. The model performs Named Entity Recognition (NER) to identify and extract:
- Invoice metadata: Invoice number, date
- Customer information: Name, address
- Financial details: Total price, individual item prices
- Line items: Product/service names
The model reaches 100% precision and recall on the validation set. Scores this perfect should be read in light of the dataset: the invoices are templated and were annotated automatically with regex patterns (see Dataset and Limitations), so real-world accuracy will depend on how closely your documents resemble the training data.
## Model Details

### Model Architecture
- Base Model: DistilBERT (distilbert-base-uncased)
- Task: Token Classification (Named Entity Recognition)
- Model Size: ~265 MB (safetensors format)
- Architecture Parameters (verifiable from the saved config, as shown below):
  - Hidden dimension: 768
  - Number of layers: 6
  - Number of attention heads: 12
  - Maximum sequence length: 512 tokens
  - Vocabulary size: 30,522
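These values match the `distilbert-base-uncased` defaults and can be double-checked from the saved configuration (a quick sanity check; the attribute names below are the standard `DistilBertConfig` fields):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("path/to/final_model")

# Expected: 768, 6, 12, 512, 30522
print(config.dim, config.n_layers, config.n_heads,
      config.max_position_embeddings, config.vocab_size)
```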
### Training Configuration
- Framework: PyTorch with Hugging Face Transformers
- Optimizer: AdamW with linear warmup
- Training Epochs: 10
- Learning Rate: 2e-5
- Batch Size: 8
- Warmup Steps: 500
- Weight Decay: 0.01
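These hyperparameters map one-to-one onto Hugging Face `TrainingArguments`; a minimal sketch of an equivalent setup (the output path is illustrative, and the actual training script lives in the repository):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="models/invoice-extractor",  # illustrative path
    num_train_epochs=10,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    warmup_steps=500,   # linear warmup; AdamW is the Trainer default optimizer
    weight_decay=0.01,
)
```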
## Performance

### Overall Metrics (Epoch 10 - Final Model)
| Metric | Score |
|---|---|
| Precision | 100% |
| Recall | 100% |
| F1-Score | 1.00 |
| Exact Match Accuracy | 100% |
### Per-Entity Performance
| Entity Type | Precision | Recall | F1-Score |
|---|---|---|---|
| INVOICE_NUMBER | 100% | 100% | 1.00 |
| DATE | 100% | 100% | 1.00 |
| CUSTOMER_NAME | 100% | 100% | 1.00 |
| CUSTOMER_ADDRESS | 100% | 100% | 1.00 |
| ITEM | 100% | 100% | 1.00 |
| ITEM_PRICE | 100% | 100% | 1.00 |
| TOTAL | 100% | 100% | 1.00 |
### Training History
The model showed strong convergence:
- Epoch 1: F1 = 0.055 (initial learning)
- Epoch 7: F1 = 0.665 (rapid improvement phase)
- Epoch 8: F1 = 0.957 (near-optimal performance)
- Epoch 9: F1 = 0.998 (convergence)
- Epoch 10: F1 = 1.000 (final model)
## Label Schema
The model uses BIO (Begin-Inside-Outside) tagging with the following labels:
| Label ID | Label | Description |
|---|---|---|
| 0 | O | Outside - not part of any entity |
| 1 | B-TOTAL | Beginning of total price |
| 2 | I-TOTAL | Inside total price |
| 3 | B-ITEM | Beginning of item/product name |
| 4 | I-ITEM | Inside item/product name |
| 5 | B-ITEM_PRICE | Beginning of item price |
| 6 | I-ITEM_PRICE | Inside item price |
| 7 | B-CUSTOMER_NAME | Beginning of customer name |
| 8 | I-CUSTOMER_NAME | Inside customer name |
| 9 | B-CUSTOMER_ADDRESS | Beginning of customer address |
| 10 | I-CUSTOMER_ADDRESS | Inside customer address |
| 11 | B-DATE | Beginning of invoice date |
| 12 | I-DATE | Inside invoice date |
| 13 | B-INVOICE_NUMBER | Beginning of invoice number |
| 14 | I-INVOICE_NUMBER | Inside invoice number |
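As an illustration, a fragment of the example invoice from the usage section below would carry word-level tags like these (hand-written for clarity, not model output):

```python
# (word, BIO tag) pairs for part of the example invoice text
tagged = [
    ("Invoice", "O"),
    ("INV-2024-001", "B-INVOICE_NUMBER"),
    ("dated", "O"),
    ("January", "B-DATE"),
    ("15,", "I-DATE"),
    ("2024", "I-DATE"),
    ("for", "O"),
    ("John", "B-CUSTOMER_NAME"),
    ("Smith", "I-CUSTOMER_NAME"),
]
```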
## Dataset
- Source: Company Documents Dataset (Kaggle)
- Document Type: Invoice PDFs
- Annotation Method:
  - Automatic annotation using regex patterns (see the sketch after this list)
  - BIO format for token-level labels
  - Manual review and validation
- Train/Val/Test Split: 70/15/15
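A regex-based annotation pass of the kind described above might look like this (the patterns are illustrative placeholders, not the rules used to build the dataset):

```python
import re

# Illustrative patterns keyed by entity type; matches would then be
# converted to B-/I- tags on the overlapping tokens.
PATTERNS = {
    "INVOICE_NUMBER": re.compile(r"\bINV-\d{4}-\d{3}\b"),
    "TOTAL": re.compile(r"Total:\s*\$[\d,]+\.\d{2}"),
}

text = "Invoice INV-2024-001 ... Total: $1,029.98"
for entity, pattern in PATTERNS.items():
    for match in pattern.finditer(text):
        print(entity, match.group(), match.span())
```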
## How to Use

### Basic Inference
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_name = "path/to/final_model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Example invoice text
text = "Invoice INV-2024-001 dated January 15, 2024 for John Smith at 123 Main St. Items: Laptop $999.99, Mouse $29.99. Total: $1,029.98"

# Tokenize and predict
tokens = text.split()
encoding = tokenizer(
    tokens,
    is_split_into_words=True,
    max_length=512,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)
with torch.no_grad():
    outputs = model(**encoding)
predictions = torch.argmax(outputs.logits, dim=-1)

# Get label names
id2label = model.config.id2label
predicted_labels = [id2label[pred.item()] for pred in predictions[0]]
```
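Note that `predicted_labels` is indexed by subword position and still contains special-token and padding slots. One way to map the predictions back to the original whitespace-split words (an illustrative follow-up, not part of the shipped package) is via `word_ids()`, keeping the label of each word's first subword, the usual convention for BIO tags:

```python
# word_ids() returns None for [CLS], [SEP], and padding positions.
word_ids = encoding.word_ids(batch_index=0)

word_labels = {}
for position, word_id in enumerate(word_ids):
    if word_id is not None and word_id not in word_labels:
        word_labels[word_id] = predicted_labels[position]  # first subword wins

for word_id, word in enumerate(tokens):
    print(f"{word}\t{word_labels.get(word_id, 'O')}")
```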
### Using the Invoice Extractor Wrapper

For more convenient invoice extraction, use the `InvoiceExtractor` class from the `training` package:
```python
from pathlib import Path
from training.inference import InvoiceExtractor

# Initialize extractor
extractor = InvoiceExtractor(Path("models/final_model"))

# Extract from text (reusing the example invoice from above)
invoice_text = "Invoice INV-2024-001 dated January 15, 2024 for John Smith at 123 Main St. Items: Laptop $999.99, Mouse $29.99. Total: $1,029.98"
result = extractor.extract_from_text(invoice_text)
formatted_result = extractor.format_output(result)
print(formatted_result)

# Output:
# {
#   "invoice_number": "INV-2024-001",
#   "date": "January 15, 2024",
#   "customer": {
#     "name": "John Smith",
#     "address": "123 Main St"
#   },
#   "items": [
#     {"name": "Laptop", "price": "$999.99"},
#     {"name": "Mouse", "price": "$29.99"}
#   ],
#   "total": "$1,029.98"
# }
```
### Batch Processing

```python
texts = [
    "Invoice INV-001...",
    "Invoice INV-002...",
]
results = extractor.extract_batch(texts, batch_size=8)
```
## Hardware Requirements

### Minimum Requirements
- RAM: 4 GB
- Disk: 300 MB (model + dependencies)
- GPU: Optional (recommended for large batches)
### Recommended Configuration
- RAM: 8 GB+
- GPU: NVIDIA GPU with 2GB+ VRAM (CUDA) or Apple Silicon (MPS)
- Disk: 500 MB SSD
### Inference Speed
- CPU: ~100-200 ms per document
- GPU (CUDA): ~20-50 ms per document
- GPU (Apple Silicon MPS): ~30-80 ms per document
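These figures are indicative and will vary with hardware and sequence length. A quick way to measure latency yourself, reusing `model` and `encoding` from the basic-inference example (a sketch, not the script that produced the numbers above):

```python
import time
import torch

# Pick the best available device: CUDA, Apple Silicon MPS, or CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

model.to(device)
inputs = encoding.to(device)

with torch.no_grad():
    model(**inputs)  # warm-up pass
    if device.type == "cuda":
        torch.cuda.synchronize()  # finish queued GPU work before timing
    start = time.perf_counter()
    model(**inputs)
    if device.type == "cuda":
        torch.cuda.synchronize()
    elapsed_ms = (time.perf_counter() - start) * 1000

print(f"{elapsed_ms:.1f} ms per document")
```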
## Limitations
- Model trained on business invoices; may not generalize well to other document types
- Maximum sequence length is 512 tokens (roughly 350-400 English words after subword tokenization)
- Requires preprocessing to extract text from PDF/image format
- Assumes documents are in English
- Best performance with clearly formatted invoices
## Intended Use

### Primary Use Cases
- Automated invoice processing
- Document data extraction pipelines
- Business process automation
- Accounts payable automation
- Invoice digitization
### Out-of-Scope
- General document understanding
- Multilingual document processing
- Scanned/OCR documents (without OCR preprocessing)
- Real-time video/image processing
## Training Details

### Data Preprocessing
- PDF text extraction using PyMuPDF
- Regex-based pattern matching for entity identification
- BIO format conversion for token classification
- Train/validation/test splitting
- Tokenization using DistilBERT tokenizer
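The PDF step relies on PyMuPDF; a minimal sketch of pulling raw text out of an invoice (the file path is illustrative):

```python
import fitz  # PyMuPDF

# Concatenate the plain text of every page in the PDF.
with fitz.open("invoices/sample_invoice.pdf") as doc:
    text = "\n".join(page.get_text() for page in doc)
```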
### Optimization
- Mixed precision training (FP16) when available
- Gradient accumulation for larger effective batch sizes
- Dynamic padding for efficient memory usage
- Learning rate warmup for stable training
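Dynamic padding in a Transformers token-classification pipeline is typically handled by `DataCollatorForTokenClassification`, which pads each batch only to its longest member instead of the full 512 tokens; a minimal sketch:

```python
from transformers import AutoTokenizer, DataCollatorForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Passed to the Trainer as data_collator; label positions are padded
# with -100 so they are ignored by the loss.
collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
```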
### Hardware Used
- Developed on Apple Silicon (M3 Max with 64GB RAM)
- Training time: ~1-3 minutes per epoch
- Total training time: ~10-30 minutes for 10 epochs
### Evaluation Methodology
- Metric: Token-level precision, recall, and F1-score
- Validation Set Size: 15% of dataset
- Evaluation Frequency: Per epoch
- Best Model Selection: Highest F1-score on validation set
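Token-level scores of this kind can be reproduced with scikit-learn once predictions are flattened and padding positions removed (a toy sketch with hand-made labels, not necessarily the exact evaluation code used here):

```python
from sklearn.metrics import precision_recall_fscore_support

# Flat lists of token-level BIO tags, padding already stripped.
y_true = ["B-TOTAL", "I-TOTAL", "O", "B-DATE"]
y_pred = ["B-TOTAL", "I-TOTAL", "O", "B-DATE"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro", zero_division=0
)
print(precision, recall, f1)
```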
## License
This model is released under the MIT License. See LICENSE file for details.
## Citation
If you use this model in your research or application, please cite:
```bibtex
@misc{invoice_extractor_2024,
  title={Invoice Field Extraction with DistilBERT},
  year={2024}
}
```
For the base DistilBERT model, please also cite:
```bibtex
@article{sanh2019distilbert,
  title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
  author={Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
  journal={arXiv preprint arXiv:1910.01108},
  year={2019}
}
```
## Repository
For more information, training code, and examples, visit: Invoice Model Training Repository
## Contact & Support
For issues, questions, or contributions, please open an issue on the repository.
- Model Card Generated: October 2024
- Framework Version: PyTorch 2.x / Transformers 4.57.1
- Compatible With: Hugging Face Transformers, ONNX, TensorFlow 2.x (with conversion)