Invoice Field Extraction Model

Overview

This is a DistilBERT-based token classification model fine-tuned for extracting key fields from invoices. The model performs Named Entity Recognition (NER) to identify and extract:

  • Invoice metadata: Invoice number, date
  • Customer information: Name, address
  • Financial details: Total price, individual item prices
  • Line items: Product/service names

The model achieves 100% precision and recall on the held-out validation set, as detailed under Performance below.

Model Details

Model Architecture

  • Base Model: DistilBERT (distilbert-base-uncased)
  • Task: Token Classification (Named Entity Recognition)
  • Model Size: ~66.4M parameters, ~265 MB on disk (safetensors, FP32)
  • Architecture Parameters:
    • Hidden dimension: 768
    • Number of layers: 6
    • Number of attention heads: 12
    • Maximum sequence length: 512 tokens
    • Vocabulary size: 30,522

Training Configuration

  • Framework: PyTorch with Hugging Face Transformers
  • Optimizer: AdamW with linear warmup
  • Training Epochs: 10
  • Learning Rate: 2e-5
  • Batch Size: 8
  • Warmup Steps: 500
  • Weight Decay: 0.01
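These hyperparameters map naturally onto the Hugging Face Trainer API. A minimal sketch, assuming the standard TrainingArguments class; the output directory and best-model selection settings are illustrative, not taken from the actual training script:

from transformers import TrainingArguments

# Illustrative sketch only: the hyperparameters listed above expressed as
# TrainingArguments; output_dir and the best-model settings are assumptions
training_args = TrainingArguments(
    output_dir="models/final_model",    # placeholder path
    num_train_epochs=10,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    warmup_steps=500,                   # linear warmup to the peak LR
    weight_decay=0.01,
    eval_strategy="epoch",              # evaluate once per epoch
    save_strategy="epoch",
    load_best_model_at_end=True,        # keep the best checkpoint by F1
    metric_for_best_model="f1",
)

Note that AdamW with a linear schedule is the Trainer default, so the optimizer needs no explicit configuration here.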

Performance

Overall Metrics (Epoch 10 - Final Model)

Metric                Score
--------------------  -----
Precision             100%
Recall                100%
F1-Score              1.00
Exact Match Accuracy  100%

Per-Entity Performance

Entity Type       Precision  Recall  F1-Score
----------------  ---------  ------  --------
INVOICE_NUMBER    100%       100%    1.00
DATE              100%       100%    1.00
CUSTOMER_NAME     100%       100%    1.00
CUSTOMER_ADDRESS  100%       100%    1.00
ITEM              100%       100%    1.00
ITEM_PRICE        100%       100%    1.00
TOTAL             100%       100%    1.00

Training History

The model showed strong convergence:

  • Epoch 1: F1 = 0.055 (initial learning)
  • Epoch 7: F1 = 0.665 (rapid improvement phase)
  • Epoch 8: F1 = 0.957 (near-optimal performance)
  • Epoch 9: F1 = 0.998 (convergence)
  • Epoch 10: F1 = 1.000 (final model)

Label Schema

The model uses BIO (Begin-Inside-Outside) tagging with the following labels:

Label ID  Label               Description
--------  ------------------  ---------------------------------
0         O                   Outside - not part of any entity
1         B-TOTAL             Beginning of total price
2         I-TOTAL             Inside total price
3         B-ITEM              Beginning of item/product name
4         I-ITEM              Inside item/product name
5         B-ITEM_PRICE        Beginning of item price
6         I-ITEM_PRICE        Inside item price
7         B-CUSTOMER_NAME     Beginning of customer name
8         I-CUSTOMER_NAME     Inside customer name
9         B-CUSTOMER_ADDRESS  Beginning of customer address
10        I-CUSTOMER_ADDRESS  Inside customer address
11        B-DATE              Beginning of invoice date
12        I-DATE              Inside invoice date
13        B-INVOICE_NUMBER    Beginning of invoice number
14        I-INVOICE_NUMBER    Inside invoice number
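Downstream code has to merge these token-level tags into entity spans: a new span starts at each B- tag and is extended by consecutive I- tags of the same type. A minimal sketch of such a decoder, assuming word-aligned labels (the function name is illustrative, not part of the released code):

def bio_to_spans(words, labels):
    """Group word-aligned BIO labels into (entity_type, text) spans."""
    spans, current_type, current_words = [], None, []
    for word, label in zip(words, labels):
        starts_new = label.startswith("B-") or (
            label.startswith("I-") and label[2:] != current_type
        )
        if starts_new:
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = label[2:], [word]
        elif label.startswith("I-"):
            current_words.append(word)  # continue the open span
        else:  # "O" closes any open span
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        spans.append((current_type, " ".join(current_words)))
    return spans

# e.g. bio_to_spans(["John", "Smith"], ["B-CUSTOMER_NAME", "I-CUSTOMER_NAME"])
# returns [("CUSTOMER_NAME", "John Smith")]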

Dataset

  • Source: Company Documents Dataset (Kaggle)
  • Document Type: Invoice PDFs
  • Annotation Method:
    • Automatic annotation using regex patterns (see the sketch after this list)
    • BIO format for token-level labels
    • Manual review and validation
  • Train/Val/Test Split: 70/15/15
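The exact annotation patterns are not published. As a hedged illustration only, regex-based auto-annotation for two of the entity types might look like the following; both patterns are assumptions modeled on the example invoice text elsewhere in this card:

import re

# Hypothetical annotation patterns; the real dataset patterns may differ
PATTERNS = {
    "INVOICE_NUMBER": re.compile(r"\b(INV-\d{4}-\d{3})\b"),
    "TOTAL": re.compile(r"Total:\s*(\$[\d,]+\.\d{2})"),
}

def find_entities(text):
    """Return (start, end, label) character spans for each pattern match."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            # use group 1 so prefixes like "Total:" stay outside the span
            spans.append((match.start(1), match.end(1), label))
    return sorted(spans)

Character spans produced this way are then converted to token-level BIO labels before manual review.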

How to Use

Basic Inference

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_name = "path/to/final_model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Example invoice text
text = "Invoice INV-2024-001 dated January 15, 2024 for John Smith at 123 Main St. Items: Laptop $999.99, Mouse $29.99. Total: $1,029.98"

# Tokenize and predict
tokens = text.split()
encoding = tokenizer(
    tokens,
    is_split_into_words=True,
    max_length=512,
    truncation=True,
    padding="max_length",
    return_tensors="pt"
)

with torch.no_grad():
    outputs = model(**encoding)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Map label IDs to names; these are per subword token (including special
# tokens and padding), so align them back to words before reading entities
id2label = model.config.id2label
predicted_labels = [id2label[pred.item()] for pred in predictions[0]]
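Because DistilBERT splits words into subwords, predicted_labels has one entry per subword position, not one per input word. A small follow-up sketch that keeps the label of each word's first subword (a common convention for BIO tags), using the fast tokenizer's word_ids():

# Align subword predictions back to the input words: keep only the label
# of each word's first subword; None marks special tokens and padding
word_ids = encoding.word_ids(batch_index=0)
seen = set()
for position, word_id in enumerate(word_ids):
    if word_id is None or word_id in seen:
        continue
    seen.add(word_id)
    print(tokens[word_id], predicted_labels[position])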

Using the Invoice Extractor Wrapper

For convenient end-to-end extraction, use the InvoiceExtractor class from the training package:

from pathlib import Path
from training.inference import InvoiceExtractor

# Initialize extractor
extractor = InvoiceExtractor(Path("models/final_model"))

# Extract from text
result = extractor.extract_from_text(invoice_text)
formatted_result = extractor.format_output(result)

print(formatted_result)
# Output:
# {
#   "invoice_number": "INV-2024-001",
#   "date": "January 15, 2024",
#   "customer": {
#     "name": "John Smith",
#     "address": "123 Main St"
#   },
#   "items": [
#     {"name": "Laptop", "price": "$999.99"},
#     {"name": "Mouse", "price": "$29.99"}
#   ],
#   "total": "$1,029.98"
# }

Batch Processing

texts = [
    "Invoice INV-001...",
    "Invoice INV-002...",
]

results = extractor.extract_batch(texts, batch_size=8)

Hardware Requirements

Minimum Requirements

  • RAM: 4 GB
  • Disk: 300 MB (model + dependencies)
  • GPU: Optional (recommended for large batches)

Recommended Configuration

  • RAM: 8 GB+
  • GPU: NVIDIA GPU with 2GB+ VRAM (CUDA) or Apple Silicon (MPS)
  • Disk: 500 MB SSD

Inference Speed

  • CPU: ~100-200 ms per document
  • GPU (CUDA): ~20-50 ms per document
  • GPU (Apple Silicon MPS): ~30-80 ms per document

Limitations

  • Model trained on business invoices; may not generalize well to other document types
  • Maximum sequence length is 512 tokens (roughly 350-400 English words); longer documents are truncated
  • Requires preprocessing to extract text from PDF/image format
  • Assumes documents are in English
  • Best performance with clearly formatted invoices

Intended Use

Primary Use Cases

  • Automated invoice processing
  • Document data extraction pipelines
  • Business process automation
  • Accounts payable automation
  • Invoice digitization

Out-of-Scope

  • General document understanding
  • Multilingual document processing
  • Scanned/OCR documents (without OCR preprocessing)
  • Real-time video/image processing

Training Details

Data Preprocessing

  1. PDF text extraction using PyMuPDF (sketched after this list)
  2. Regex-based pattern matching for entity identification
  3. BIO format conversion for token classification
  4. Train/validation/test splitting
  5. Tokenization using DistilBERT tokenizer
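As a minimal sketch of step 1, text extraction with PyMuPDF; the file path is a placeholder:

import fitz  # PyMuPDF

# Extract plain text from every page of an invoice PDF; real pipelines
# usually add per-page cleanup before annotation
doc = fitz.open("invoices/sample_invoice.pdf")
text = "\n".join(page.get_text() for page in doc)
doc.close()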

Optimization

  • Mixed precision training (FP16) when available
  • Gradient accumulation for larger effective batch sizes
  • Dynamic padding for efficient memory usage (see the collator sketch below)
  • Learning rate warmup for stable training
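Dynamic padding for token classification is usually handled by a data collator that pads each batch only to its longest sequence. A minimal sketch, assuming the standard Hugging Face collator:

from transformers import DataCollatorForTokenClassification

# Pads each batch to its longest example instead of the full 512-token
# maximum; label vectors are padded with -100 so the loss ignores them
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)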

Hardware Used

  • Developed on Apple Silicon (M3 Max with 64GB RAM)
  • Training time: ~1-3 minutes per epoch
  • Total training time: ~10-30 minutes for 10 epochs

Evaluation Methodology

  • Metric: Token-level precision, recall, and F1-score (see the scoring sketch below)
  • Validation Set Size: 15% of dataset
  • Evaluation Frequency: Per epoch
  • Best Model Selection: Highest F1-score on validation set
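A minimal sketch of token-level scoring with scikit-learn, assuming flattened gold and predicted BIO labels with padding positions already removed; the card does not show the actual evaluation code:

from sklearn.metrics import precision_recall_fscore_support

# Micro-averaged token-level precision/recall/F1 over flattened BIO labels;
# the label lists here are tiny placeholders
y_true = ["B-TOTAL", "I-TOTAL", "O", "B-DATE"]
y_pred = ["B-TOTAL", "I-TOTAL", "O", "B-DATE"]
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro"
)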

License

This model is released under the MIT License. See LICENSE file for details.

Citation

If you use this model in your research or application, please cite:

@misc{invoice_extractor_2024,
  title={Invoice Field Extraction with DistilBERT},
  year={2024}
}

For the base DistilBERT model, please also cite:

@article{sanh2019distilbert,
  title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
  author={Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
  journal={arXiv preprint arXiv:1910.01108},
  year={2019}
}

Repository

For more information, training code, and examples, visit: Invoice Model Training Repository

Contact & Support

For issues, questions, or contributions, please open an issue on the repository.


Model Card Generated: October 2024
Framework Version: PyTorch 2.x / Transformers 4.57.1
Compatible With: Hugging Face Transformers, ONNX, TensorFlow 2.x (with conversion)
