mmBERT-32K PII Detector (Merged)

A fully merged model for PII (Personally Identifiable Information) detection, ready for direct inference with no separate adapter. Based on mmBERT-32K-YaRN with 32K-token context support.

Model Details

| Property | Value |
|---|---|
| Base Model | llm-semantic-router/mmbert-32k-yarn |
| Architecture | ModernBERT (Flash Attention 2) |
| Parameters | 307M |
| Task | Token Classification (NER) |
| Max Context | 32,768 tokens |
| Entity Types | 17 PII types (35 BIO labels) |

Supported PII Types

  • PERSON - Person names (98.7% accuracy)
  • EMAIL_ADDRESS - Email addresses (95%+ accuracy)
  • PHONE_NUMBER - Phone numbers (99.1% accuracy)
  • STREET_ADDRESS - Street addresses (95.9% accuracy)
  • CREDIT_CARD - Credit card numbers (84% accuracy)
  • US_SSN - US Social Security Numbers
  • US_DRIVER_LICENSE - US Driver License numbers
  • IBAN_CODE - International Bank Account Numbers
  • IP_ADDRESS - IP addresses
  • DATE_TIME - Dates and times
  • AGE - Age information
  • ORGANIZATION - Organization names
  • GPE - Geopolitical entities
  • ZIP_CODE - ZIP/postal codes
  • DOMAIN_NAME - Domain names
  • NRP - Nationalities, religious or political groups
  • TITLE - Titles (Mr., Dr., etc.)
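The 35-label figure follows from the BIO tagging scheme: each of the 17 PII types gets a B- (begin) and I- (inside) tag, plus a single O (outside) tag. A minimal sketch of that label space (the ordering in the model's actual config may differ):

```python
# The 17 PII types listed above; "O" marks non-PII tokens.
PII_TYPES = [
    "PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "STREET_ADDRESS",
    "CREDIT_CARD", "US_SSN", "US_DRIVER_LICENSE", "IBAN_CODE",
    "IP_ADDRESS", "DATE_TIME", "AGE", "ORGANIZATION", "GPE",
    "ZIP_CODE", "DOMAIN_NAME", "NRP", "TITLE",
]

# 1 outside tag + (begin + inside) per type = 1 + 17 * 2 = 35 labels.
LABELS = ["O"] + [f"{prefix}-{t}" for t in PII_TYPES for prefix in ("B", "I")]
print(len(LABELS))  # 35
```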

Training

  • Dataset: Microsoft Presidio research dataset
  • Epochs: 5
  • Batch Size: 16
  • Learning Rate: 1e-4
  • LoRA Rank: 32 (merged into full model)

Usage

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model = AutoModelForTokenClassification.from_pretrained(
    "llm-semantic-router/mmbert32k-pii-detector-merged"
)
tokenizer = AutoTokenizer.from_pretrained(
    "llm-semantic-router/mmbert32k-pii-detector-merged"
)

text = "My email is john.smith@example.com and phone is 555-123-4567"
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=2)

# Map predicted ids back to labels (config.id2label uses integer keys)
id2label = model.config.id2label
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions[0]):
    label = id2label[pred.item()]
    if label != "O":
        print(f"{token}: {label}")
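The loop above prints one label per subword token. To recover full entity spans, consecutive B-/I- tags of the same type are typically grouped. A minimal grouping sketch over toy (token, label) pairs; `group_entities` is an illustrative helper, not part of this repository:

```python
def group_entities(tagged_tokens):
    """Group consecutive B-/I- tagged tokens into (entity_type, text) spans."""
    spans, current_type, current_toks = [], None, []
    for token, label in tagged_tokens:
        if label.startswith("B-"):
            if current_type:
                spans.append((current_type, " ".join(current_toks)))
            current_type, current_toks = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_toks.append(token)
        else:
            if current_type:
                spans.append((current_type, " ".join(current_toks)))
            current_type, current_toks = None, []
    if current_type:
        spans.append((current_type, " ".join(current_toks)))
    return spans

example = [
    ("My", "O"), ("email", "O"), ("is", "O"),
    ("john", "B-EMAIL_ADDRESS"), (".smith@example.com", "I-EMAIL_ADDRESS"),
]
print(group_entities(example))  # [('EMAIL_ADDRESS', 'john .smith@example.com')]
```

In practice, the transformers `pipeline("token-classification", ...)` with `aggregation_strategy="simple"` performs this grouping using character offsets, which also avoids subword-joining artifacts.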

Part of vLLM Semantic Router

This model is part of the vLLM Semantic Router Mixture-of-Models (MoM) family for intelligent LLM request routing.

License

MIT License
