# Modality Router (Merged) - Smart Output Modality Selection
Part of the MoM (Mixture of Models) family for vLLM Semantic Router.
This is the merged, ready-to-use version of mmbert32k-modality-router-lora: the LoRA weights have been folded into the mmbert-32k-yarn base model so the classifier can be deployed without the PEFT dependency (a merge sketch follows the table below).
A text classifier based on ModernBERT (307M params, 32K context, 1800+ languages) that determines the appropriate response modality for user prompts:
| Label | Description | Routed To | Example |
|---|---|---|---|
| AR | Text-only response | Autoregressive LLM (e.g., Llama, Qwen) | "What is the capital of France?" |
| DIFFUSION | Image generation | Diffusion model (e.g., Flux, SDXL) | "A cyberpunk city at night, neon lights" |
| BOTH | Text + image response | Both AR + Diffusion pipeline | "Explain photosynthesis and show a diagram" |
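For reference, a merged checkpoint like this one is typically produced from the adapter with PEFT's `merge_and_unload()`. The exact merge script for this model is not published, so the sketch below (including the output path) is an assumption:

```python
# Sketch: merging a LoRA adapter into its base model with PEFT (assumed workflow).
from transformers import AutoModelForSequenceClassification
from peft import PeftModel

base = AutoModelForSequenceClassification.from_pretrained(
    "llm-semantic-router/mmbert-32k-yarn", num_labels=3
)
model = PeftModel.from_pretrained(base, "llm-semantic-router/mmbert32k-modality-router-lora")
merged = model.merge_and_unload()  # folds the LoRA deltas into the base weights
merged.save_pretrained("./modality-router-merged")  # illustrative output path
```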
## Quick Start

### Pipeline API (simplest)

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="llm-semantic-router/mmbert32k-modality-router-merged",
)

results = classifier([
    "What are the benefits of exercise?",
    "A serene Japanese garden with cherry blossoms, watercolor style",
    "Explain how neural networks work and generate a diagram",
])
for r in results:
    print(f"{r['label']}: {r['score']:.3f}")
# AR: 0.995
# DIFFUSION: 0.717
# BOTH: 0.978
```
### Direct Model Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained(
    "llm-semantic-router/mmbert32k-modality-router-merged"
)
tokenizer = AutoTokenizer.from_pretrained(
    "llm-semantic-router/mmbert32k-modality-router-merged"
)

prompts = [
    "Summarize the key points of quantum computing",
    "portrait of a woman in renaissance style, oil painting, dramatic lighting",
    "Write a blog post about climate change and include relevant charts",
]

model.eval()
# 512 tokens is ample for routing prompts; the model itself accepts up to 32K.
inputs = tokenizer(prompts, return_tensors="pt", truncation=True, padding=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)

labels = model.config.id2label
for prompt, pred_id in zip(prompts, predictions):
    print(f"{labels[pred_id.item()]}: {prompt[:60]}...")
# AR: Summarize the key points of quantum computing...
# DIFFUSION: portrait of a woman in renaissance style, oil painting, d...
# BOTH: Write a blog post about climate change and include releva...
```
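If the router needs full per-class probabilities rather than just the argmax (e.g., for thresholding), apply a softmax over the logits. A small continuation of the snippet above:

```python
import torch.nn.functional as F

# Per-class probabilities; each row sums to 1 across AR / DIFFUSION / BOTH.
probs = F.softmax(outputs.logits, dim=-1)
for prompt, row in zip(prompts, probs):
    dist = {model.config.id2label[i]: round(row[i].item(), 3) for i in range(row.numel())}
    print(dist, "<-", prompt[:40])
```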
### Integration with vLLM Semantic Router

```python
# Example: route requests to different model backends.
# call_llm_backend, call_diffusion_backend, and combine_response are
# placeholders for your own serving hooks.
def route_request(prompt: str, classifier) -> str:
    """Route a user prompt to the appropriate model backend."""
    result = classifier(prompt)[0]
    modality = result["label"]
    confidence = result["score"]  # available for thresholding (see sketch below)

    if modality == "AR":
        return call_llm_backend(prompt)        # e.g., Llama, Qwen
    elif modality == "DIFFUSION":
        return call_diffusion_backend(prompt)  # e.g., Flux, SDXL
    else:  # BOTH
        text = call_llm_backend(prompt)
        image = call_diffusion_backend(prompt)
        return combine_response(text, image)
```
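Classifier scores well below 1.0 do occur (see the watercolor example above at 0.717), so a production router may want a confidence floor with a cheap text-only fallback. A sketch, where the threshold value and fallback policy are assumptions to calibrate on your own traffic:

```python
FALLBACK_THRESHOLD = 0.7  # assumed value; tune on held-out traffic

def route_with_fallback(prompt: str, classifier) -> str:
    """Route as above, but avoid expensive diffusion calls when the
    classifier is not confident about the predicted modality."""
    result = classifier(prompt)[0]
    if result["score"] < FALLBACK_THRESHOLD:
        return call_llm_backend(prompt)  # cheapest, safest default
    return route_request(prompt, classifier)
```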
### ONNX Runtime (for production latency)
The base model (mmbert-32k-yarn) supports ONNX export for sub-5ms inference on AMD MI300X GPUs.
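If your stack uses Hugging Face Optimum, the merged checkpoint can be exported and served in one step. A minimal sketch, assuming the `optimum[onnxruntime]` extra is installed and that your Optimum version supports exporting this architecture:

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

repo = "llm-semantic-router/mmbert32k-modality-router-merged"
# export=True converts the PyTorch checkpoint to ONNX on the fly.
ort_model = ORTModelForSequenceClassification.from_pretrained(repo, export=True)
tokenizer = AutoTokenizer.from_pretrained(repo)

classifier = pipeline("text-classification", model=ort_model, tokenizer=tokenizer)
print(classifier("A watercolor painting of a lighthouse at dawn"))
```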
## Model Details
| Property | Value |
|---|---|
| Base model | llm-semantic-router/mmbert-32k-yarn (307M params) |
| Architecture | ModernBERT + YaRN RoPE scaling |
| Context length | 32,768 tokens |
| Languages | 1800+ (Gemma 2 tokenizer, 256K vocab) |
| Fine-tuning | LoRA (rank=16, alpha=32) merged into base weights |
| Classes | 3 (AR, DIFFUSION, BOTH) |
| Model size | ~1.23 GB (safetensors) |
## Training Configuration
| Parameter | Value |
|---|---|
| Epochs | 10 |
| Batch size | 32 |
| Learning rate | 2e-5 |
| Weight decay | 0.15 (adaptive) |
| Loss function | Focal Loss (gamma=2.0) |
| Class weighting | Inverse-frequency (sqrt-dampened) |
| Minority oversampling | Yes |
| LoRA target modules | attn.Wqkv, attn.Wo, mlp.Wi, mlp.Wo |
| Hardware | AMD Instinct MI300X (192GB VRAM) |
| Training time | ~2 minutes |
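For readers reproducing the recipe: focal loss scales cross-entropy by (1 - p_t)^gamma so confidently-correct examples contribute little, and sqrt-dampening softens the inverse-frequency class weights. The training code is not published, so the sketch below (including the class counts) is an assumed implementation of those two rows of the table:

```python
import torch
import torch.nn.functional as F

def sqrt_dampened_weights(counts: torch.Tensor) -> torch.Tensor:
    """Inverse-frequency class weights, sqrt-dampened and normalized to mean 1."""
    w = torch.sqrt(counts.sum() / counts)
    return w / w.mean()

def focal_loss(logits, targets, weights, gamma: float = 2.0):
    """Focal loss (Lin et al., 2017) with per-class weighting."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)  # probability the model assigns to the true class
    weighted_ce = F.cross_entropy(logits, targets, weight=weights, reduction="none")
    return ((1.0 - p_t) ** gamma * weighted_ce).mean()

# Illustrative class counts for AR / DIFFUSION / BOTH (not the real dataset sizes)
weights = sqrt_dampened_weights(torch.tensor([60_000.0, 80_000.0, 4_000.0]))
```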
## Training Data
Trained on a curated combination of 10 public datasets covering diverse prompt styles:
### DIFFUSION class
- Gustavosta/Stable-Diffusion-Prompts - 80K curated SD prompts
- FredZhang7/stable-diffusion-prompts-2.47M - 2.47M SD prompts
- nateraw/parti-prompts - Google Parti benchmark
- fal/image-generation-prompts - Diverse image prompts
- allenai/WildChat (mined) - Real user image requests
### AR class
- OpenAssistant/oasst2 - 135K instruction conversations
- tatsu-lab/alpaca - 52K Stanford instructions
- databricks/databricks-dolly-15k - 15K categorized instructions
- stingning/ultrachat - 1.5M multi-turn conversations
- allenai/WildChat (mined) - Real user text prompts
### BOTH class
- mqliu/InterleavedBench - Gold-standard interleaved text+image (EMNLP 2024)
- allenai/WildChat (mined) - Real user multimodal prompts
- Curated seed examples (40+ across diverse domains)
## Evaluation Results
| Metric | Value |
|---|---|
| Accuracy | 0.9686 |
| F1 (weighted) | 0.9686 |
| Eval Loss | 0.0435 |
### Per-class Performance
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| AR | 0.956 | 0.967 | 0.962 |
| DIFFUSION | 0.974 | 0.979 | 0.977 |
| BOTH | 0.983 | 0.951 | 0.967 |
## Example Classifications
| Prompt | Predicted | Confidence |
|---|---|---|
| "What is the capital of France?" | AR | 0.995 |
| "A serene Japanese garden with cherry blossoms, watercolor style" | DIFFUSION | 0.717 |
| "Explain how neural networks work and generate a diagram" | BOTH | 0.978 |
| "Write me a poem about autumn" | AR | 0.864 |
| "cyberpunk cityscape, 4k, artstation, trending" | DIFFUSION | 0.971 |
| "Create a travel guide for Tokyo with photos of each location" | BOTH | 0.935 |
## Intended Use
This model is designed for routing LLM requests in multi-model serving systems:
- Smart Output Modality Selection: Automatically determine whether a user query needs text, image, or both
- Automatic Paradigm Routing: Route requests to the right backend (AR LLM vs Diffusion model vs both)
- Cost Optimization: Avoid sending simple text queries to expensive image generation pipelines
- Latency Reduction: Skip unnecessary model invocations by predicting the needed output type upfront
## Limitations
- Single-turn prompt classification only (no conversation context); see the sketch after this list for handling multi-turn chats
- Primarily trained on English data (multilingual capability inherited from base model)
- Not designed for content moderation or safety classification
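A practical consequence of the first limitation: in multi-turn systems, classify only the most recent user message rather than the whole conversation. A sketch reusing the `classifier` pipeline from Quick Start (the OpenAI-style message schema is an assumption):

```python
def latest_user_turn(messages: list[dict]) -> str:
    """Return the most recent user message from an OpenAI-style chat history."""
    for m in reversed(messages):
        if m.get("role") == "user":
            return m.get("content", "")
    return ""

conversation = [
    {"role": "user", "content": "Tell me about the Edo period."},
    {"role": "assistant", "content": "The Edo period (1603-1868) was..."},
    {"role": "user", "content": "Now show me a ukiyo-e style scene from it."},
]
print(classifier(latest_user_turn(conversation))[0])  # expected: DIFFUSION
```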
## Related Models
| Model | Description |
|---|---|
| mmbert32k-modality-router-lora | LoRA adapter version (for further fine-tuning) |
| mmbert-32k-yarn | Base model (307M, 32K context, 1800+ languages) |
| mmbert32k-intent-classifier-merged | Intent classifier (MoM family) |
| mmbert32k-jailbreak-detector-merged | Jailbreak detector (MoM family) |
| mmbert32k-pii-detector-merged | PII detector (MoM family) |
## Citation

```bibtex
@misc{modality-router-2025,
  title={Modality Router: Smart Output Modality Selection for Multi-Model Serving},
  author={vLLM Semantic Router Team},
  year={2025},
  url={https://huggingface.co/llm-semantic-router/mmbert32k-modality-router-merged}
}
```
## Framework Versions
- Transformers: 4.57.6
- PyTorch: 2.9.1
- Safetensors: 0.5.x
- Python: 3.12