NetraEmbed


NetraEmbed is a state-of-the-art multilingual multimodal embedding model for visual document retrieval, built on a Gemma3 backbone and trained with Matryoshka representation learning.

Model Description

NetraEmbed is a multilingual multimodal embedding model that encodes both visual documents and text queries into single dense vectors. It supports multiple languages and enables efficient similarity search at multiple embedding dimensions (768, 1536, 2560) through Matryoshka representation learning.

  • Model Type: Multilingual Multimodal Embedding Model with Matryoshka embeddings
  • Architecture: BiEncoder with Gemma3-4B backbone
  • Embedding Dimensions: 768, 1536, 2560 (Matryoshka)
  • Capabilities: Multilingual, Multimodal (Vision + Text)
  • Use Case: Visual document retrieval, multilingual semantic search, cross-lingual document understanding

Paper

📄 M3DR: Towards Universal Multilingual Multimodal Document Retrieval

Installation

pip install git+https://github.com/adithya-s-k/colpali.git
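
A quick way to confirm the install worked is that the classes used in the Quick Start below import cleanly:

# Should succeed after installing the fork above
from colpali_engine.models import BiGemma3, BiGemmaProcessor3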

Quick Start

import torch
from PIL import Image
from colpali_engine.models import BiGemma3, BiGemmaProcessor3

# Load model and processor
model_name = "Cognitive-Lab/NetraEmbed"

# Load model once (supports all Matryoshka dimensions)
model = BiGemma3.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
processor = BiGemmaProcessor3.from_pretrained(model_name)

# Load your images
images = [
    Image.open("document1.jpg"),
    Image.open("document2.jpg"),
]

# Define queries
queries = [
    "What is the total revenue?",
    "Show me the organizational chart",
]

# Process and encode
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_texts(queries).to(model.device)

# Choose embedding dimension at inference time: 768, 1536, or 2560
# Use lower dims for faster search, higher for better accuracy
embedding_dim = 1536  # Balanced performance

with torch.no_grad():
    image_embeddings = model(**batch_images, embedding_dim=embedding_dim)  # Shape: (num_images, embedding_dim)
    query_embeddings = model(**batch_queries, embedding_dim=embedding_dim)  # Shape: (num_queries, embedding_dim)

# Compute similarity scores using cosine similarity
scores = processor.score(
    qs=query_embeddings,
    ps=image_embeddings,
)  # Shape: (num_queries, num_images)

# Get best matches
for i, query in enumerate(queries):
    best_idx = scores[i].argmax().item()
    print(f"Query: '{query}' -> Best match: Image {best_idx + 1} (score: {scores[i, best_idx]:.4f})")

Testing Multiple Dimensions

You can test different embedding dimensions without reloading the model:

# Load model once
model = BiGemma3.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

# Test all Matryoshka dimensions
for embedding_dim in [768, 1536, 2560]:
    print(f"\nTesting dimension: {embedding_dim}")

    with torch.no_grad():
        image_embeddings = model(**batch_images, embedding_dim=embedding_dim)
        query_embeddings = model(**batch_queries, embedding_dim=embedding_dim)

    scores = processor.score(qs=query_embeddings, ps=image_embeddings)
    print(f"Scores shape: {scores.shape}")
    print(f"Best match score: {scores.max().item():.4f}")

Matryoshka Embeddings

NetraEmbed supports three embedding dimensions that can be selected at inference time:

Dimension   Use Case                   Speed   Accuracy
768         Fast search, large-scale   ⚡⚡⚡     ⭐⭐
1536        Balanced performance       ⚡⚡      ⭐⭐⭐
2560        Maximum accuracy           ⚡       ⭐⭐⭐⭐

Key Advantage: Load the model once and choose the embedding dimension at inference time; there is no need to reload the model to switch between speed and accuracy trade-offs.
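
Matryoshka training typically makes the lower-dimensional embedding a re-normalized prefix of the full vector, which is what lets one full-dimension index be downsized after the fact. A sketch of that check, under the assumption that this checkpoint follows the standard prefix-truncation scheme (verify before relying on it):

import torch.nn.functional as F

# Assumption to verify: lower Matryoshka dims are prefixes of the full vector
with torch.no_grad():
    full = model(**batch_images, embedding_dim=2560)
    native = model(**batch_images, embedding_dim=768)

truncated = F.normalize(full[:, :768].float(), dim=-1)  # keep the prefix, re-normalize
print(torch.allclose(truncated, F.normalize(native.float(), dim=-1), atol=1e-3))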

Use Cases

  • Efficient Document Retrieval: Fast search through millions of documents
  • Semantic Search: Find visually similar documents
  • Scalable Vector Search: Works with FAISS, Milvus, Pinecone, etc. (see the FAISS sketch after this list)
  • Cross-lingual Retrieval: Multilingual visual document search
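
Because every document is a single dense vector, the embeddings drop directly into off-the-shelf ANN libraries. A minimal FAISS sketch, with cosine similarity realized as inner product over L2-normalized float32 vectors (shapes follow the Quick Start example):

import faiss
import numpy as np
import torch.nn.functional as F

# L2-normalize and convert to contiguous float32 numpy arrays (what FAISS expects)
doc_vecs = np.ascontiguousarray(F.normalize(image_embeddings.float(), dim=-1).cpu().numpy())
query_vecs = np.ascontiguousarray(F.normalize(query_embeddings.float(), dim=-1).cpu().numpy())

index = faiss.IndexFlatIP(doc_vecs.shape[1])  # inner product == cosine on normalized vectors
index.add(doc_vecs)                           # index the document embeddings
scores, ids = index.search(query_vecs, 2)     # top-2 document indices per query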

Model Details

  • Base Model: Gemma3-4B-IT
  • Vision Encoder: SigLIP
  • Training Data: Multilingual document datasets
  • Embedding Strategy: Single-vector (BiEncoder)
  • Similarity Function: Cosine similarity
  • Matryoshka Dimensions: 768, 1536, 2560

Performance

NetraEmbed achieves state-of-the-art performance on multilingual document retrieval benchmarks, evaluated on Nayana-IR Bench (22 languages) and ViDoRe v2.
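
For reference, NDCG@k, the headline metric in the tables below, discounts each retrieved document's graded relevance rel_i by its rank i and normalizes by the score of an ideal ordering:

\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i + 1)}, \qquad \mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}

where IDCG@k is DCG@k under the best possible ranking, so scores lie in [0, 1].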

Benchmark Results

Nayana-IR Cross-Lingual

Model                NDCG@5   Recall@10   MAP@10   MRR@10
NetraEmbed           0.716    0.871       0.703    0.775
Jina-Embeddings-v4   0.435    0.435       0.390    0.548
ColNomic-Embed-3B    0.315    0.320       0.267    0.444
ColPali-v1.3         0.284    0.347       0.249    0.403
GME-Qwen2-VL-2B      0.235    0.308       0.209    0.314
ColQwen2.5-v0.2      0.143    0.160       0.127    0.220
ColQwen2-v1.0        0.050    0.065       0.038    0.109

Nayana-IR Monolingual

Model                NDCG@5   Recall@10   MAP@10   MRR@10
NetraEmbed           0.738    0.844       0.709    0.751
ColNomic-Embed-3B    0.534    0.603       0.515    0.546
ColQwen2.5-v0.2      0.453    0.513       0.437    0.464
GME-Qwen2-VL-2B      0.444    0.525       0.426    0.452
ColQwen2-v1.0        0.413    0.466       0.398    0.422
ColPali-v1.3         0.410    0.484       0.393    0.422

ViDoRe v2

Model                NDCG@5   Recall@10   MAP@10   MRR@10
ColQwen2.5-v0.2      0.592    0.664       0.484    0.711
Jina-Embeddings-v4   0.576    0.686       -        -
GME-Qwen2-VL-2B      0.574    0.630       0.466    0.690
ColNomic-Embed-3B    0.556    0.633       0.451    0.672
NetraEmbed           0.554    0.637       0.437    0.647
ColQwen2-v1.0        0.545    0.640       0.438    0.653
ColPali-v1.3         0.538    0.627       0.436    0.644

Key Results:

  • ๐Ÿ† State-of-the-art on multilingual retrieval (0.716 NDCG@5 cross-lingual)
  • ๐Ÿ“ˆ 152% improvement over ColPali-v1.3 on cross-lingual tasks
  • ๐ŸŒ Consistent performance across 22 languages and diverse scripts
  • โšก 250x more efficient than multi-vector approaches (~10KB vs ~2.5MB per document)
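
The storage figure is straightforward to verify: one 2560-dimensional float32 vector costs 2560 × 4 bytes = 10,240 bytes ≈ 10 KB per document (about 3 KB at 768 dimensions), while multi-vector approaches store one vector per image patch per page, which is where per-document footprints on the order of megabytes come from.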

See our paper for comprehensive evaluation and per-language analysis.

Citation

@misc{kolavi2025m3druniversalmultilingualmultimodal,
  title={M3DR: Towards Universal Multilingual Multimodal Document Retrieval}, 
  author={Adithya S Kolavi and Vyoman Jain},
  year={2025},
  eprint={2512.03514},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2512.03514}
}

License

This model is released under the same license as the base Gemma3 model.

Acknowledgments

Compute for training, inference, and evaluation was provided through credits from Modal, our compute sponsor. Dataset curation and synthesis were supported by the Meta LLaMA Impact Grant through our Nayana initiative. We thank Meta for its continued support of our research at CognitiveLab.

Built on top of the Gemma3 architecture with Matryoshka representation learning.
