---
language:
- en
license: gemma
library_name: transformers
tags:
- vision-language
- retrieval
- colbert
- late-interaction
pipeline_tag: visual-document-retrieval
base_model:
- google/gemma-3-4b-it
---
# ColNetraEmbed
ColNetraEmbed is a state-of-the-art multilingual multimodal embedding model for visual document retrieval, powered by a Gemma3 backbone and using ColBERT-style multi-vector representations.
## Model Description
ColNetraEmbed is a multilingual multimodal embedding model that encodes documents as multi-vector representations using the ColPali architecture. Each image patch is mapped to a contextualized embedding, enabling fine-grained matching between visual content and text queries through late interaction (MaxSim).
- Model Type: Multilingual Multimodal Embedding Model with ColPali-style Multi-vector representations
- Architecture: ColPali-style late interaction over a Gemma3 backbone (google/gemma-3-4b-it)
- Embedding Dimension: 128 per token
- Capabilities: Multilingual, Multimodal (Vision + Text), Multi-vector late interaction
- Use Case: Visual document retrieval, multilingual document understanding, fine-grained visual search
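
The MaxSim late-interaction score described above can be written in a few lines of plain PyTorch. This is a minimal, self-contained sketch, independent of the `colpali_engine` helpers; the tensor shapes follow the 128-dimensional multi-vector layout listed here, and the toy sizes are illustrative only.

```python
import torch


def maxsim_scores(query_embs: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) scoring.

    query_embs: (num_queries, num_query_tokens, dim)
    doc_embs:   (num_docs, num_patches, dim)
    returns:    (num_queries, num_docs)
    """
    # Token-to-patch similarity for every (query, doc) pair:
    # shape (num_queries, num_docs, num_query_tokens, num_patches)
    sim = torch.einsum("qtd,npd->qntp", query_embs, doc_embs)
    # For each query token, keep its best-matching patch, then sum over query tokens.
    return sim.max(dim=-1).values.sum(dim=-1)


# Toy example with 128-dim multi-vector embeddings.
q = torch.randn(2, 16, 128)    # 2 queries, 16 query tokens each
d = torch.randn(3, 1024, 128)  # 3 document images, 1024 patches each
print(maxsim_scores(q, d).shape)  # torch.Size([2, 3])
```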
## Paper
📄 [M3DR: Towards Universal Multilingual Multimodal Document Retrieval](https://arxiv.org/abs/2512.03514)
## Installation
```bash
pip install git+https://github.com/adithya-s-k/colpali.git
```
## Quick Start
```python
import torch
from PIL import Image

from colpali_engine.models import ColGemma3, ColGemmaProcessor3

# Load model and processor
model_name = "Cognitive-Lab/ColNetraEmbed"
model = ColGemma3.from_pretrained(
    model_name,
    dtype=torch.bfloat16,
    device_map="cuda",
)
processor = ColGemmaProcessor3.from_pretrained(model_name)

# Load your images
images = [
    Image.open("document1.jpg"),
    Image.open("document2.jpg"),
]

# Define queries
queries = [
    "What is the total revenue?",
    "Show me the organizational chart",
]

# Process and encode
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)  # Shape: (num_images, num_patches, 128)
    query_embeddings = model(**batch_queries)  # Shape: (num_queries, num_tokens, 128)

# Compute similarity scores using MaxSim
scores = processor.score_multi_vector(
    qs=query_embeddings,
    ps=image_embeddings,
)  # Shape: (num_queries, num_images)

# Get best matches
for i, query in enumerate(queries):
    best_idx = scores[i].argmax().item()
    print(f"Query: '{query}' -> Best match: Image {best_idx + 1} (score: {scores[i, best_idx]:.2f})")
```
## Use Cases
- Document Retrieval: Search through large collections of visual documents
- Visual Question Answering: Answer questions about document content
- Document Understanding: Extract and match information from scanned documents
- Cross-lingual Document Search: Multilingual visual document retrieval (see the sketch after this list)
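
As an illustration of the cross-lingual use case, queries in different languages can be scored against the same image embeddings. This sketch reuses `processor`, `model`, and `image_embeddings` from the Quick Start; the example queries are illustrative only.

```python
import torch

# Illustrative queries in several languages against the same visual corpus.
multilingual_queries = [
    "What is the total revenue?",               # English
    "Quel est le chiffre d'affaires total ?",   # French
    "¿Cuál es el ingreso total?",               # Spanish
]

batch = processor.process_queries(multilingual_queries).to(model.device)
with torch.no_grad():
    multilingual_embeddings = model(**batch)

# Every query, regardless of language, is compared to the same image embeddings.
scores = processor.score_multi_vector(qs=multilingual_embeddings, ps=image_embeddings)
print(scores.shape)  # (3, num_images)
```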
## Model Details
- Base Model: Gemma3 (google/gemma-3-4b-it)
- Vision Encoder: SigLIP
- Training Data: Multilingual document datasets
- Embedding Strategy: Multi-vector (Late Interaction)
- Similarity Function: MaxSim (Maximum Similarity)
## Performance
ColNetraEmbed achieves state-of-the-art results on visual document retrieval benchmarks. See our paper for detailed evaluation metrics.
## Citation
```bibtex
@misc{kolavi2025m3druniversalmultilingualmultimodal,
  title={M3DR: Towards Universal Multilingual Multimodal Document Retrieval},
  author={Adithya S Kolavi and Vyoman Jain},
  year={2025},
  eprint={2512.03514},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2512.03514}
}
```
## License
This model is released under the same license as the base Gemma3 model.
## Acknowledgments
Built on top of the ColPali framework and Gemma3 architecture.