---
language:
  - en
  - es
  - fr
  - de
  - it
  - hi
  - mr
  - sa
  - kn
  - te
  - ta
  - ml
  - zh
  - ja
  - ko
  - ar
  - bn
  - gu
  - or
  - pa
  - ru
  - th
license: gemma
library_name: transformers
tags:
  - vision-language
  - retrieval
  - colbert
  - late-interaction
  - multimodal
  - multilingual
  - document-retrieval
  - 22-languages
pipeline_tag: visual-document-retrieval
base_model:
  - google/gemma-3-4b-it
datasets:
  - Cognitive-Lab/nayanair-bench
model-index:
  - name: ColNetraEmbed
    results:
      - task:
          type: image-text-retrieval
          name: Cross-Lingual Document Retrieval
        dataset:
          type: Cognitive-Lab/nayanair-bench
          name: Nayana-IR Cross-Lingual
          split: test
        metrics:
          - type: ndcg_at_5
            value: 0.637
            name: NDCG@5
          - type: recall_at_10
            value: 0.7
            name: Recall@10
          - type: map_at_10
            value: 0.61
            name: MAP@10
          - type: mrr_at_10
            value: 0.61
            name: MRR@10
      - task:
          type: image-text-retrieval
          name: Monolingual Document Retrieval
        dataset:
          type: Cognitive-Lab/nayanair-bench
          name: Nayana-IR Monolingual
          split: test
        metrics:
          - type: ndcg_at_5
            value: 0.67
            name: NDCG@5
          - type: recall_at_10
            value: 0.764
            name: Recall@10
          - type: map_at_10
            value: 0.645
            name: MAP@10
          - type: mrr_at_10
            value: 0.686
            name: MRR@10
      - task:
          type: image-text-retrieval
          name: English Document Retrieval
        dataset:
          type: vidore/vidore-benchmark
          name: ViDoRe v2
          split: test
        metrics:
          - type: ndcg_at_5
            value: 0.551
            name: NDCG@5
          - type: recall_at_10
            value: 0.664
            name: Recall@10
          - type: map_at_10
            value: 0.445
            name: MAP@10
          - type: mrr_at_10
            value: 0.445
            name: MRR@10
---

# ColNetraEmbed



ColNetraEmbed is a state-of-the-art multilingual multimodal embedding model for visual document retrieval, built on the Gemma3 backbone and using ColBERT-style multi-vector representations.

## Model Description

ColNetraEmbed is a multilingual multimodal embedding model that encodes documents as multi-vector representations using the ColPali architecture. Each image patch is mapped to a contextualized embedding, enabling fine-grained matching between visual content and text queries through late interaction (MaxSim); a minimal scoring sketch follows the list below.

- **Model Type:** Multilingual multimodal embedding model with ColPali-style multi-vector representations
- **Architecture:** ColPali with Gemma3-4B backbone
- **Embedding Dimension:** 128 per token
- **Capabilities:** Multilingual, multimodal (vision + text), multi-vector late interaction
- **Use Case:** Visual document retrieval, multilingual document understanding, fine-grained visual search
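
For each query token, MaxSim keeps the best-matching patch; summing these per-token maxima gives the relevance score. A minimal sketch of this scoring for a single query-document pair (illustrative only; the `score_multi_vector` helper shown in the Quick Start below computes it in batch):

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) score for one query-document pair.

    query_emb: (num_query_tokens, 128) token embeddings of the query
    doc_emb:   (num_patches, 128)      patch embeddings of the document
    """
    sim = query_emb @ doc_emb.T  # (num_query_tokens, num_patches)
    # Best-matching patch per query token, summed over the query
    return sim.max(dim=1).values.sum()
```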

## Paper

📄 [M3DR: Towards Universal Multilingual Multimodal Document Retrieval](https://arxiv.org/abs/2512.03514)

## Installation

```bash
pip install git+https://github.com/adithya-s-k/colpali.git
```

## Quick Start

```python
import torch
from PIL import Image
from colpali_engine.models import ColGemma3, ColGemmaProcessor3

# Load model and processor
model_name = "Cognitive-Lab/ColNetraEmbed"
model = ColGemma3.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
processor = ColGemmaProcessor3.from_pretrained(model_name)

# Load your images
images = [
    Image.open("document1.jpg"),
    Image.open("document2.jpg"),
]

# Define queries
queries = [
    "What is the total revenue?",
    "Show me the organizational chart",
]

# Process and encode
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)  # Shape: (num_images, num_patches, 128)
    query_embeddings = model(**batch_queries)  # Shape: (num_queries, num_tokens, 128)

# Compute similarity scores using MaxSim
scores = processor.score_multi_vector(
    qs=query_embeddings,
    ps=image_embeddings,
)  # Shape: (num_queries, num_images)

# Get best matches
for i, query in enumerate(queries):
    best_idx = scores[i].argmax().item()
    print(f"Query: '{query}' -> Best match: Image {best_idx + 1} (score: {scores[i, best_idx]:.2f})")
```

## Use Cases

- **Document Retrieval:** Search through large collections of visual documents
- **Visual Question Answering:** Answer questions about document content
- **Document Understanding:** Extract and match information from scanned documents
- **Cross-Lingual Document Search:** Multilingual visual document retrieval

## Model Details

- **Base Model:** Gemma3-4B-IT
- **Vision Encoder:** SigLIP
- **Training Data:** Multilingual document datasets
- **Embedding Strategy:** Multi-vector (late interaction)
- **Similarity Function:** MaxSim (maximum similarity)

## Performance

ColNetraEmbed achieves strong performance on multilingual document retrieval benchmarks, evaluated on Nayana-IR Bench (22 languages) and ViDoRe v2.

### Benchmark Results

#### Nayana-IR Cross-Lingual

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|---|---|---|---|---|
| ColNetraEmbed | 0.637 | 0.700 | 0.610 | 0.610 |
| Jina-Embeddings-v4 | 0.435 | 0.435 | 0.390 | 0.548 |
| ColNomic-Embed-3B | 0.315 | 0.320 | 0.267 | 0.444 |
| ColPali-v1.3 | 0.284 | 0.347 | 0.249 | 0.403 |
| GME-Qwen2-VL-2B | 0.235 | 0.308 | 0.209 | 0.314 |
| ColQwen2.5-v0.2 | 0.143 | 0.160 | 0.127 | 0.220 |
| ColQwen2-v1.0 | 0.050 | 0.065 | 0.038 | 0.109 |

#### Nayana-IR Monolingual

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|---|---|---|---|---|
| ColNetraEmbed | 0.670 | 0.764 | 0.645 | 0.686 |
| ColNomic-Embed-3B | 0.534 | 0.603 | 0.515 | 0.546 |
| ColQwen2.5-v0.2 | 0.453 | 0.513 | 0.437 | 0.464 |
| GME-Qwen2-VL-2B | 0.444 | 0.525 | 0.426 | 0.452 |
| ColQwen2-v1.0 | 0.413 | 0.466 | 0.398 | 0.422 |
| ColPali-v1.3 | 0.410 | 0.484 | 0.393 | 0.422 |

#### ViDoRe v2

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|---|---|---|---|---|
| ColQwen2.5-v0.2 | 0.592 | 0.664 | 0.484 | 0.711 |
| Jina-Embeddings-v4 | 0.576 | 0.686 | - | - |
| GME-Qwen2-VL-2B | 0.574 | 0.630 | 0.466 | 0.690 |
| ColNomic-Embed-3B | 0.556 | 0.633 | 0.451 | 0.672 |
| ColNetraEmbed | 0.551 | 0.664 | 0.445 | 0.445 |
| ColQwen2-v1.0 | 0.545 | 0.640 | 0.438 | 0.653 |
| ColPali-v1.3 | 0.538 | 0.627 | 0.436 | 0.644 |

**Key Results:**

- 🏆 Strong multilingual performance with ColBERT-style late interaction
- 📈 124% improvement over ColPali-v1.3 on cross-lingual retrieval (NDCG@5: 0.637 vs 0.284)
- 🌍 Supports 22 languages across diverse script families
- 🔍 Fine-grained matching through token-level MaxSim scoring

## Comparison: Multi-vector vs Single-vector

- **ColNetraEmbed (multi-vector):** More interpretable, with token-level attribution
- **NetraEmbed (single-vector):** Higher accuracy (NDCG@5: 0.716 vs 0.637) and roughly 250x smaller storage footprint
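
Back-of-envelope arithmetic behind the storage figure (the patch count, dtype, and single-vector dimension below are assumptions for illustration; the exact ratio depends on image resolution and the NetraEmbed configuration):

```python
# Illustrative storage math -- all constants below are assumed, not measured
patches_per_page = 250   # assumed average number of multi-vector tokens per page
dim = 128                # per-token embedding dimension (from the model card)
bytes_per_value = 2      # bf16/fp16 storage

multi_vector = patches_per_page * dim * bytes_per_value  # ~64 KB per page
single_vector = dim * bytes_per_value                    # 256 B, assuming a 128-d single vector
print(multi_vector / single_vector)                      # -> 250.0
```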

See our paper for comprehensive evaluation and architectural comparisons.

## Citation

```bibtex
@misc{kolavi2025m3druniversalmultilingualmultimodal,
  title={M3DR: Towards Universal Multilingual Multimodal Document Retrieval},
  author={Adithya S Kolavi and Vyoman Jain},
  year={2025},
  eprint={2512.03514},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2512.03514}
}
```

## License

This model is released under the same license as the base Gemma3 model.

## Acknowledgments

Compute credits for training, inference, and evaluation were provided by Modal, our compute sponsor. Dataset curation and synthesis were supported by the Meta LLaMA Impact Grant through our Nayana initiative; we thank Meta for their continued support of our research at CognitiveLab.

Built on top of the ColPali framework and Gemma3 architecture.