---
language:
  - en
license: gemma
library_name: transformers
tags:
  - vision-language
  - retrieval
  - colbert
  - late-interaction
pipeline_tag: visual-document-retrieval
base_model:
  - google/gemma-3-4b-it
---

ColNetraEmbed

ColNetraEmbed is a state-of-the-art multilingual multimodal embedding model for visual document retrieval, powered by a Gemma3 backbone and using ColBERT-style multi-vector representations.

Model Description

ColNetraEmbed is a multilingual multimodal embedding model that encodes documents as multi-vector representations using the ColPali architecture. Each image patch is mapped to a contextualized embedding, enabling fine-grained matching between visual content and text queries through late interaction (MaxSim).

  • Model Type: Multilingual Multimodal Embedding Model with ColPali-style Multi-vector representations
  • Architecture: ColPali-style architecture with a Gemma3-4B backbone (google/gemma-3-4b-it)
  • Embedding Dimension: 128 per token
  • Capabilities: Multilingual, Multimodal (Vision + Text), Multi-vector late interaction
  • Use Case: Visual document retrieval, multilingual document understanding, fine-grained visual search
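
The late-interaction scoring behind these multi-vector representations works as follows: every query token embedding is compared against every image patch embedding, the best-matching patch is kept for each query token (the "max" in MaxSim), and the per-token maxima are summed into a single relevance score. The score_multi_vector call shown in Quick Start below computes this for you; the snippet here is only an illustrative sketch for a single query/document pair (the maxsim_score helper is ours, not part of the library).

import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    # query_emb: (num_query_tokens, 128) contextualized query token embeddings
    # doc_emb:   (num_patches, 128) contextualized image patch embeddings
    # Pairwise similarities between every query token and every image patch.
    sim = query_emb @ doc_emb.T                # (num_query_tokens, num_patches)
    # Keep each query token's best-matching patch, then sum over query tokens.
    return sim.amax(dim=1).sum()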

Paper

📄 M3DR: Towards Universal Multilingual Multimodal Document Retrieval

Installation

pip install git+https://github.com/adithya-s-k/colpali.git

Quick Start

import torch
from PIL import Image
from colpali_engine.models import ColGemma3, ColGemmaProcessor3

# Load model and processor
model_name = "Cognitive-Lab/ColNetraEmbed"
model = ColGemma3.from_pretrained(
    model_name,
    dtype=torch.bfloat16,
    device_map="cuda",
)
processor = ColGemmaProcessor3.from_pretrained(model_name)

# Load your images
images = [
    Image.open("document1.jpg"),
    Image.open("document2.jpg"),
]

# Define queries
queries = [
    "What is the total revenue?",
    "Show me the organizational chart",
]

# Process and encode
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)  # Shape: (num_images, num_patches, 128)
    query_embeddings = model(**batch_queries)  # Shape: (num_queries, num_tokens, 128)

# Compute similarity scores using MaxSim
scores = processor.score_multi_vector(
    qs=query_embeddings,
    ps=image_embeddings,
)  # Shape: (num_queries, num_images)

# Get best matches
for i, query in enumerate(queries):
    best_idx = scores[i].argmax().item()
    print(f"Query: '{query}' -> Best match: Image {best_idx + 1} (score: {scores[i, best_idx]:.2f})")

Use Cases

  • Document Retrieval: Search through large collections of visual documents
  • Visual Question Answering: Answer questions about document content
  • Document Understanding: Extract and match information from scanned documents
  • Cross-lingual Document Search: Multilingual visual document retrieval (see the sketch after this list)
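
Because queries are encoded by the same multilingual backbone, non-English queries go through the same process_queries call shown in Quick Start. A brief sketch, reusing the model, processor, and image embeddings from above (the example queries are made up):

# Queries in other languages are processed exactly like English ones.
multilingual_queries = [
    "Quel est le chiffre d'affaires total ?",  # French: "What is the total revenue?"
    "कुल राजस्व क्या है?",                     # Hindi: "What is the total revenue?"
]
batch_multilingual = processor.process_queries(multilingual_queries).to(model.device)
with torch.no_grad():
    multilingual_query_embeddings = model(**batch_multilingual)

multilingual_scores = processor.score_multi_vector(
    qs=multilingual_query_embeddings,
    ps=image_embeddings,
)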

Model Details

  • Base Model: Gemma3-4B (google/gemma-3-4b-it)
  • Vision Encoder: SigLIP
  • Training Data: Multilingual document datasets
  • Embedding Strategy: Multi-vector (Late Interaction)
  • Similarity Function: MaxSim (Maximum Similarity)

Performance

ColNetraEmbed achieves state-of-the-art results on visual document retrieval benchmarks. See our paper for detailed evaluation metrics.

Citation

@misc{kolavi2025m3druniversalmultilingualmultimodal,
  title={M3DR: Towards Universal Multilingual Multimodal Document Retrieval}, 
  author={Adithya S Kolavi and Vyoman Jain},
  year={2025},
  eprint={2512.03514},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2512.03514}
}

License

This model is released under the Gemma license, the same license as its Gemma3 base model.

Acknowledgments

Built on top of the ColPali framework and Gemma3 architecture.