---
language:
- en
license: gemma
library_name: transformers
tags:
- vision-language
- retrieval
- colbert
- late-interaction
pipeline_tag: visual-document-retrieval
base_model:
- google/gemma-3-4b-it
---

# ColNetraEmbed

**ColNetraEmbed** is a state-of-the-art multilingual multimodal embedding model for visual document retrieval, powered by a Gemma3 backbone and ColBERT-style multi-vector representations.

## Model Description

ColNetraEmbed is a multilingual multimodal embedding model that encodes documents as multi-vector representations using the ColPali architecture. Each image patch is mapped to a contextualized embedding, enabling fine-grained matching between visual content and text queries through late interaction (MaxSim).

- **Model Type:** Multilingual multimodal embedding model with ColPali-style multi-vector representations
- **Architecture:** ColPali with Gemma3-4B backbone
- **Embedding Dimension:** 128 per token
- **Capabilities:** Multilingual, multimodal (vision + text), multi-vector late interaction
- **Use Case:** Visual document retrieval, multilingual document understanding, fine-grained visual search

## Paper

📄 **[M3DR: Towards Universal Multilingual Multimodal Document Retrieval](https://arxiv.org/abs/2512.03514)**

## Installation

```bash
pip install git+https://github.com/adithya-s-k/colpali.git
```

## Quick Start

```python
import torch
from PIL import Image

from colpali_engine.models import ColGemma3, ColGemmaProcessor3

# Load model and processor
model_name = "Cognitive-Lab/ColNetraEmbed"
model = ColGemma3.from_pretrained(
    model_name,
    dtype=torch.bfloat16,
    device_map="cuda",
)
processor = ColGemmaProcessor3.from_pretrained(model_name)

# Load your images
images = [
    Image.open("document1.jpg"),
    Image.open("document2.jpg"),
]

# Define queries
queries = [
    "What is the total revenue?",
    "Show me the organizational chart",
]

# Process and encode
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)   # Shape: (num_images, num_patches, 128)
    query_embeddings = model(**batch_queries)  # Shape: (num_queries, num_tokens, 128)

# Compute similarity scores using MaxSim
scores = processor.score_multi_vector(
    qs=query_embeddings,
    ps=image_embeddings,
)  # Shape: (num_queries, num_images)

# Get best matches
for i, query in enumerate(queries):
    best_idx = scores[i].argmax().item()
    print(f"Query: '{query}' -> Best match: Image {best_idx + 1} (score: {scores[i, best_idx]:.2f})")
```

## Use Cases

- **Document Retrieval:** Search through large collections of visual documents
- **Visual Question Answering:** Retrieve the pages that answer questions about document content
- **Document Understanding:** Extract and match information from scanned documents
- **Cross-lingual Document Search:** Query documents in one language with queries in another

## Model Details

- **Base Model:** Gemma3-4B (`google/gemma-3-4b-it`)
- **Vision Encoder:** SigLIP
- **Training Data:** Multilingual document datasets
- **Embedding Strategy:** Multi-vector (late interaction)
- **Similarity Function:** MaxSim (maximum similarity)

## Performance

ColNetraEmbed achieves state-of-the-art results on visual document retrieval benchmarks. See our [paper](https://arxiv.org/abs/2512.03514) for detailed evaluation metrics.
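## How MaxSim Scoring Works

For reference, the late-interaction score used above can be written out directly: every query token is matched against its most similar image patch, and these per-token maxima are summed. The sketch below is a minimal re-implementation of what `processor.score_multi_vector` computes, assuming fixed-length (unpadded) embedding tensors; in practice, padded query tokens would need to be masked out first.

```python
import torch

def maxsim_scores(query_embeddings: torch.Tensor, image_embeddings: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) scoring.

    query_embeddings: (num_queries, num_query_tokens, dim)
    image_embeddings: (num_images, num_patches, dim)
    Returns: (num_queries, num_images) score matrix.
    """
    # Token-to-patch similarities: (num_queries, num_images, num_query_tokens, num_patches)
    sim = torch.einsum("qtd,ipd->qitp", query_embeddings, image_embeddings)
    # For each query token, keep its best-matching patch, then sum over query tokens.
    return sim.max(dim=-1).values.sum(dim=-1)
```

Because scoring reduces to these dot products, image embeddings can be precomputed offline and only the (much smaller) query embeddings computed at search time.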
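## Indexing a Larger Corpus

For larger corpora, pages are typically embedded once and cached, with only queries encoded at search time. Below is a minimal sketch that reuses the `model` and `processor` from the Quick Start; the file names, batch size, and the French query are placeholders, and it assumes `score_multi_vector` accepts lists of per-page tensors, as in the upstream colpali-engine.

```python
import torch
from PIL import Image

# Hypothetical corpus of page images; replace with your own files.
page_paths = [f"pages/page_{i:03d}.jpg" for i in range(12)]
batch_size = 4  # assumption: tune to your GPU memory

# Embed the corpus once and keep per-page embeddings on the CPU.
page_embeddings = []
for start in range(0, len(page_paths), batch_size):
    images = [Image.open(p) for p in page_paths[start:start + batch_size]]
    batch = processor.process_images(images).to(model.device)
    with torch.no_grad():
        embeddings = model(**batch)
    page_embeddings.extend(e.to("cpu") for e in embeddings)

# Queries in any supported language are encoded the same way.
queries = ["Quelle est la date de signature du contrat ?"]
batch_queries = processor.process_queries(queries).to(model.device)
with torch.no_grad():
    query_embeddings = model(**batch_queries).to("cpu")

# Rank all cached pages against the query and print the top 3.
scores = processor.score_multi_vector(qs=list(query_embeddings), ps=page_embeddings)
top = scores[0].topk(k=3)
for rank, (score, idx) in enumerate(zip(top.values.tolist(), top.indices.tolist()), start=1):
    print(f"{rank}. {page_paths[idx]} (score: {score:.2f})")
```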
## Citation

```bibtex
@misc{kolavi2025m3druniversalmultilingualmultimodal,
      title={M3DR: Towards Universal Multilingual Multimodal Document Retrieval},
      author={Adithya S Kolavi and Vyoman Jain},
      year={2025},
      eprint={2512.03514},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2512.03514}
}
```

## License

This model is released under the same license as the base Gemma3 model.

## Acknowledgments

Built on top of the ColPali framework and the Gemma3 architecture.