BGE-M3 ONNX

Complete BGE-M3 embedding model converted to ONNX format with full multi-vector functionality. While the original BAAI model on Hugging Face has an ONNX export available, that export does not support sparse and ColBERT vector generation; this model does.

The files for this model also include a tokenizer ONNX model that ONNX Runtime can run natively via ONNX Runtime Extensions, enabling native usage across multiple programming languages, including C#, Java, and Python.
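
In practice this means the tokenizer graph uses custom operators from ONNX Runtime Extensions, so the extensions library has to be registered with the session before loading it. A minimal Python sketch of running the tokenizer on its own (input and output names are read from the graph rather than assumed):

import numpy as np
import onnxruntime as ort
from onnxruntime_extensions import get_library_path

# Register the ONNX Runtime Extensions custom ops used by the tokenizer graph
sess_options = ort.SessionOptions()
sess_options.register_custom_ops_library(get_library_path())

tokenizer = ort.InferenceSession("bge_m3_tokenizer.onnx", sess_options)

# Read the input name from the graph instead of assuming it
input_name = tokenizer.get_inputs()[0].name
outputs = tokenizer.run(None, {input_name: np.array(["Hello world!"])})
for meta, value in zip(tokenizer.get_outputs(), outputs):
    print(meta.name, np.asarray(value).shape)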

Below you will find detailed information, examples, and links to source code to help you get started.

πŸ”— Important Links

  • πŸ“ GitHub Repository - Essential reading! Contains detailed documentation, performance benchmarks, cross-language validation tests, and implementation examples.
  • πŸ““ Conversion Notebook - Complete step-by-step conversion process from FlagEmbedding to ONNX.

⚠️ Please visit the GitHub repository for information on how this model works, performance comparisons, and detailed usage examples across multiple programming languages.

βœ… Validation Results

This ONNX conversion has been thoroughly tested and produces results identical to the original BAAI/bge-m3 model: all three embedding types (dense, sparse, and ColBERT) match the reference implementation exactly.
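
To run a quick spot-check yourself, something along these lines compares the dense output against the FlagEmbedding reference; create_cpu_embedder is the helper from the repository's Python sample shown further below:

import numpy as np
from FlagEmbedding import BGEM3FlagModel
from bge_m3_embedder import create_cpu_embedder  # helper from the repo's Python sample

reference = BGEM3FlagModel("BAAI/bge-m3")
onnx_embedder = create_cpu_embedder("bge_m3_tokenizer.onnx", "bge_m3_model.onnx")

text = "Hello world!"
ref_dense = reference.encode([text], return_dense=True)["dense_vecs"][0]
onnx_dense = np.asarray(onnx_embedder.encode(text)["dense_vecs"])

# Both paths should agree to floating-point tolerance
print(np.allclose(ref_dense, onnx_dense, atol=1e-5))
onnx_embedder.close()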

Model Details

Model Description

BGE-M3 ONNX is a converted version of the BAAI/bge-m3 model optimized for cross-platform deployment (C#, Java, Python). This conversion enables all three embedding types (dense, sparse, and ColBERT vectors) that are not supported by the original model's ONNX version.

Uses

Direct Use

This ONNX model enables:

  • Cross-platform deployment: Use BGE-M3 embeddings in C#, Java, Python, and other languages
  • Offline inference: Generate embeddings locally without API dependencies
  • GPU acceleration: CUDA support for improved performance
  • Multi-vector output: Generate dense, sparse, and ColBERT embeddings simultaneously
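
To make the last point concrete, the two ONNX files chain into a simple two-stage pipeline. A sketch using ONNX Runtime directly; the tokenizer's output order and the model's input order are assumptions here, so check the graph metadata if you go this route:

import numpy as np
import onnxruntime as ort
from onnxruntime_extensions import get_library_path

so = ort.SessionOptions()
so.register_custom_ops_library(get_library_path())  # tokenizer needs the extensions ops
tokenizer = ort.InferenceSession("bge_m3_tokenizer.onnx", so)
model = ort.InferenceSession("bge_m3_model.onnx")

# Stage 1: raw strings -> token ids (assumes the first two outputs are ids and attention mask)
tok_out = tokenizer.run(None, {tokenizer.get_inputs()[0].name: np.array(["Hello world!"])})
input_ids, attention_mask = tok_out[0], tok_out[1]

# Stage 2: token ids -> dense, sparse, and ColBERT outputs in a single pass
feeds = dict(zip([i.name for i in model.get_inputs()], [input_ids, attention_mask]))
for meta, value in zip(model.get_outputs(), model.run(None, feeds)):
    print(meta.name, np.asarray(value).shape)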

Downstream Use

Perfect for applications requiring:

  • Semantic search and retrieval
  • Document similarity and clustering
  • Cross-lingual information retrieval
  • Hybrid search systems (combining dense and sparse retrieval)
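
As an illustration of the last point, a hybrid retrieval score can be a weighted sum of the dense and sparse similarities. A sketch that assumes the encode() output format from the Python sample below, with purely illustrative weights:

import numpy as np

def hybrid_score(query, doc, dense_weight=0.6, sparse_weight=0.4):
    # Dense part: dot product (BGE-M3 dense vectors are L2-normalized,
    # so this is cosine similarity)
    dense_sim = float(np.dot(query["dense_vecs"], doc["dense_vecs"]))

    # Sparse part: sum of weight products over tokens shared by query and document
    q, d = query["lexical_weights"], doc["lexical_weights"]
    sparse_sim = sum(q[t] * d[t] for t in q.keys() & d.keys())

    return dense_weight * dense_sim + sparse_weight * sparse_sim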

How to Get Started with the Model

Python Usage

Full Sample: https://github.com/yuniko-software/bge-m3-onnx/tree/main/samples/python

from bge_m3_embedder import create_cpu_embedder, create_cuda_embedder

# Create CPU-optimized embedder
embedder = create_cpu_embedder("bge_m3_tokenizer.onnx", "bge_m3_model.onnx")

# Generate all three embedding types
result = embedder.encode("Hello world!")

print(f"Dense: {len(result['dense_vecs'])} dimensions")
print(f"Sparse: {len(result['lexical_weights'])} tokens")
print(f"ColBERT: {len(result['colbert_vecs'])} vectors")

# Clean up resources
embedder.close()

# For CUDA acceleration
cuda_embedder = create_cuda_embedder("bge_m3_tokenizer.onnx", "bge_m3_model.onnx", device_id=0)
result = cuda_embedder.encode("Hello world!")
cuda_embedder.close()

# See full implementation: https://github.com/yuniko-software/bge-m3-onnx/tree/main/samples/python

C# Usage

Full Sample: https://github.com/yuniko-software/bge-m3-onnx/tree/main/samples/dotnet

using BgeM3.Onnx;

// Create CPU-optimized embedder
using var embedder = M3EmbedderFactory.CreateCpuOptimized("bge_m3_tokenizer.onnx", "bge_m3_model.onnx");

// Generate all embedding types
var result = embedder.GenerateEmbeddings("Hello world!");

Console.WriteLine($"Dense: {result.DenseEmbedding.Length} dimensions");
Console.WriteLine($"Sparse: {result.SparseWeights.Count} tokens");
Console.WriteLine($"ColBERT: {result.ColBertVectors.Length} vectors");

// For CUDA acceleration
using var cudaEmbedder = M3EmbedderFactory.CreateCudaOptimized("bge_m3_tokenizer.onnx", "bge_m3_model.onnx", deviceId: 0);
var cudaResult = cudaEmbedder.GenerateEmbeddings("Hello world!");

// See full implementation: https://github.com/yuniko-software/bge-m3-onnx/tree/main/samples/dotnet

Java Usage

Full Sample: https://github.com/yuniko-software/bge-m3-onnx/tree/main/samples/java/bge-m3-onnx

import com.yunikosoftware.bgem3onnx.*;

// Create CPU-optimized embedder
try (M3Embedder embedder = M3EmbedderFactory.createCpuOptimized("bge_m3_tokenizer.onnx", "bge_m3_model.onnx")) {
    // Generate all embedding types
    M3EmbeddingOutput result = embedder.generateEmbeddings("Hello world!");
    
    System.out.println("Dense: " + result.getDenseEmbedding().length + " dimensions");
    System.out.println("Sparse: " + result.getSparseWeights().size() + " tokens");
    System.out.println("ColBERT: " + result.getColBertVectors().length + " vectors");
}

// For CUDA acceleration
try (M3Embedder cudaEmbedder = M3EmbedderFactory.createCudaOptimized("bge_m3_tokenizer.onnx", "bge_m3_model.onnx", 0)) {
    M3EmbeddingOutput result = cudaEmbedder.generateEmbeddings("Hello world!");
    // Process CUDA results
}

// See full implementation: https://github.com/yuniko-software/bge-m3-onnx/tree/main/samples/java/bge-m3-onnx

Model Files

The files for this model include:

  • bge_m3_tokenizer.onnx - ONNX tokenizer for text preprocessing
  • bge_m3_model.onnx - Main BGE-M3 embedding model graph
  • bge_m3_model.onnx_data - Model weights in external data format
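
Note that ONNX Runtime resolves bge_m3_model.onnx_data relative to bge_m3_model.onnx, so both files must end up in the same directory. A sketch using huggingface_hub (assuming the files sit at the repository root):

from huggingface_hub import hf_hub_download

# The external-data file must land next to bge_m3_model.onnx,
# otherwise ONNX Runtime cannot resolve the weights.
for filename in ["bge_m3_tokenizer.onnx", "bge_m3_model.onnx", "bge_m3_model.onnx_data"]:
    path = hf_hub_download(repo_id="yuniko-software/bge-m3-onnx", filename=filename, local_dir=".")
    print(path)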

Contact

For questions about this ONNX conversion, please visit the repository or open an issue.
