BGE-M3 ONNX

Complete BGE-M3 embedding model converted to ONNX format with full multi-vector functionality. While the original BAAI model on Hugging Face has an ONNX export available, that export does not support sparse and ColBERT vector generation; this model does.

The files for this model also include a tokenizer ONNX model that ONNX Runtime can run natively via ONNX Runtime Extensions, enabling native usage across multiple programming languages, including C#, Java, and Python.
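
In practice this means the tokenizer graph uses custom operators from ONNX Runtime Extensions, so the extensions library has to be registered with the session before loading it. A minimal Python sketch of running the tokenizer on its own (input and output names are read from the graph rather than assumed):

import numpy as np
import onnxruntime as ort
from onnxruntime_extensions import get_library_path

# Register the ONNX Runtime Extensions custom ops used by the tokenizer graph
sess_options = ort.SessionOptions()
sess_options.register_custom_ops_library(get_library_path())

tokenizer = ort.InferenceSession("bge_m3_tokenizer.onnx", sess_options)

# Read the input name from the graph instead of assuming it
input_name = tokenizer.get_inputs()[0].name
outputs = tokenizer.run(None, {input_name: np.array(["Hello world!"])})
for meta, value in zip(tokenizer.get_outputs(), outputs):
    print(meta.name, np.asarray(value).shape)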

Below you will find detailed information, examples, and links to source code to help you get started.

πŸ”— Important Links

  • πŸ“ GitHub Repository - Essential reading! Contains detailed documentation, performance benchmarks, cross-language validation tests, and implementation examples.
  • πŸ““ Conversion Notebook - Complete step-by-step conversion process from FlagEmbedding to ONNX.

⚠️ Please visit the GitHub repository for information on how this model works, performance comparisons, and detailed usage examples across multiple programming languages.

βœ… Validation Results

This ONNX conversion has been thoroughly tested and produces results identical to the original BAAI/bge-m3 model: all three embedding types (dense, sparse, and ColBERT) match the reference implementation exactly.
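
To run a quick spot-check yourself, something along these lines compares the dense output against the FlagEmbedding reference; create_cpu_embedder is the helper from the repository's Python sample shown further below:

import numpy as np
from FlagEmbedding import BGEM3FlagModel
from bge_m3_embedder import create_cpu_embedder  # helper from the repo's Python sample

reference = BGEM3FlagModel("BAAI/bge-m3")
onnx_embedder = create_cpu_embedder("bge_m3_tokenizer.onnx", "bge_m3_model.onnx")

text = "Hello world!"
ref_dense = reference.encode([text], return_dense=True)["dense_vecs"][0]
onnx_dense = np.asarray(onnx_embedder.encode(text)["dense_vecs"])

# Both paths should agree to floating-point tolerance
print(np.allclose(ref_dense, onnx_dense, atol=1e-5))
onnx_embedder.close()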

Model Details

Model Description

BGE-M3 ONNX is a converted version of the BAAI/bge-m3 model optimized for cross-platform deployment (C#, Java, Python). This conversion enables all three embedding types (dense, sparse, and ColBERT vectors) that are not supported by the original model's ONNX version.

Uses

Direct Use

This ONNX model enables:

  • Cross-platform deployment: Use BGE-M3 embeddings in C#, Java, Python, and other languages
  • Offline inference: Generate embeddings locally without API dependencies
  • GPU acceleration: CUDA support for improved performance
  • Multi-vector output: Generate dense, sparse, and ColBERT embeddings simultaneously
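
To make the last point concrete, the two ONNX files chain into a simple two-stage pipeline. A sketch using ONNX Runtime directly; the tokenizer's output order and the model's input order are assumptions here, so check the graph metadata if you go this route:

import numpy as np
import onnxruntime as ort
from onnxruntime_extensions import get_library_path

so = ort.SessionOptions()
so.register_custom_ops_library(get_library_path())  # tokenizer needs the extensions ops
tokenizer = ort.InferenceSession("bge_m3_tokenizer.onnx", so)
model = ort.InferenceSession("bge_m3_model.onnx")

# Stage 1: raw strings -> token ids (assumes the first two outputs are ids and attention mask)
tok_out = tokenizer.run(None, {tokenizer.get_inputs()[0].name: np.array(["Hello world!"])})
input_ids, attention_mask = tok_out[0], tok_out[1]

# Stage 2: token ids -> dense, sparse, and ColBERT outputs in a single pass
feeds = dict(zip([i.name for i in model.get_inputs()], [input_ids, attention_mask]))
for meta, value in zip(model.get_outputs(), model.run(None, feeds)):
    print(meta.name, np.asarray(value).shape)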

Downstream Use

Perfect for applications requiring:

  • Semantic search and retrieval
  • Document similarity and clustering
  • Cross-lingual information retrieval
  • Hybrid search systems (combining dense and sparse retrieval)
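
As an illustration of the last point, a hybrid retrieval score can be a weighted sum of the dense and sparse similarities. A sketch that assumes the encode() output format from the Python sample below, with purely illustrative weights:

import numpy as np

def hybrid_score(query, doc, dense_weight=0.6, sparse_weight=0.4):
    # Dense part: dot product (BGE-M3 dense vectors are L2-normalized,
    # so this is cosine similarity)
    dense_sim = float(np.dot(query["dense_vecs"], doc["dense_vecs"]))

    # Sparse part: sum of weight products over tokens shared by query and document
    q, d = query["lexical_weights"], doc["lexical_weights"]
    sparse_sim = sum(q[t] * d[t] for t in q.keys() & d.keys())

    return dense_weight * dense_sim + sparse_weight * sparse_sim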

How to Get Started with the Model

Python Usage

Full Sample: https://github.com/yuniko-software/bge-m3-onnx/tree/main/samples/python

from bge_m3_embedder import create_cpu_embedder, create_cuda_embedder

# Create CPU-optimized embedder
embedder = create_cpu_embedder("bge_m3_tokenizer.onnx", "bge_m3_model.onnx")

# Generate all three embedding types
result = embedder.encode("Hello world!")

print(f"Dense: {len(result['dense_vecs'])} dimensions")
print(f"Sparse: {len(result['lexical_weights'])} tokens")
print(f"ColBERT: {len(result['colbert_vecs'])} vectors")

# Clean up resources
embedder.close()

# For CUDA acceleration
cuda_embedder = create_cuda_embedder("bge_m3_tokenizer.onnx", "bge_m3_model.onnx", device_id=0)
result = cuda_embedder.encode("Hello world!")
cuda_embedder.close()

# See full implementation: https://github.com/yuniko-software/bge-m3-onnx/tree/main/samples/python

C# Usage

Full Sample: https://github.com/yuniko-software/bge-m3-onnx/tree/main/samples/dotnet

using BgeM3.Onnx;

// Create CPU-optimized embedder
using var embedder = M3EmbedderFactory.CreateCpuOptimized("bge_m3_tokenizer.onnx", "bge_m3_model.onnx");

// Generate all embedding types
var result = embedder.GenerateEmbeddings("Hello world!");

Console.WriteLine($"Dense: {result.DenseEmbedding.Length} dimensions");
Console.WriteLine($"Sparse: {result.SparseWeights.Count} tokens");
Console.WriteLine($"ColBERT: {result.ColBertVectors.Length} vectors");

// For CUDA acceleration
using var cudaEmbedder = M3EmbedderFactory.CreateCudaOptimized("bge_m3_tokenizer.onnx", "bge_m3_model.onnx", deviceId: 0);
var cudaResult = cudaEmbedder.GenerateEmbeddings("Hello world!");

// See full implementation: https://github.com/yuniko-software/bge-m3-onnx/tree/main/samples/dotnet

Java Usage

Full Sample: https://github.com/yuniko-software/bge-m3-onnx/tree/main/samples/java/bge-m3-onnx

import com.yunikosoftware.bgem3onnx.*;

// Create CPU-optimized embedder
try (M3Embedder embedder = M3EmbedderFactory.createCpuOptimized("bge_m3_tokenizer.onnx", "bge_m3_model.onnx")) {
    // Generate all embedding types
    M3EmbeddingOutput result = embedder.generateEmbeddings("Hello world!");
    
    System.out.println("Dense: " + result.getDenseEmbedding().length + " dimensions");
    System.out.println("Sparse: " + result.getSparseWeights().size() + " tokens");
    System.out.println("ColBERT: " + result.getColBertVectors().length + " vectors");
}

// For CUDA acceleration
try (M3Embedder cudaEmbedder = M3EmbedderFactory.createCudaOptimized("bge_m3_tokenizer.onnx", "bge_m3_model.onnx", 0)) {
    M3EmbeddingOutput result = cudaEmbedder.generateEmbeddings("Hello world!");
    // Process CUDA results
}

// See full implementation: https://github.com/yuniko-software/bge-m3-onnx/tree/main/samples/java/bge-m3-onnx

Model Files

The files for this model include:

  • bge_m3_tokenizer.onnx - ONNX tokenizer for text preprocessing
  • bge_m3_model.onnx - Main BGE-M3 embedding model graph
  • bge_m3_model.onnx_data - Model weights in external data format
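
Note that ONNX Runtime resolves bge_m3_model.onnx_data relative to bge_m3_model.onnx, so both files must end up in the same directory. A sketch using huggingface_hub (assuming the files sit at the repository root):

from huggingface_hub import hf_hub_download

# The external-data file must land next to bge_m3_model.onnx,
# otherwise ONNX Runtime cannot resolve the weights.
for filename in ["bge_m3_tokenizer.onnx", "bge_m3_model.onnx", "bge_m3_model.onnx_data"]:
    path = hf_hub_download(repo_id="yuniko-software/bge-m3-onnx", filename=filename, local_dir=".")
    print(path)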

Contact

For questions about this ONNX conversion, please visit the repository or open an issue.
