BGE-M3 ONNX
Complete BGE-M3 embedding model converted to ONNX format with full multi-vector functionality. While the original BAAI model on Hugging Face has an ONNX export available, that export does not support sparse and ColBERT vector generation; this model does.
The model files also include an ONNX tokenizer that runs natively in ONNX Runtime (via ONNX Runtime Extensions), enabling end-to-end use from multiple programming languages, including C#, Java, and Python.
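To illustrate what native tokenizer inference looks like, here is a minimal sketch of loading the tokenizer model with the custom ops from onnxruntime-extensions registered. The tensor names are inspected rather than hard-coded, since they depend on the exported graph:

```python
# Minimal sketch: load the ONNX tokenizer with ONNX Runtime plus the custom
# tokenizer kernels shipped by onnxruntime-extensions.
import onnxruntime as ort
from onnxruntime_extensions import get_library_path

sess_options = ort.SessionOptions()
sess_options.register_custom_ops_library(get_library_path())

tokenizer = ort.InferenceSession("bge_m3_tokenizer.onnx", sess_options)
print([i.name for i in tokenizer.get_inputs()])   # discover input names
print([o.name for o in tokenizer.get_outputs()])  # discover output names
```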
Below you will find detailed information, examples, and links to source code to help you get started.
Important Links
- GitHub Repository - Essential reading! Contains detailed documentation, performance benchmarks, cross-language validation tests, and implementation examples.
- Conversion Notebook - Complete step-by-step conversion process from FlagEmbedding to ONNX.
Please visit the GitHub repository for information on how this model works, performance comparisons, and detailed usage examples across multiple programming languages.
Validation Results
This ONNX conversion has been thoroughly tested and produces results identical to the original BAAI/bge-m3 model. All three embedding types (dense, sparse, and ColBERT) maintain exact accuracy.
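A parity check along these lines can be reproduced locally. The sketch below is illustrative, not the repository's official test suite; it assumes the bge_m3_embedder helper from the Python sample further down and uses FlagEmbedding as the reference implementation:

```python
# Illustrative parity check against the reference FlagEmbedding model.
import numpy as np
from FlagEmbedding import BGEM3FlagModel
from bge_m3_embedder import create_cpu_embedder  # from the repo's Python sample

text = "Hello world!"

reference = BGEM3FlagModel("BAAI/bge-m3", use_fp16=False)
ref = reference.encode([text], return_dense=True, return_sparse=True,
                       return_colbert_vecs=True)

embedder = create_cpu_embedder("bge_m3_tokenizer.onnx", "bge_m3_model.onnx")
out = embedder.encode(text)
embedder.close()

# Dense vectors should agree to floating-point tolerance.
assert np.allclose(ref["dense_vecs"][0], out["dense_vecs"], atol=1e-5)
# Sparse outputs should activate the same tokens.
assert set(ref["lexical_weights"][0]) == set(out["lexical_weights"])
```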
Model Details
Model Description
BGE-M3 ONNX is a converted version of the BAAI/bge-m3 model optimized for cross-platform deployment (C#, Java, Python). The conversion exposes all three embedding types (dense, sparse, and ColBERT vectors), including the sparse and ColBERT outputs that the original model's ONNX version does not support.
- Developed by: BAAI (original model), converted by Yuniko Software
- Model type: Multilingual embedding model (XLM-RoBERTa based)
- Language(s): 100+ languages
- License: Apache 2.0
- Base model: BAAI/bge-m3
Model Sources
- Repository: yuniko-software/bge-m3-onnx
- Original Paper: BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
Uses
Direct Use
This ONNX model enables:
- Cross-platform deployment: Use BGE-M3 embeddings in C#, Java, Python, and other languages
- Offline inference: Generate embeddings locally without API dependencies
- GPU acceleration: CUDA support for improved performance (examples); see the execution-provider sketch after this list
- Multi-vector output: Generate dense, sparse, and ColBERT embeddings simultaneously
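For reference, this is roughly what CUDA-backed inference looks like in raw ONNX Runtime; the factory helpers in the samples presumably configure something similar (an assumption, not the samples' actual code). Requires the onnxruntime-gpu package:

```python
# Hypothetical sketch of enabling CUDA in raw ONNX Runtime.
import onnxruntime as ort

providers = [
    ("CUDAExecutionProvider", {"device_id": 0}),
    "CPUExecutionProvider",  # fallback if CUDA is unavailable
]
session = ort.InferenceSession("bge_m3_model.onnx", providers=providers)
print(session.get_providers())  # shows which providers actually loaded
```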
Downstream Use
Perfect for applications requiring:
- Semantic search and retrieval
- Document similarity and clustering
- Cross-lingual information retrieval
- Hybrid search systems (combining dense and sparse retrieval; see the scoring sketch below)
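As a sketch of the hybrid case, the BGE-M3 paper combines the retrieval signals with a weighted sum. The snippet below assumes the output format of the Python sample in the next section; the weights are arbitrary example values:

```python
# Illustrative hybrid scoring over the embedder's output dictionaries.
import numpy as np

def dense_score(q_vec, d_vec):
    q, d = np.asarray(q_vec), np.asarray(d_vec)
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))  # cosine

def sparse_score(q_weights, d_weights):
    # Sum of weight products over tokens shared by query and document.
    return sum(w * d_weights[t] for t, w in q_weights.items() if t in d_weights)

def hybrid_score(q, d, w_dense=0.6, w_sparse=0.4):  # example weights
    return (w_dense * dense_score(q["dense_vecs"], d["dense_vecs"])
            + w_sparse * sparse_score(q["lexical_weights"], d["lexical_weights"]))
```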
How to Get Started with the Model
Python Usage
Full Sample: https://github.com/yuniko-software/bge-m3-onnx/tree/main/samples/python
from bge_m3_embedder import create_cpu_embedder, create_cuda_embedder
# Create CPU-optimized embedder
embedder = create_cpu_embedder("bge_m3_tokenizer.onnx", "bge_m3_model.onnx")
# Generate all three embedding types
result = embedder.encode("Hello world!")
print(f"Dense: {len(result['dense_vecs'])} dimensions")
print(f"Sparse: {len(result['lexical_weights'])} tokens")
print(f"ColBERT: {len(result['colbert_vecs'])} vectors")
# Clean up resources
embedder.close()
# For CUDA acceleration
cuda_embedder = create_cuda_embedder("bge_m3_tokenizer.onnx", "bge_m3_model.onnx", device_id=0)
result = cuda_embedder.encode("Hello world!")
cuda_embedder.close()
# See full implementation: https://github.com/yuniko-software/bge-m3-onnx/tree/main/samples/python
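The ColBERT vectors returned above support late-interaction (MaxSim) scoring. A minimal sketch, assuming the same embedder API as the sample; note that implementations differ on whether per-token maxima are summed or averaged:

```python
# MaxSim late-interaction scoring over ColBERT token vectors (illustrative).
import numpy as np
from bge_m3_embedder import create_cpu_embedder

def colbert_score(q_vecs, d_vecs):
    q = np.asarray(q_vecs)   # (num_query_tokens, dim)
    d = np.asarray(d_vecs)   # (num_doc_tokens, dim)
    sim = q @ d.T            # pairwise dot products
    return float(sim.max(axis=1).sum())  # best doc match per query token

embedder = create_cpu_embedder("bge_m3_tokenizer.onnx", "bge_m3_model.onnx")
query = embedder.encode("Hello world!")
doc = embedder.encode("Greetings, planet!")
print(colbert_score(query["colbert_vecs"], doc["colbert_vecs"]))
embedder.close()
```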
C# Usage
Full Sample: https://github.com/yuniko-software/bge-m3-onnx/tree/main/samples/dotnet
using BgeM3.Onnx;
// Create CPU-optimized embedder
using var embedder = M3EmbedderFactory.CreateCpuOptimized("bge_m3_tokenizer.onnx", "bge_m3_model.onnx");
// Generate all embedding types
var result = embedder.GenerateEmbeddings("Hello world!");
Console.WriteLine($"Dense: {result.DenseEmbedding.Length} dimensions");
Console.WriteLine($"Sparse: {result.SparseWeights.Count} tokens");
Console.WriteLine($"ColBERT: {result.ColBertVectors.Length} vectors");
// For CUDA acceleration
using var cudaEmbedder = M3EmbedderFactory.CreateCudaOptimized("bge_m3_tokenizer.onnx", "bge_m3_model.onnx", deviceId: 0);
var cudaResult = cudaEmbedder.GenerateEmbeddings("Hello world!");
// See full implementation: https://github.com/yuniko-software/bge-m3-onnx/tree/main/samples/dotnet
Java Usage
Full Sample: https://github.com/yuniko-software/bge-m3-onnx/tree/main/samples/java/bge-m3-onnx
import com.yunikosoftware.bgem3onnx.*;
// Create CPU-optimized embedder
try (M3Embedder embedder = M3EmbedderFactory.createCpuOptimized("bge_m3_tokenizer.onnx", "bge_m3_model.onnx")) {
// Generate all embedding types
M3EmbeddingOutput result = embedder.generateEmbeddings("Hello world!");
System.out.println("Dense: " + result.getDenseEmbedding().length + " dimensions");
System.out.println("Sparse: " + result.getSparseWeights().size() + " tokens");
System.out.println("ColBERT: " + result.getColBertVectors().length + " vectors");
}
// For CUDA acceleration
try (M3Embedder cudaEmbedder = M3EmbedderFactory.createCudaOptimized("bge_m3_tokenizer.onnx", "bge_m3_model.onnx", 0)) {
M3EmbeddingOutput result = cudaEmbedder.generateEmbeddings("Hello world!");
// Process CUDA results
}
// See full implementation: https://github.com/yuniko-software/bge-m3-onnx/tree/main/samples/java/bge-m3-onnx
Model Files
The files for this model include:
- bge_m3_tokenizer.onnx - ONNX tokenizer for text preprocessing
- bge_m3_model.onnx - Main BGE-M3 embedding model graph
- bge_m3_model.onnx_data - Model weights in external data format
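Because the weights use the ONNX external-data format, bge_m3_model.onnx_data must stay in the same directory as bge_m3_model.onnx; ONNX Runtime resolves it automatically when the session is created. A quick sanity-check sketch:

```python
# Verify the external-data file sits next to the model graph before loading.
from pathlib import Path
import onnxruntime as ort

model = Path("bge_m3_model.onnx")
data = model.parent / "bge_m3_model.onnx_data"
assert data.exists(), "keep bge_m3_model.onnx_data next to bge_m3_model.onnx"

session = ort.InferenceSession(str(model))  # external weights load automatically
print([i.name for i in session.get_inputs()])
```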
Contact
For questions about this ONNX conversion, please visit the repository or open an issue.