# embeddinggemma-300m-onnx

## Model Overview
`embeddinggemma-300m-onnx` is an efficient ONNX export of Google's Gemma embedding model, with approximately 300 million parameters.
It generates high-quality semantic embeddings for text, suitable for a wide variety of NLP tasks including sentence similarity, text clustering, classification, and retrieval.
Converting the original Gemma embedding model to ONNX enables hardware-agnostic, optimized inference across CPUs and GPUs using ONNX Runtime.
## Original Model Reference
This model is based on Google's Gemma embedding architecture, known for its efficiency and multilingual capabilities.
Refer to the original Hugging Face repository and documentation here:
[google/embeddinggemma-300m](https://huggingface.co/google/embeddinggemma-300m)
Please cite the original papers when using this model in research or production.
## Intended Use Cases
- Semantic similarity scoring
- Embedding generation for search and recommendation systems
- Text classification and clustering
- Scalable, low-latency inference setups requiring ONNX-compatible models
## Repository Files
| Filename | Description |
|---|---|
| `model.onnx` | ONNX model file containing the network weights and graph |
| `config.json` | Model configuration file used by `transformers` |
| `tokenizer.json` | Tokenizer vocabulary and merges |
| `tokenizer_config.json` | Tokenizer configuration |
| `special_tokens_map.json` | Map of special tokens used during tokenization |
## Installation

```bash
pip install onnxruntime transformers huggingface_hub
```
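ONNX Runtime selects an execution provider when the session is created. As a quick sanity check (not part of the original instructions), you can list the providers available in your build and request a GPU provider explicitly; `CUDAExecutionProvider` requires the `onnxruntime-gpu` package:

```python
import onnxruntime as ort

# Show the execution providers compiled into your onnxruntime build
print(ort.get_available_providers())

# Prefer CUDA when available, falling back to CPU
session = ort.InferenceSession(
    "./model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
```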
## Usage

### 1. Using the model locally after cloning or downloading files
```python
import onnxruntime as ort
from transformers import AutoTokenizer

# Load the tokenizer and ONNX session from the local directory
tokenizer = AutoTokenizer.from_pretrained(".")
session = ort.InferenceSession("./model.onnx")

text = "Example input text"
inputs = tokenizer(text, return_tensors="np")

# session.run returns a list of output arrays; the first entry
# holds the token-level hidden states
outputs = session.run(None, dict(inputs))
embeddings = outputs[0]
print(embeddings)
```
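If you are unsure which inputs the exported graph expects (for example, whether it takes `token_type_ids`), you can introspect the session. This is a generic ONNX Runtime sketch using the `session` created above, not specific to this export:

```python
# Inspect the graph's expected inputs and produced outputs
for inp in session.get_inputs():
    print("input:", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)
```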
### 2. Using the model directly from the Hugging Face Hub (no manual download)
```python
from huggingface_hub import hf_hub_download
import onnxruntime as ort
from transformers import AutoTokenizer

# Load the tokenizer directly from the Hub
tokenizer = AutoTokenizer.from_pretrained("be1newinner/embeddinggemma-300m-onnx")

# Download the ONNX file into the local cache and open a session
onnx_model_path = hf_hub_download(
    repo_id="be1newinner/embeddinggemma-300m-onnx",
    filename="model.onnx"
)
session = ort.InferenceSession(onnx_model_path)

text = "Example input text"
inputs = tokenizer(text, return_tensors="np")

# The first output array holds the token-level hidden states
outputs = session.run(None, dict(inputs))
embeddings = outputs[0]
print(embeddings)
```
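The same session also accepts batched input. A minimal sketch, assuming the tokenizer pads every sequence in the batch to a common length:

```python
texts = ["First example", "A second, slightly longer example"]
batch = tokenizer(texts, return_tensors="np", padding=True, truncation=True)
batch_outputs = session.run(None, dict(batch))
# Shape: (batch_size, sequence_length, hidden_size)
print(batch_outputs[0].shape)
```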
### 3. Generating fixed-size 768-dimensional embeddings
```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download


class GemmaEmbedder:
    """
    A class to generate embeddings using a Gemma model in ONNX format.
    """

    def __init__(self, model_repo="be1newinner/embeddinggemma-300m-onnx"):
        """
        Initializes the GemmaEmbedder by loading the tokenizer and ONNX model.

        Args:
            model_repo (str): The repository ID of the ONNX model on Hugging Face Hub.
        """
        # Load the tokenizer from the Hugging Face Hub
        self.tokenizer = AutoTokenizer.from_pretrained(model_repo)
        # Download and load the ONNX model
        onnx_model_path = hf_hub_download(repo_id=model_repo, filename="model.onnx")
        self.session = ort.InferenceSession(onnx_model_path)

    def generate(self, text: str):
        """
        Generates a fixed-size embedding for the input text.

        Args:
            text (str): The input text to embed.

        Returns:
            np.ndarray: The generated embedding as a NumPy array.
        """
        # Tokenize the input text, padding and truncating to a consistent length
        inputs = self.tokenizer(
            text, return_tensors="np", padding=True, truncation=True
        )
        # Run the ONNX model to get the last hidden states
        outputs = self.session.run(None, dict(inputs))
        # Perform mean pooling over the token dimension to get a fixed-size embedding
        last_hidden_states = outputs[0]
        input_mask_expanded = np.expand_dims(inputs["attention_mask"], -1).astype(float)
        sum_embeddings = np.sum(last_hidden_states * input_mask_expanded, 1)
        sum_mask = np.clip(input_mask_expanded.sum(1), a_min=1e-9, a_max=None)
        pooled_embeddings = sum_embeddings / sum_mask
        return pooled_embeddings


# Create a global instance of the embedder to avoid reloading the model
embedder = GemmaEmbedder()


def generate(text: str):
    """
    A convenience function to generate embeddings using the global embedder instance.

    Args:
        text (str): The input text to embed.

    Returns:
        np.ndarray: The generated embedding.
    """
    return embedder.generate(text)


if __name__ == "__main__":
    # Example usage of the generate function
    embeddings = generate("Example input text")
    print(embeddings)
    print(f"Embedding shape: {embeddings.shape}")
```
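With the pooled embeddings above, semantic similarity scoring reduces to a cosine similarity between vectors. A minimal sketch using the `generate` helper defined in the previous block (the example sentences are illustrative):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# generate() returns shape (1, 768); index [0] to get the vector
emb_a = generate("How do I reset my password?")[0]
emb_b = generate("Steps to recover a forgotten password")[0]
print(f"Similarity: {cosine_similarity(emb_a, emb_b):.4f}")
```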
## Citation
If you use this model in your work, please cite:
- The original Google Gemma embedding model papers (links on original repo).
- This repository and Hugging Face hosting if applicable.
Feel free to contribute improvements or report issues via this repository’s GitHub page.
Last updated: November 2025
Maintainer: be1newinner