embeddinggemma-300m-onnx

Model Overview

embeddinggemma-300m-onnx is an efficient ONNX export of Google's Gemma embedding model, with approximately 300 million parameters.
It generates high-quality semantic embeddings for text, suitable for a wide variety of NLP tasks including sentence similarity, text clustering, classification, and retrieval.

Converting the original Gemma embedding model to ONNX allows hardware-agnostic, optimized inference across CPUs and GPUs using ONNX Runtime.
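
ONNX Runtime selects a backend through its execution providers. A minimal sketch of choosing a provider explicitly (the CUDA provider assumes the onnxruntime-gpu package is installed; otherwise the session falls back to CPU):

import onnxruntime as ort

# Providers available in this onnxruntime build
print(ort.get_available_providers())

# Prefer CUDA when present, fall back to CPU
session = ort.InferenceSession(
    "./model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)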

Original Model Reference

This model is based on Google's Gemma embedding architecture, known for its efficiency and multilingual capabilities.
Refer to the original Hugging Face repository and documentation here:
google/embeddinggemma-300m
Please cite the original papers when using this model in research or production.

Intended Use Cases

  • Semantic similarity scoring (see the cosine-similarity sketch after this list)
  • Embedding generation for search and recommendation systems
  • Text classification and clustering
  • Scalable, low-latency inference setups requiring ONNX-compatible models
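
For the similarity use case, embeddings are usually compared with cosine similarity. A minimal NumPy sketch (the inputs stand in for pooled embedding vectors like those produced in the Usage section below):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # a and b are 1-D embedding vectors; a score near 1.0 means high similarity
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))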

Repository Files

Filename                  Description
model.onnx                ONNX model file containing the network weights and graph
config.json               Model configuration file used by transformers
tokenizer.json            Tokenizer vocabulary and merges
tokenizer_config.json     Tokenizer configuration
special_tokens_map.json   Map of special tokens used during tokenization
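
All of these files can be fetched in one call with huggingface_hub's snapshot_download, which mirrors the repository into the local cache; the returned path can then be used with the local-loading example in the Usage section below. A minimal sketch:

from huggingface_hub import snapshot_download

# Download model.onnx, tokenizer files, and configs in one call
local_dir = snapshot_download(repo_id="be1newinner/embeddinggemma-300m-onnx")
print(local_dir)  # local directory containing the files listed above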

Installation

pip install onnxruntime transformers huggingface_hub
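
A quick sanity check that the packages are importable:

import onnxruntime, transformers, huggingface_hub
print(onnxruntime.__version__, transformers.__version__, huggingface_hub.__version__)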

Usage

1. Using the model locally after cloning or downloading files

import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(".")
session = ort.InferenceSession("./model.onnx")

text = "Example input text"
inputs = tokenizer(text, return_tensors="np")
outputs = session.run(None, dict(inputs))
embeddings = outputs
print(embeddings)
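
Note that session.run returns a list of output arrays. Here outputs[0] holds the last hidden states with shape (batch_size, sequence_length, hidden_size), as example 3 below assumes, so mean pooling is still needed to reduce each input to a single fixed-size vector.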

2. Using the model directly from Hugging Face Hub (no manual download)

from huggingface_hub import hf_hub_download
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("be1newinner/embeddinggemma-300m-onnx")
onnx_model_path = hf_hub_download(
    repo_id="be1newinner/embeddinggemma-300m-onnx",
    filename="model.onnx"
)
session = ort.InferenceSession(onnx_model_path)

text = "Example input text"
inputs = tokenizer(text, return_tensors="np")
outputs = session.run(None, dict(inputs))
embeddings = outputs
print(embeddings)
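
hf_hub_download caches the file locally, so subsequent runs reuse the cached model instead of downloading it again.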

3. Generating fixed-size 768-dimensional embeddings

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

class GemmaEmbedder:
    """
    A class to generate embeddings using a Gemma model in ONNX format.
    """

    def __init__(self, model_repo="be1newinner/embeddinggemma-300m-onnx"):
        """
        Initializes the GemmaEmbedder by loading the tokenizer and ONNX model.

        Args:
            model_repo (str): The repository ID of the ONNX model on Hugging Face Hub.
        """
        # Load the tokenizer from the Hugging Face Hub
        self.tokenizer = AutoTokenizer.from_pretrained(model_repo)

        # Download and load the ONNX model
        onnx_model_path = hf_hub_download(repo_id=model_repo, filename="model.onnx")
        self.session = ort.InferenceSession(onnx_model_path)

    def generate(self, text: str):
        """
        Generates a fixed-size embedding for the input text.

        Args:
            text (str): The input text to embed.

        Returns:
            np.ndarray: The generated embedding as a NumPy array.
        """
        # Tokenize the input text, padding and truncating to a consistent length
        inputs = self.tokenizer(
            text, return_tensors="np", padding=True, truncation=True
        )

        # Run the ONNX model to get the last hidden states
        outputs = self.session.run(None, dict(inputs))

        # Perform mean pooling to get a fixed-size embedding
        last_hidden_states = outputs[0]
        input_mask_expanded = np.expand_dims(inputs["attention_mask"], -1).astype(float)
        sum_embeddings = np.sum(last_hidden_states * input_mask_expanded, 1)
        sum_mask = np.clip(input_mask_expanded.sum(1), a_min=1e-9, a_max=None)

        pooled_embeddings = sum_embeddings / sum_mask
        return pooled_embeddings


# Create a global instance of the embedder to avoid reloading the model
embedder = GemmaEmbedder()


def generate(text: str):
    """
    A convenience function to generate embeddings using the global embedder instance.

    Args:
        text (str): The input text to embed.

    Returns:
        np.ndarray: The generated embedding.
    """
    return embedder.generate(text)


if __name__ == "__main__":
    # Example usage of the generate function
    embeddings = generate("Example input text")
    print(embeddings)
    print(f"Embedding shape: {embeddings.shape}")

Citation

If you use this model in your work, please cite:

  • The original Google Gemma embedding model papers (links in the original repo).
  • This repository and its Hugging Face hosting, where applicable.

Feel free to contribute improvements or report issues via this repository’s GitHub page.


Last updated: November 2025
Maintainer: be1newinner
