gemma-3-1b-it-4bit-lora-dpo-aligned-onnx

This is the ONNX-optimized version of gemma-3-1b-it-4bit-lora-dpo-aligned.

Model Details

  • Opset: 13
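
To double-check the exported opset, the graph's opset_import entries can be read with the onnx package (a minimal sketch; it assumes the onnx library is installed and skips loading external weight files):

import onnx

# Read only the graph structure; do not pull in external data files
model_proto = onnx.load("model.onnx", load_external_data=False)
for opset in model_proto.opset_import:
    print(opset.domain or "ai.onnx", opset.version)  # expected: ai.onnx 13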

Usage

from huggingface_hub import snapshot_download
from transformers import AutoTokenizer
import os

# Download ONNX repo locally
onnx_dir = snapshot_download(repo_id="manu02/gemma-3-1b-it-4bit-lora-dpo-aligned-onnx")

# Path to model.onnx (its external data files, if any, sit alongside it in the same directory)
onnx_path = os.path.join(onnx_dir, "model.onnx")

# Load tokenizer (fallback to base repo if needed)
try:
    tokenizer = AutoTokenizer.from_pretrained(onnx_dir)
except Exception:
    tokenizer = AutoTokenizer.from_pretrained("manu02/gemma-3-1b-it-4bit-lora-dpo-aligned")

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import GenerationConfig

# Load the model with the KV cache disabled (this ONNX export is run without past_key_values)
model = ORTModelForCausalLM.from_pretrained(
    onnx_dir,
    file_name="model.onnx",
    provider="CPUExecutionProvider",
    use_cache=False,
)
model.config.use_cache = False
gen_cfg = GenerationConfig.from_model_config(model.config)
gen_cfg.use_cache = False

inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(
    **inputs,
    generation_config=gen_cfg,
    max_new_tokens=32,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0]))
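
Since the base model is instruction-tuned, the tokenizer may ship a chat template; if it does, prompts can also be formatted with apply_chat_template. A sketch (the message content and token budget are arbitrary):

messages = [{"role": "user", "content": "Write a haiku about autumn."}]
chat_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
chat_out = model.generate(
    chat_ids,
    generation_config=gen_cfg,
    max_new_tokens=64,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(chat_out[0], skip_special_tokens=True))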

# Fallback: pure onnxruntime (a single forward pass; returns raw logits, not generated text)
import onnxruntime as ort
import numpy as np

inputs = tokenizer("Hello, world!", return_tensors="np")
session = ort.InferenceSession(onnx_path)

# Feed only the inputs the graph actually declares (drops e.g. token_type_ids)
expected = {i.name for i in session.get_inputs()}
input_feed = {k: v for k, v in dict(inputs).items() if k in expected}

outputs = session.run(None, input_feed)
print(outputs[0].shape)  # logits: (batch_size, sequence_length, vocab_size)
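
The single session.run above only scores the prompt. A greedy decoding loop can be built on top of the raw session; the sketch below assumes the graph exposes input_ids/attention_mask (and optionally position_ids) and returns logits as its first output:

# Greedy decoding with the raw session (no KV cache, so every step re-runs the full sequence)
expected_inputs = {i.name for i in session.get_inputs()}
ids = inputs["input_ids"]
mask = inputs["attention_mask"]
for _ in range(32):
    feed = {"input_ids": ids, "attention_mask": mask}
    if "position_ids" in expected_inputs:
        feed["position_ids"] = np.cumsum(mask, axis=-1) - 1
    logits = session.run(None, feed)[0]
    next_id = logits[:, -1, :].argmax(-1, keepdims=True)
    if next_id[0, 0] == tokenizer.eos_token_id:
        break
    ids = np.concatenate([ids, next_id], axis=-1)
    mask = np.concatenate([mask, np.ones_like(next_id)], axis=-1)
print(tokenizer.decode(ids[0], skip_special_tokens=True))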

Performance

Running this export through ONNX Runtime is intended to reduce inference latency and memory usage compared to running the PyTorch checkpoint; actual gains depend on hardware, sequence length, and the execution provider.
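
No benchmark figures are published for this export, so a quick latency check on your own hardware is the most reliable guide. A sketch using the ORTModelForCausalLM instance loaded above (time_generate is a hypothetical helper; the prompt, token budget, and run count are arbitrary):

import time

def time_generate(m, prompt, n_tokens=32, runs=3):
    # Average wall-clock time to generate a fixed number of new tokens
    enc = tokenizer(prompt, return_tensors="pt")
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        m.generate(**enc, generation_config=gen_cfg, max_new_tokens=n_tokens,
                   pad_token_id=tokenizer.eos_token_id)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

print(f"avg latency: {time_generate(model, 'Hello, world!'):.2f}s")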
