
Moondream 3 (Preview) 4-Bit

[Image: 4-bit efficiency gains and performance trade-offs]

Moondream 3 (Preview) 4-Bit is the INT4-quantized version of Moondream3-Preview. Quantization reduces the model size from ~18 GB to ~6 GB (a ~66% reduction), letting it run in <12 GB VRAM environments while mostly maintaining quality.

This is a vision-language model with a mixture-of-experts architecture (9B total parameters, 2B active), now optimized for deployment with as little as 8 GB of VRAM.

Features

  • 66% smaller: ~6GB vs ~18GB original
  • Lower memory: Runs on 7GB VRAM (vs 20GB for FP16)
  • Same capabilities: Retains original Moondream3 skills & API
  • Minimal quality loss: ~2-5% degradation on benchmarks
  • HuggingFace compatible: Load with AutoModelForCausalLM.from_pretrained()

VRAM & Time Savings

| Configuration | Model size | VRAM usage | s/query* |
|---|---|---|---|
| FP16 (original) | 18.5 GB | 19,594 MiB | 4.19 |
| INT4 (this one) | 6.18 GB | 7,332 MiB | 2.65 |
| Reduction | 66% | 62% | 37% |

(* averaged over the vision-ai-checkup & CountBenchQA benchmarks on an L40S GPU)
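
As a rough way to sanity-check these numbers on your own hardware, latency and peak allocated VRAM can be measured with standard torch.cuda utilities. The snippet below is an illustrative sketch, not the exact benchmarking harness behind the table; photo.jpg and the questions are placeholders:

import time
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

moondream = AutoModelForCausalLM.from_pretrained(
    "alecccdd/moondream3-preview-4bit",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
moondream.compile()

image = Image.open("photo.jpg")

# Warm-up query so one-time compilation doesn't skew the timing
moondream.query(image=image, question="Describe this image.")

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
result = moondream.query(image=image, question="What's in this image?")
elapsed = time.perf_counter() - start

print(f"s/query: {elapsed:.2f}")
# Peak memory allocated by PyTorch; nvidia-smi totals will be somewhat higher
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 2**20:,.0f} MiB")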

Evaluation Results

| Test | Time (4-bit) | Accuracy (4-bit) | Time (base) | Accuracy (base) |
|---|---|---|---|---|
| vision-ai-checkup | 156 s | 42.8% | 223 s | 47.2% |
| CountBenchQA | 22.9 min | 91.2% | 36.6 min | 93.2% |


Architecture

Quantized Components (INT4):

  • Text attention QKV/projection layers
  • Dense MLP layers (layers 0-3)
  • MoE expert weights (layers 4-23, 64 experts each)
  • Region model encoder/decoder

Preserved in FP16:

  • Vision encoder (SigLIP)
  • MoE routers (critical for expert selection)
  • Temperature (tau) parameters
  • LayerNorms, embeddings, LM head
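
One way to inspect this split yourself is to group the loaded model's parameters and buffers by dtype; the INT4-quantized weights typically show up as packed integer tensors, while the preserved components stay in floating point. A minimal sketch, reusing the moondream instance loaded in the measurement snippet above (exact tensor names and packing details depend on the quantization format):

from collections import Counter

# Tally elements per dtype across parameters and buffers
counts = Counter()
for name, tensor in list(moondream.named_parameters()) + list(moondream.named_buffers()):
    counts[str(tensor.dtype)] += tensor.numel()

for dtype, numel in counts.most_common():
    print(f"{dtype:>15}: {numel / 1e6:10.1f} M elements")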

[Image: moondream3-preview-4bit visualization]

Slow First-Time Compile and Inference

A note on first-time compilation: due to the MoE architecture and the nature of INT4 quantization, I had to do some voodoo to get input-invariant compilation graphs for both execution paths (T=1 and T>1). This results in a longer first-time compilation (1-3 minutes for me) compared to the original Moondream3-preview model (~30 seconds). Torch's end-to-end caching (also known as Mega-Cache) makes subsequent compilations on the same machine much faster, provided it's correctly configured. I'll remove this note once I find a faster solution, if one is possible (contributions always welcome, of course!); until then, caches are your friend :)
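
If you want to carry that compile cache across processes (or pre-populate it on another machine), one option is PyTorch's cache-artifacts API for end-to-end caching. A minimal sketch, assuming a recent PyTorch build that exposes torch.compiler.save_cache_artifacts() and load_cache_artifacts() (the file name is arbitrary):

import torch

# After a fully warmed-up run (model.compile() plus at least one query per
# execution path), serialize the accumulated compile caches to disk ...
artifacts = torch.compiler.save_cache_artifacts()
if artifacts is not None:
    artifact_bytes, cache_info = artifacts
    with open("moondream3_4bit_compile_cache.bin", "wb") as f:
        f.write(artifact_bytes)

# ... and in a later process, load them back before compiling and querying,
# so warm-up is much faster.
with open("moondream3_4bit_compile_cache.bin", "rb") as f:
    torch.compiler.load_cache_artifacts(f.read())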

Quick Start (HuggingFace Style)

The easiest way to use Moondream3-4bit is via the HuggingFace Transformers API:

import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# Load quantized model (same API as original Moondream3-preview)
moondream = AutoModelForCausalLM.from_pretrained(
    "alecccdd/moondream3-preview-4bit",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
moondream.compile()  # Critical for fast inference

# Load an image
image = Image.open("photo.jpg")

# Ask a question
result = moondream.query(image=image, question="What's in this image?")
print(result["answer"])

Alternative: Manual Loading

If you prefer more control, you can load the model directly (run this from the downloaded repository directory so the local config, moondream, and weights modules are importable):

import torch
from PIL import Image
from config import MoondreamConfig
from moondream import MoondreamModel
from weights import load_weights

# Load quantized model
model = MoondreamModel(MoondreamConfig())
load_weights("./", model, device="cuda")
model.compile()  # Critical for fast inference

# Load an image
image = Image.open("photo.jpg")

# Ask a question
result = model.query(image=image, question="What's in this image?")
print(result["answer"])

Skills

The API for all skills remains identical to the original moondream3-preview model.
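
For example, assuming the skill methods from the original moondream3-preview card (caption, detect, point) carry over unchanged, they can be called like this (continuing from the Quick Start snippet above, where moondream and image are already loaded):

# Short caption of the whole image
print(moondream.caption(image, length="short")["caption"])

# Object detection: bounding boxes for a queried object class
print(moondream.detect(image, "face")["objects"])

# Pointing: center points for a queried object class
print(moondream.point(image, "person")["points"])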

License

This is a derivative work of Moondream 3 (Preview) which was originally released under the Business Source License 1.1.

Original Copyright (c) M87 Labs, Inc.

Quantization and conversion code: Copyright (c) 2025 Alicius Schröder
