---
library_name: transformers
pipeline_tag: image-text-to-text
license: other
base_model:
  - moondream/moondream3-preview
---

Moondream 3 (Preview) 4-Bit

(Figure: 4-bit efficiency gains and performance trade-offs)

Moondream 3 (Preview) 4-Bit is the INT4-quantized version of Moondream3-Preview. It reduces the model size from ~18 GB to ~6 GB (a ~66% reduction) and allows it to run in <12 GB VRAM environments while largely maintaining quality.

This is a vision-language model with a mixture-of-experts architecture (9B total parameters, 2B active), now optimized for deployment with as little as 8 GB of VRAM.

Features

  • 66% smaller: ~6 GB vs ~18 GB original
  • Lower memory: Runs on 7 GB VRAM (vs 20 GB for FP16)
  • Same capabilities: Retains original Moondream3 skills & API
  • Minimal quality loss: ~2-5% degradation on benchmarks
  • HuggingFace compatible: Load with AutoModelForCausalLM.from_pretrained()

VRAM & Time Savings

| Configuration   | Model Size | VRAM Usage | s/query* |
|-----------------|------------|------------|----------|
| FP16 (original) | 18.5 GB    | 19,594 MiB | 4.19     |
| INT4 (this one) | 6.18 GB    | 7,332 MiB  | 2.65     |
| Reduction       | 66 %       | 62 %       | 37 %     |

(* averaged over vision-ai-checkup & CountBenchQA benchmarks on L40S GPU)
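
If you want to sanity-check these numbers on your own hardware, here is a minimal sketch, assuming the model has already been loaded and compiled as in the Quick Start below (the image path and questions are placeholders):

import time
import torch
from PIL import Image

image = Image.open("photo.jpg")

# Warm-up query so compilation/caching doesn't skew the timing
moondream.query(image=image, question="What's in this image?")

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
moondream.query(image=image, question="How many people are in this image?")
elapsed = time.perf_counter() - start

print(f"s/query: {elapsed:.2f}")
print(f"peak VRAM (PyTorch allocations): {torch.cuda.max_memory_allocated() / 2**20:,.0f} MiB")

Note that torch.cuda.max_memory_allocated() only tracks PyTorch's own allocations, so it will not exactly match process-level numbers such as those reported by nvidia-smi.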

Evaluation Results

| Test              | Time (4-bit) | Accuracy (4-bit) | Time (base) | Accuracy (base) |
|-------------------|--------------|------------------|-------------|-----------------|
| vision-ai-checkup | 156 s        | 42.8 %           | 223 s       | 47.2 %          |
| CountBenchQA      | 22.9 min     | 91.2 %           | 36.6 min    | 93.2 %          |
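
As a rough sketch of how a counting-style accuracy number like the CountBenchQA row can be reproduced with the query API (the dataset id and the image/question/number column names below are assumptions and may need adjusting to the exact benchmark copy you use):

import re
from datasets import load_dataset

# Dataset id and column names are assumptions -- adjust to your benchmark copy
dataset = load_dataset("vikhyatk/CountBenchQA", split="test")

correct = 0
for row in dataset:
    answer = moondream.query(image=row["image"], question=row["question"])["answer"]
    # Take the first integer in the model's answer and compare it to the label
    match = re.search(r"\d+", answer)
    if match and int(match.group()) == int(row["number"]):
        correct += 1

print(f"accuracy: {correct / len(dataset):.1%}")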

Architecture

Quantized Components (INT4):

  • Text attention QKV/projection layers
  • Dense MLP layers (layers 0-3)
  • MoE expert weights (layers 4-23, 64 experts each)
  • Region model encoder/decoder

Preserved in FP16:

  • Vision encoder (SigLIP)
  • MoE routers (critical for expert selection)
  • Temperature (tau) parameters
  • LayerNorms, embeddings, LM head

(Figure: moondream3-preview-4bit quantization layout visualization)
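
To see which tensors actually ended up quantized and which stayed in higher precision, you can group the checkpoint's tensors by dtype. A minimal sketch, assuming the model was loaded as shown in the Quick Start below (packed INT4 weights typically show up as integer tensors, and may be registered as buffers rather than parameters):

import itertools
from collections import defaultdict

# Count elements per dtype across parameters and buffers
counts = defaultdict(int)
for name, tensor in itertools.chain(moondream.named_parameters(), moondream.named_buffers()):
    counts[tensor.dtype] += tensor.numel()

for dtype, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{str(dtype):>16}: {n / 1e6:,.1f}M elements")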

Slow First-Time Compile and Inference

A note on first-time compilation: due to the MoE architecture and the nature of INT4 quantization, I had to do some voodoo to get input-invariant compilation graphs for both execution paths (T=1 and T>1). This results in a longer first-time compilation (1-3 minutes for me) compared to the original Moondream3-preview model (~30 seconds). Torch's end-to-end caching (also known as Mega-Cache) makes subsequent compilations on the same machine much faster, provided it is configured correctly. I'll remove this note if and when I find a faster solution (contributions are always welcome, of course!); until then, caches are your friend :)
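
If you want those compilation artifacts to persist across runs, a minimal sketch (the cache directory is a placeholder, and the exact environment variables and their defaults depend on your PyTorch version):

import os

# Point the Inductor caches at a persistent location *before* importing torch,
# so later runs on the same machine can reuse the compiled artifacts.
os.environ.setdefault("TORCHINDUCTOR_CACHE_DIR", "/path/to/inductor-cache")
os.environ.setdefault("TORCHINDUCTOR_FX_GRAPH_CACHE", "1")

# ...then load and compile the model exactly as in the Quick Start below.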

Quick Start (HuggingFace Style)

The easiest way to use Moondream3-4bit is via the HuggingFace Transformers API:

import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# Load quantized model (same API as original Moondream3-preview)
moondream = AutoModelForCausalLM.from_pretrained(
    "alecccdd/moondream3-preview-4bit",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
moondream.compile()  # Critical for fast inference

# Load an image
image = Image.open("photo.jpg")

# Ask a question
result = moondream.query(image=image, question="What's in this image?")
print(result["answer"])

Alternative: Manual Loading

If you prefer more control, you can load the model directly:

import torch
from PIL import Image
from config import MoondreamConfig
from moondream import MoondreamModel
from weights import load_weights

# Load quantized model
model = MoondreamModel(MoondreamConfig())
load_weights("./", model, device="cuda")
model.compile()  # Critical for fast inference

# Load an image
image = Image.open("photo.jpg")

# Ask a question
result = model.query(image=image, question="What's in this image?")
print(result["answer"])

Skills

The API for all skills remains identical to that of the original moondream3-preview model.
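
For example, the other skills should work with the same call signatures as upstream, assuming the model and image are loaded as in the Quick Start above (the argument names and result keys below follow the original model's examples; double-check them against the moondream3-preview model card):

# Captioning, object detection and pointing -- same API as the original model
caption = moondream.caption(image, length="normal")
print(caption["caption"])

detections = moondream.detect(image, "person")
print(detections["objects"])  # list of bounding boxes

points = moondream.point(image, "person")
print(points["points"])       # list of x/y coordinates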

License

This is a derivative work of Moondream 3 (Preview), which was originally released under the Business Source License 1.1.

Original Copyright (c) M87 Labs, Inc.

Quantization and conversion code: Copyright (c) 2025 Alicius Schröder