---
library_name: transformers
pipeline_tag: image-text-to-text
license: other
base_model:
  - moondream/moondream3-preview
---

Moondream 3 (Preview) 4-Bit

(Figure: 4-bit efficiency gains and performance trade-offs)

Moondream 3 (Preview) 4-Bit is the INT4-quantized version of Moondream3-Preview. It reduces the model size from ~18 GB to ~6 GB (a ~66% reduction) and allows it to run in <12 GB VRAM environments while largely maintaining quality.

This is a vision-language model with a mixture-of-experts architecture (9B total parameters, 2B active), now optimized for deployment with as little as 8 GB of VRAM.

Features

  • 66% smaller: ~6 GB vs ~18 GB original
  • Lower memory: Runs on 7 GB VRAM (vs 20 GB for FP16)
  • Same capabilities: Retains original Moondream3 skills & API
  • Minimal quality loss: ~2-5% degradation on benchmarks
  • HuggingFace compatible: Load with AutoModelForCausalLM.from_pretrained()

VRAM & Time Savings

| Configuration   | Model Size | VRAM Usage | s/query* |
|-----------------|------------|------------|----------|
| FP16 (original) | 18.5 GB    | 19,594 MiB | 4.19     |
| INT4 (this one) | 6.18 GB    | 7,332 MiB  | 2.65     |
| Reduction       | 66 %       | 62 %       | 37 %     |

(* averaged over vision-ai-checkup & CountBenchQA benchmarks on L40S GPU)
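
If you want to sanity-check these numbers on your own hardware, here is a minimal sketch, assuming the model has already been loaded and compiled as in the Quick Start below (the image path and questions are placeholders):

import time
import torch
from PIL import Image

image = Image.open("photo.jpg")

# Warm-up query so compilation/caching doesn't skew the timing
moondream.query(image=image, question="What's in this image?")

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
moondream.query(image=image, question="How many people are in this image?")
elapsed = time.perf_counter() - start

print(f"s/query: {elapsed:.2f}")
print(f"peak VRAM (PyTorch allocations): {torch.cuda.max_memory_allocated() / 2**20:,.0f} MiB")

Note that torch.cuda.max_memory_allocated() only tracks PyTorch's own allocations, so it will not exactly match process-level numbers such as those reported by nvidia-smi.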

Evaluation Results

| Test              | Time (4-bit) | Accuracy (4-bit) | Time (base) | Accuracy (base) |
|-------------------|--------------|------------------|-------------|-----------------|
| vision-ai-checkup | 156 s        | 42.8 %           | 223 s       | 47.2 %          |
| CountBenchQA      | 22.9 min     | 91.2 %           | 36.6 min    | 93.2 %          |
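
As a rough sketch of how a counting-style accuracy number like the CountBenchQA row can be reproduced with the query API (the dataset id and the image/question/number column names below are assumptions and may need adjusting to the exact benchmark copy you use):

import re
from datasets import load_dataset

# Dataset id and column names are assumptions -- adjust to your benchmark copy
dataset = load_dataset("vikhyatk/CountBenchQA", split="test")

correct = 0
for row in dataset:
    answer = moondream.query(image=row["image"], question=row["question"])["answer"]
    # Take the first integer in the model's answer and compare it to the label
    match = re.search(r"\d+", answer)
    if match and int(match.group()) == int(row["number"]):
        correct += 1

print(f"accuracy: {correct / len(dataset):.1%}")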

Architecture

Quantized Components (INT4):

  • Text attention QKV/projection layers
  • Dense MLP layers (layers 0-3)
  • MoE expert weights (layers 4-23, 64 experts each)
  • Region model encoder/decoder

Preserved in FP16:

  • Vision encoder (SigLIP)
  • MoE routers (critical for expert selection)
  • Temperature (tau) parameters
  • LayerNorms, embeddings, LM head

(Figure: moondream3-preview-4bit quantization layout visualization)
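
To see which tensors actually ended up quantized and which stayed in higher precision, you can group the checkpoint's tensors by dtype. A minimal sketch, assuming the model was loaded as shown in the Quick Start below (packed INT4 weights typically show up as integer tensors, and may be registered as buffers rather than parameters):

import itertools
from collections import defaultdict

# Count elements per dtype across parameters and buffers
counts = defaultdict(int)
for name, tensor in itertools.chain(moondream.named_parameters(), moondream.named_buffers()):
    counts[tensor.dtype] += tensor.numel()

for dtype, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{str(dtype):>16}: {n / 1e6:,.1f}M elements")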

Slow First-Time Compile and Inference

A note on first-time compilation: due to the MoE architecture and the nature of INT4 quantization, I had to do some voodoo to get input-invariant compilation graphs for both execution paths (T=1 and T>1). This results in a longer first-time compilation (1-3 minutes for me) compared to the original Moondream3-preview model (~30 seconds). Torch's end-to-end caching (also known as Mega-Cache) makes subsequent compilations on the same machine much faster, provided it is configured correctly. I'll remove this note if and when I find a faster solution (contributions are always welcome, of course!); until then, caches are your friend :)
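
If you want those compilation artifacts to persist across runs, a minimal sketch (the cache directory is a placeholder, and the exact environment variables and their defaults depend on your PyTorch version):

import os

# Point the Inductor caches at a persistent location *before* importing torch,
# so later runs on the same machine can reuse the compiled artifacts.
os.environ.setdefault("TORCHINDUCTOR_CACHE_DIR", "/path/to/inductor-cache")
os.environ.setdefault("TORCHINDUCTOR_FX_GRAPH_CACHE", "1")

# ...then load and compile the model exactly as in the Quick Start below.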

Quick Start (HuggingFace Style)

The easiest way to use Moondream3-4bit is via the HuggingFace Transformers API:

import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# Load quantized model (same API as original Moondream3-preview)
moondream = AutoModelForCausalLM.from_pretrained(
    "alecccdd/moondream3-preview-4bit",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
moondream.compile()  # Critical for fast inference

# Load an image
image = Image.open("photo.jpg")

# Ask a question
result = moondream.query(image=image, question="What's in this image?")
print(result["answer"])

Alternative: Manual Loading

If you prefer more control, you can load the model directly:

import torch
from PIL import Image
from config import MoondreamConfig
from moondream import MoondreamModel
from weights import load_weights

# Load quantized model
model = MoondreamModel(MoondreamConfig())
load_weights("./", model, device="cuda")
model.compile()  # Critical for fast inference

# Load an image
image = Image.open("photo.jpg")

# Ask a question
result = model.query(image=image, question="What's in this image?")
print(result["answer"])

Skills

The API for all skills remains identical to that of the original moondream3-preview model.
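
For example, the other skills should work with the same call signatures as upstream, assuming the model and image are loaded as in the Quick Start above (the argument names and result keys below follow the original model's examples; double-check them against the moondream3-preview model card):

# Captioning, object detection and pointing -- same API as the original model
caption = moondream.caption(image, length="normal")
print(caption["caption"])

detections = moondream.detect(image, "person")
print(detections["objects"])  # list of bounding boxes

points = moondream.point(image, "person")
print(points["points"])       # list of x/y coordinates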

License

This is a derivative work of Moondream 3 (Preview), which was originally released under the Business Source License 1.1.

Original Copyright (c) M87 Labs, Inc.

Quantization and conversion code: Copyright (c) 2025 Alicius Schröder