---
model-index:
- name: >-
    Granite-4.0-H-Tiny — MLX (Apple Silicon), 3-bit (plus guidance for
    2/4/5/6-bit)
  results: []
license: apache-2.0
language:
- en
tags:
- ibm
- granite
- mlx
- apple-silicon
- mamba2
- transformer
- hybrid
- moe
- long-context
- instruct
- quantized
pipeline_tag: text-generation
library_name: mlx
base_model:
- ibm-granite/granite-4.0-h-tiny
---

# Granite-4.0-H-Tiny — **MLX 3-bit** (Apple Silicon)
**Maintainer / Publisher:** [**Susant Achary**](https://huggingface.co/Susant-Achary)  

This repository provides an **Apple-Silicon-optimized MLX build** of **IBM Granite-4.0-H-Tiny** with **3-bit** weight quantization (plus usage guidance for 2/4/5/6-bit variants if RAM allows).  
Granite 4.0 is IBM’s latest **hybrid Mamba-2/Transformer** family with selective **Mixture-of-Experts (MoE)**, designed for **long-context**, **hyper-efficient** inference and **enterprise** use.

---

## 🔎 What’s Granite 4.0?
- **Architecture.** Hybrid **Mamba-2 + softmax attention**; *H* variants add **MoE** routing (sparse activation). The design aims to keep expressivity while dramatically reducing memory footprint.
- **Efficiency claims.** Up to **~70% lower memory** and **~2× faster** inference vs. comparable models, especially for **multi-session** and **long-context** scenarios.
- **Context window.** **128k** tokens (Tiny/Base preview cards).
- **Licensing.** **Apache-2.0** for public/commercial use.

> This MLX build targets **Granite-4.0-H-Tiny** (≈ **7B total**, ≈ **1B active** parameters). For reference, the family also includes **H-Small (≈32B total / 9B active)** and **Micro/Micro-H (≈3B dense/hybrid)** tiers.

---

## 📦 What’s in this repo (MLX format)
- `config.json` (MLX), `mlx_model*.safetensors` (3-bit shards), tokenizer files, and processor metadata.
- Ready for **macOS** on **M-series** chips via **Metal/MPS**.

> The upstream Hugging Face model cards for Granite 4.0 (Tiny/Small) provide additional training details, staged curricula and alignment workflow. Start here for Tiny: **ibm-granite/granite-4.0-h-tiny**.
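
If you want to pull just these files locally (for offline runs or inspection), a minimal sketch with `huggingface_hub` is shown below; `<this-repo-id>` is a placeholder and the file patterns are assumptions based on the listing above.

```python
# Minimal sketch: download the MLX artifacts from this repo for offline use.
# "<this-repo-id>" is a placeholder; the allow_patterns are assumptions based
# on the files listed above (config, safetensors shards, tokenizer files).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="<this-repo-id>",
    allow_patterns=["*.json", "*.safetensors", "tokenizer*"],
)
print("Model files downloaded to:", local_dir)
```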

---

## ✅ Intended use
- General **instruction-following** and **chat** with **long context** (128k); see the chat sketch below.
- **Enterprise** assistant patterns (function calling, structured outputs) and **RAG** backends that benefit from efficient, large windows.
- **On-device** development on Macs (MLX), low-latency local prototyping and evaluation.
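
For chat-style use, the sketch below applies the tokenizer’s chat template and generates with the `mlx_lm` Python API; `<this-repo-id>` is a placeholder and argument names may differ slightly across `mlx_lm` releases.

```python
# Minimal sketch: chat-style generation via the mlx_lm Python API.
# "<this-repo-id>" is a placeholder for this repository's id.
from mlx_lm import load, generate

model, tokenizer = load("<this-repo-id>")

messages = [
    {"role": "user", "content": "Draft three follow-up questions for a RAG pipeline about MLX."}
]
# Build the prompt with the model's chat template (delegated to the HF tokenizer).
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

print(generate(model, tokenizer, prompt=prompt, max_tokens=200))
```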

## ⚠️ Limitations
- As a quantized, decoder-only LM, it can produce **confident but wrong** outputs—review for critical use.  
- **2–4-bit** quantization may reduce precision on intricate tasks (math/code, tiny-text parsing); prefer higher bit-widths if RAM allows.  
- Follow your organization’s safety/PII/guardrail policies (Granite is “open-weight,” not a full product).

---

## 🧠 Model family at a glance
| Tier | Arch | Params (total / active) | Notes |
|---|---|---:|---|
| **H-Small** | Hybrid + **MoE** | ~32B / 9B | Workhorse for enterprise agent tasks; strong function-calling & instruction following. |
| **H-Tiny** *(this repo)* | Hybrid + **MoE** | ~7B / **1B** | Long-context, efficiency-first; great for local dev. |
| **Micro / H-Micro** | Dense / Hybrid | ~3B | Edge/low-resource alternatives for deployments where the hybrid runtime isn’t optimized. |

**Context Window:** up to **128k** tokens for Tiny/Base preview lines.  
**License:** Apache-2.0.

---

## 🧪 Observed on-device behavior (MLX)
Empirically on M-series Macs:
- **3-bit** often gives **crisp, direct** answers with good latency and modest RAM.  
- **Higher bit-widths** (4/5/6-bit) improve faithfulness on **fine-grained** tasks (tiny OCR, structured parsing), at higher memory cost.

> Performance varies by Mac model, image/token lengths, and temperature; validate on your workload.
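
To validate a given bit-width on your own workload, a rough timing sketch (Python API, with `<this-repo-id>` as a placeholder) could look like this:

```python
# Rough sketch: time a single generation to compare quant levels on your Mac.
# "<this-repo-id>" is a placeholder; swap in the 3/4/5/6-bit build you want to test.
import time
from mlx_lm import load, generate

model, tokenizer = load("<this-repo-id>")
prompt = "Summarize the trade-offs between 3-bit and 6-bit quantization."

start = time.perf_counter()
output = generate(model, tokenizer, prompt=prompt, max_tokens=128)
elapsed = time.perf_counter() - start

print(f"Generated {len(tokenizer.encode(output))} tokens in {elapsed:.1f}s")
```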

---

## 🔢 Choosing a quantization level (Apple Silicon)
| Variant | Typical Peak RAM (7B-class) | Relative speed | Typical behavior | When to choose |
|---|---:|:---:|---|---|
| **2-bit** | ~3–4 GB | 🔥🔥🔥🔥 | Smallest footprint; most lossy | Minimal RAM devices / smoke tests |
| **3-bit** *(this build)* | ~5–6 GB | **🔥🔥🔥🔥** | **Direct, concise**, great latency | **Default** for local dev on M1/M2/M3/M4 |
| **4-bit** | ~6–7.5 GB | 🔥🔥🔥 | Better detail retention | When you need stronger faithfulness |
| **5-bit** | ~8–9 GB | 🔥🔥☆ | Higher fidelity | For heavy docs / structured outputs |
| **6-bit** | ~9.5–11 GB | 🔥🔥 | Max quality under MLX quant | If RAM headroom is ample |

> Figures are indicative for **language-only** Tiny (no vision), and will vary with context length and KV cache size.
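
To produce a different bit-width yourself from the upstream FP16 weights, a minimal sketch using `mlx_lm`’s conversion API is shown below; the argument names reflect common `mlx_lm` releases and may differ in your installed version, so check the converter’s help first.

```python
# Minimal sketch: quantize the upstream Granite weights to a chosen bit-width.
# Argument names (quantize, q_bits, q_group_size) are as found in common
# mlx_lm releases and may differ in your installed version.
from mlx_lm import convert

convert(
    hf_path="ibm-granite/granite-4.0-h-tiny",  # upstream weights
    mlx_path="granite-4.0-h-tiny-mlx-4bit",    # output directory (your choice)
    quantize=True,
    q_bits=4,          # try 2/3/4/5/6 depending on RAM headroom
    q_group_size=64,   # a common default group size
)
```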

---

## 🚀 Quickstart (CLI — MLX)
```bash
# Plain generation (deterministic)
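# Flag names can vary across mlx_lm versions; run "python -m mlx_lm.generate --help" to confirm.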
python -m mlx_lm.generate \
  --model <this-repo-id> \
  --prompt "Summarize the following notes into 5 bullet points:\n<your text>" \
  --max-tokens 200 \
  --temp 0.0 \
  --seed 0