---
model-index:
  - name: >-
      Granite-4.0-H-Tiny — MLX (Apple Silicon), 3-bit (plus guidance for 2/4/5/6-bit)
    results: []
license: apache-2.0
language:
  - en
tags:
  - ibm
  - granite
  - mlx
  - apple-silicon
  - mamba2
  - transformer
  - hybrid
  - moe
  - long-context
  - instruct
  - quantized
pipeline_tag: text-generation
library_name: mlx
base_model:
  - ibm-granite/granite-4.0-h-tiny
---

# Granite-4.0-H-Tiny — **MLX 3-bit** (Apple Silicon)

**Maintainer / Publisher:** [**Susant Achary**](https://huggingface.co/Susant-Achary)

This repository provides an **Apple-Silicon-optimized MLX build** of **IBM Granite-4.0-H-Tiny** with **3-bit** weight quantization (plus usage guidance for 2/4/5/6-bit variants if RAM allows). Granite 4.0 is IBM's latest **hybrid Mamba-2/Transformer** family with selective **Mixture-of-Experts (MoE)**, designed for **long-context**, **hyper-efficient** inference and **enterprise** use.

---

## 🔎 What's Granite 4.0?

- **Architecture.** Hybrid **Mamba-2 + softmax attention**; *H* variants add **MoE** routing (sparse activation). The design aims to keep expressivity while dramatically reducing memory footprint.
- **Efficiency claims.** Up to **~70% lower memory** and **~2× faster** inference vs. comparable models, especially in **multi-session** and **long-context** scenarios.
- **Context window.** **128k** tokens (Tiny/Base preview cards).
- **Licensing.** **Apache-2.0** for public/commercial use.

> This MLX build targets **Granite-4.0-H-Tiny** (≈ **7B total**, ≈ **1B active** parameters). For reference, the family also includes **H-Small (≈32B total / 9B active)** and **Micro/Micro-H (≈3B dense/hybrid)** tiers.

---

## 📦 What's in this repo (MLX format)

- `config.json` (MLX), `mlx_model*.safetensors` (3-bit shards), tokenizer files, and processor metadata.
- Ready for **macOS** on **M-series** chips via **Metal/MPS**.

> The upstream Hugging Face model cards for Granite 4.0 (Tiny/Small) provide additional training details, the staged curricula, and the alignment workflow. Start here for Tiny: **ibm-granite/granite-4.0-h-tiny**.

---

## ✅ Intended use

- General **instruction-following** and **chat** with **long context** (128k).
- **Enterprise** assistant patterns (function calling, structured outputs) and **RAG** backends that benefit from efficient, large windows.
- **On-device** development on Macs (MLX): low-latency local prototyping and evaluation.

## ⚠️ Limitations

- As a quantized, decoder-only LM, it can produce **confident but wrong** outputs—review for critical use.
- **2–4-bit** quantization may reduce precision on intricate tasks (math/code, tiny-text parsing); prefer higher bit-widths if RAM allows.
- Follow your organization's safety/PII/guardrail policies (Granite is "open-weight," not a full product).

---

## 🧠 Model family at a glance

| Tier | Arch | Params (total / active) | Notes |
|---|---|---:|---|
| **H-Small** | Hybrid + **MoE** | ~32B / 9B | Workhorse for enterprise agent tasks; strong function-calling & instruction following. |
| **H-Tiny** *(this repo)* | Hybrid + **MoE** | ~7B / **1B** | Long-context, efficiency-first; great for local dev. |
| **Micro / H-Micro** | Dense / Hybrid | ~3B | Edge/low-resource alternatives; for when the hybrid runtime isn't optimized. |

**Context window:** up to **128k** tokens for the Tiny/Base preview lines.
**License:** Apache-2.0.

---

## 🧪 Observed on-device behavior (MLX)

Empirically on M-series Macs:

- **3-bit** often gives **crisp, direct** answers with good latency and modest RAM.
- **Higher bit-widths** (4/5/6-bit) improve faithfulness on **fine-grained** tasks (tiny OCR, structured parsing), at higher memory cost.

> Performance varies by Mac model, prompt/token lengths, and temperature; validate on your workload.

---

## 🔢 Choosing a quantization level (Apple Silicon)

| Variant | Typical peak RAM (7B-class) | Relative speed | Typical behavior | When to choose |
|---|---:|:---:|---|---|
| **2-bit** | ~3–4 GB | 🔥🔥🔥🔥 | Smallest footprint; most lossy | Minimal-RAM devices / smoke tests |
| **3-bit** *(this build)* | ~5–6 GB | **🔥🔥🔥🔥** | **Direct, concise**, great latency | **Default** for local dev on M1/M2/M3/M4 |
| **4-bit** | ~6–7.5 GB | 🔥🔥🔥 | Better detail retention | When you need stronger faithfulness |
| **5-bit** | ~8–9 GB | 🔥🔥☆ | Higher fidelity | For heavy docs / structured outputs |
| **6-bit** | ~9.5–11 GB | 🔥🔥 | Max quality under MLX quantization | If RAM headroom is ample |

> Figures are indicative for the **language-only** Tiny model (no vision) and will vary with context length and KV-cache size.

---

## 🚀 Quickstart (CLI — MLX)

```bash
# Plain generation (deterministic).
# Replace <this-repo-id-or-local-path> with this repo's Hub id or a local directory.
# Flag names can vary slightly across mlx-lm releases; see `python -m mlx_lm.generate --help`.
# MLX targets the Apple GPU (Metal) automatically, so no device flag is needed.
python -m mlx_lm.generate \
  --model <this-repo-id-or-local-path> \
  --prompt "Summarize the following notes into 5 bullet points:\n" \
  --max-tokens 200 \
  --temp 0.0 \
  --seed 0
```
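If you prefer the Python API over the CLI, `mlx-lm` also exposes `load` and `generate`. Below is a minimal sketch: the model path is a placeholder for this repo's Hub id or a local download directory, and exact keyword arguments can differ slightly across `mlx-lm` releases.

```python
# Minimal Python sketch using the mlx-lm API (pip install mlx-lm).
from mlx_lm import load, generate

MODEL_PATH = "<this-repo-id-or-local-path>"  # placeholder: this repo's Hub id or a local folder

# Load the quantized weights and tokenizer.
model, tokenizer = load(MODEL_PATH)

# Granite 4.0 H-Tiny is an instruct model, so wrap the request in its chat template.
messages = [
    {"role": "user", "content": "Summarize the following notes into 5 bullet points:\n..."}
]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

# Generate up to 200 new tokens and print them as they stream.
text = generate(model, tokenizer, prompt=prompt, max_tokens=200, verbose=True)
print(text)
```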
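The 2/4/5/6-bit variants discussed above can also be produced locally from the upstream FP16 weights with `mlx-lm`'s converter. The sketch below assumes current `mlx-lm` argument names; the output path is illustrative, and the set of bit-widths MLX accepts depends on your MLX version.

```python
# Hypothetical sketch: quantize the upstream Granite weights to another bit-width.
from mlx_lm import convert

convert(
    hf_path="ibm-granite/granite-4.0-h-tiny",      # upstream full-precision weights
    mlx_path="./granite-4.0-h-tiny-4bit-mlx",      # output directory (example name)
    quantize=True,
    q_bits=4,          # pick the bit-width per the table above, RAM permitting
    q_group_size=64,   # quantization group size; smaller groups trade RAM for fidelity
)
```

Once converted, point `--model` (CLI) or `load()` (Python) at the output directory.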