---
model-index:
  - name: >-
      Granite-4.0-H-Tiny — MLX (Apple Silicon), 3-bit (plus guidance for
      2/4/5/6-bit)
    results: []
license: apache-2.0
language:
  - en
tags:
  - ibm
  - granite
  - mlx
  - apple-silicon
  - mamba2
  - transformer
  - hybrid
  - moe
  - long-context
  - instruct
  - quantized
pipeline_tag: text-generation
library_name: mlx
base_model:
  - ibm-granite/granite-4.0-h-tiny
---

# Granite-4.0-H-Tiny — **MLX 3-bit** (Apple Silicon)

**Maintainer / Publisher:** [**Susant Achary**](https://huggingface.co/Susant-Achary)

This repository provides an **Apple-Silicon-optimized MLX build** of **IBM Granite-4.0-H-Tiny** with **3-bit** weight quantization (plus usage guidance for 2/4/5/6-bit variants if RAM allows).
Granite 4.0 is IBM’s latest **hybrid Mamba-2/Transformer** family with selective **Mixture-of-Experts (MoE)**, designed for **long-context**, **hyper-efficient** inference and **enterprise** use.

---

## 🔎 What’s Granite 4.0?
- **Architecture.** Hybrid **Mamba-2 + softmax attention**; *H* variants add **MoE** routing (sparse activation), aiming to keep expressivity while dramatically reducing the memory footprint.
- **Efficiency claims.** Up to **~70% lower memory** and **~2× faster** inference vs. comparable models, especially for **multi-session** and **long-context** scenarios.
- **Context window.** **128k** tokens (Tiny/Base preview cards).
- **Licensing.** **Apache-2.0** for public/commercial use.

> This MLX build targets **Granite-4.0-H-Tiny** (≈ **7B total**, ≈ **1B active** parameters). For reference, the family also includes **H-Small (≈32B total / 9B active)** and **Micro/H-Micro (≈3B dense/hybrid)** tiers.

---

## 📦 What’s in this repo (MLX format)
- `config.json` (MLX), `mlx_model*.safetensors` (3-bit shards), tokenizer files, and processor metadata.
- Ready for **macOS** on **M-series** chips via **Metal/MPS**.

> The upstream Hugging Face model cards for Granite 4.0 (Tiny/Small) provide additional training details, staged curricula, and the alignment workflow. Start here for Tiny: **ibm-granite/granite-4.0-h-tiny**.
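
For a quick smoke test from Python, these shards load directly with the **mlx-lm** package (`pip install mlx-lm`). A minimal sketch; `<this-repo-id>` is a placeholder for this repository’s Hugging Face id:

```python
# Minimal sketch: load the 3-bit MLX shards and run one chat-formatted generation.
from mlx_lm import load, generate

model, tokenizer = load("<this-repo-id>")  # placeholder: substitute this repo's id

messages = [{"role": "user", "content": "Give three facts about Mamba-2."}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

# verbose=True prints the completion and simple timing stats as it runs.
text = generate(model, tokenizer, prompt=prompt, max_tokens=200, verbose=True)
```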

---

## ✅ Intended use
- General **instruction-following** and **chat** with **long context** (128k).
- **Enterprise** assistant patterns (function calling, structured outputs) and **RAG** backends that benefit from efficient, large windows; a tool-calling sketch follows this list.
- **On-device** development on Macs (MLX): low-latency local prototyping and evaluation.
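
The chat template can carry tool schemas for function-calling experiments. A hedged sketch: it assumes the Granite chat template accepts a `tools` list through `apply_chat_template` (as the upstream ibm-granite cards describe), and the `get_weather` schema is purely illustrative:

```python
# Sketch: function-calling prompt built via the tokenizer's chat template.
from mlx_lm import load, generate

model, tokenizer = load("<this-repo-id>")  # placeholder repo id

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Lisbon?"}]
prompt = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, tokenize=False
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=200))
```

The model should emit a structured tool call for your runtime to parse and execute; the exact output format follows the upstream chat template.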

## ⚠️ Limitations
- As a quantized, decoder-only LM, it can produce **confident but wrong** outputs; review them before critical use.
- **2–4-bit** quantization may reduce precision on intricate tasks (math/code, tiny-text parsing); prefer higher bit-widths if RAM allows.
- Follow your organization’s safety/PII/guardrail policies (Granite is “open-weight,” not a full product).

---

## 🧠 Model family at a glance

| Tier | Arch | Params (total / active) | Notes |
|---|---|---:|---|
| **H-Small** | Hybrid + **MoE** | ~32B / 9B | Workhorse for enterprise agent tasks; strong function calling & instruction following. |
| **H-Tiny** *(this repo)* | Hybrid + **MoE** | ~7B / **1B** | Long-context, efficiency-first; great for local dev. |
| **Micro / H-Micro** | Dense / Hybrid | ~3B | Edge/low-resource alternatives for when a hybrid runtime isn’t optimized. |

**Context window:** up to **128k** tokens for the Tiny/Base preview lines.

**License:** Apache-2.0.

---

## 🧪 Observed on-device behavior (MLX)
Empirically, on M-series Macs:
- **3-bit** often gives **crisp, direct** answers with good latency and modest RAM.
- **Higher bit-widths** (4/5/6-bit) improve faithfulness on **fine-grained** tasks (math/code, structured parsing) at a higher memory cost.

> Performance varies by Mac model, prompt/output lengths, and temperature; validate on your workload.

---

## 🔢 Choosing a quantization level (Apple Silicon)

| Variant | Typical peak RAM (7B-class) | Relative speed | Typical behavior | When to choose |
|---|---:|:---:|---|---|
| **2-bit** | ~3–4 GB | 🔥🔥🔥🔥 | Smallest footprint; most lossy | Minimal-RAM devices / smoke tests |
| **3-bit** *(this build)* | ~5–6 GB | **🔥🔥🔥🔥** | **Direct, concise**, great latency | **Default** for local dev on M1/M2/M3/M4 |
| **4-bit** | ~6–7.5 GB | 🔥🔥🔥 | Better detail retention | When you need stronger faithfulness |
| **5-bit** | ~8–9 GB | 🔥🔥☆ | Higher fidelity | Heavy documents / structured outputs |
| **6-bit** | ~9.5–11 GB | 🔥🔥 | Max quality under MLX quantization | If RAM headroom is ample |

> Figures are indicative for the **language-only** Tiny (no vision) and will vary with context length and KV-cache size.
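
If you would rather produce the other bit-widths yourself, mlx-lm’s converter can re-quantize the upstream weights. A sketch, assuming a recent mlx-lm whose `convert` accepts `q_bits`/`q_group_size` (the signature has shifted across releases, so verify against your installed version):

```python
# Sketch: re-quantize upstream Granite-4.0-H-Tiny at a chosen bit-width.
from mlx_lm import convert

convert(
    hf_path="ibm-granite/granite-4.0-h-tiny",  # upstream weights
    mlx_path="granite-4.0-h-tiny-mlx-4bit",    # output directory (any name)
    quantize=True,
    q_bits=4,         # pick from the table above: 2/3/4/5/6
    q_group_size=64,  # mlx-lm's default grouping
)
```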

---

## 🚀 Quickstart (CLI — MLX)
```bash
# Plain generation (deterministic at temperature 0)
# MLX targets Apple's Metal GPU automatically, so no device flag is needed,
# and mlx_lm spells the temperature option --temp.
python -m mlx_lm.generate \
  --model <this-repo-id> \
  --prompt "Summarize the following notes into 5 bullet points:\n<your text>" \
  --max-tokens 200 \
  --temp 0.0 \
  --seed 0
```
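
For interactive use, streaming avoids waiting on the full completion. A sketch using `stream_generate`; the yielded type is an assumption about recent mlx-lm releases (response objects with a `.text` field), where older releases yielded plain strings:

```python
# Sketch: stream tokens as they are generated.
from mlx_lm import load, stream_generate

model, tokenizer = load("<this-repo-id>")  # placeholder repo id

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain KV caching in two sentences."}],
    add_generation_prompt=True,
    tokenize=False,
)

for response in stream_generate(model, tokenizer, prompt, max_tokens=128):
    print(response.text, end="", flush=True)
print()
```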
|