---
model-index:
  - name: >-
      Granite-4.0-H-Tiny — MLX (Apple Silicon), 3-bit (plus guidance for
      2/4/5/6-bit)
    results: []
license: apache-2.0
language:
  - en
tags:
  - ibm
  - granite
  - mlx
  - apple-silicon
  - mamba2
  - transformer
  - hybrid
  - moe
  - long-context
  - instruct
  - quantized
pipeline_tag: text-generation
library_name: mlx
base_model:
  - ibm-granite/granite-4.0-h-tiny
---

# Granite-4.0-H-Tiny — **MLX 3-bit** (Apple Silicon)

**Maintainer / Publisher:** [**Susant Achary**](https://huggingface.co/Susant-Achary)

This repository provides an **Apple-Silicon-optimized MLX build** of **IBM Granite-4.0-H-Tiny** with **3-bit** weight quantization (plus usage guidance for 2/4/5/6-bit variants if RAM allows).
Granite 4.0 is IBM’s latest **hybrid Mamba-2/Transformer** family with selective **Mixture-of-Experts (MoE)**, designed for **long-context**, **hyper-efficient** inference and **enterprise** use.

---

## 🔎 What’s Granite 4.0?
- **Architecture.** Hybrid **Mamba-2 + softmax attention**; *H* variants add **MoE** routing (sparse activation), aiming to keep expressivity while dramatically reducing the memory footprint.
- **Efficiency claims.** Up to **~70% lower memory** and **~2× faster** inference vs. comparable models, especially for **multi-session** and **long-context** scenarios.
- **Context window.** **128k** tokens (Tiny/Base preview cards).
- **Licensing.** **Apache-2.0** for public/commercial use.

> This MLX build targets **Granite-4.0-H-Tiny** (≈ **7B total**, ≈ **1B active** parameters). For reference, the family also includes **H-Small (≈32B total / 9B active)** and **Micro/H-Micro (≈3B dense/hybrid)** tiers.

---

## 📦 What’s in this repo (MLX format)
- `config.json` (MLX), `mlx_model*.safetensors` (3-bit shards), tokenizer files, and processor metadata.
- Ready for **macOS** on **M-series** chips via **Metal/MPS**.

> The upstream Hugging Face model cards for Granite 4.0 (Tiny/Small) provide additional training details, staged curricula, and the alignment workflow. Start here for Tiny: **ibm-granite/granite-4.0-h-tiny**.
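
For a quick smoke test from Python, these shards load directly with the **mlx-lm** package (`pip install mlx-lm`). A minimal sketch; `<this-repo-id>` is a placeholder for this repository’s Hugging Face id:

```python
# Minimal sketch: load the 3-bit MLX shards and run one chat-formatted generation.
from mlx_lm import load, generate

model, tokenizer = load("<this-repo-id>")  # placeholder: substitute this repo's id

messages = [{"role": "user", "content": "Give three facts about Mamba-2."}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

# verbose=True prints the completion and simple timing stats as it runs.
text = generate(model, tokenizer, prompt=prompt, max_tokens=200, verbose=True)
```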

---

## ✅ Intended use
- General **instruction-following** and **chat** with **long context** (128k).
- **Enterprise** assistant patterns (function calling, structured outputs) and **RAG** backends that benefit from efficient, large windows; a tool-calling sketch follows this list.
- **On-device** development on Macs (MLX): low-latency local prototyping and evaluation.
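
The chat template can carry tool schemas for function-calling experiments. A hedged sketch: it assumes the Granite chat template accepts a `tools` list through `apply_chat_template` (as the upstream ibm-granite cards describe), and the `get_weather` schema is purely illustrative:

```python
# Sketch: function-calling prompt built via the tokenizer's chat template.
from mlx_lm import load, generate

model, tokenizer = load("<this-repo-id>")  # placeholder repo id

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Lisbon?"}]
prompt = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, tokenize=False
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=200))
```

The model should emit a structured tool call for your runtime to parse and execute; the exact output format follows the upstream chat template.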

## ⚠️ Limitations
- As a quantized, decoder-only LM, it can produce **confident but wrong** outputs; review them before critical use.
- **2–4-bit** quantization may reduce precision on intricate tasks (math/code, tiny-text parsing); prefer higher bit-widths if RAM allows.
- Follow your organization’s safety/PII/guardrail policies (Granite is “open-weight,” not a full product).

---

## 🧠 Model family at a glance

| Tier | Arch | Params (total / active) | Notes |
|---|---|---:|---|
| **H-Small** | Hybrid + **MoE** | ~32B / 9B | Workhorse for enterprise agent tasks; strong function calling & instruction following. |
| **H-Tiny** *(this repo)* | Hybrid + **MoE** | ~7B / **1B** | Long-context, efficiency-first; great for local dev. |
| **Micro / H-Micro** | Dense / Hybrid | ~3B | Edge/low-resource alternatives for when a hybrid runtime isn’t optimized. |

**Context window:** up to **128k** tokens for the Tiny/Base preview lines.

**License:** Apache-2.0.

---

## 🧪 Observed on-device behavior (MLX)
Empirically, on M-series Macs:
- **3-bit** often gives **crisp, direct** answers with good latency and modest RAM.
- **Higher bit-widths** (4/5/6-bit) improve faithfulness on **fine-grained** tasks (math/code, structured parsing) at a higher memory cost.

> Performance varies by Mac model, prompt/output lengths, and temperature; validate on your workload.

---

## 🔢 Choosing a quantization level (Apple Silicon)

| Variant | Typical peak RAM (7B-class) | Relative speed | Typical behavior | When to choose |
|---|---:|:---:|---|---|
| **2-bit** | ~3–4 GB | 🔥🔥🔥🔥 | Smallest footprint; most lossy | Minimal-RAM devices / smoke tests |
| **3-bit** *(this build)* | ~5–6 GB | **🔥🔥🔥🔥** | **Direct, concise**, great latency | **Default** for local dev on M1/M2/M3/M4 |
| **4-bit** | ~6–7.5 GB | 🔥🔥🔥 | Better detail retention | When you need stronger faithfulness |
| **5-bit** | ~8–9 GB | 🔥🔥☆ | Higher fidelity | Heavy documents / structured outputs |
| **6-bit** | ~9.5–11 GB | 🔥🔥 | Max quality under MLX quantization | If RAM headroom is ample |

> Figures are indicative for the **language-only** Tiny (no vision) and will vary with context length and KV-cache size.
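
If you would rather produce the other bit-widths yourself, mlx-lm’s converter can re-quantize the upstream weights. A sketch, assuming a recent mlx-lm whose `convert` accepts `q_bits`/`q_group_size` (the signature has shifted across releases, so verify against your installed version):

```python
# Sketch: re-quantize upstream Granite-4.0-H-Tiny at a chosen bit-width.
from mlx_lm import convert

convert(
    hf_path="ibm-granite/granite-4.0-h-tiny",  # upstream weights
    mlx_path="granite-4.0-h-tiny-mlx-4bit",    # output directory (any name)
    quantize=True,
    q_bits=4,         # pick from the table above: 2/3/4/5/6
    q_group_size=64,  # mlx-lm's default grouping
)
```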

---

## 🚀 Quickstart (CLI — MLX)
```bash
# Plain generation (deterministic at temperature 0)
# MLX targets Apple's Metal GPU automatically, so no device flag is needed,
# and mlx_lm spells the temperature option --temp.
python -m mlx_lm.generate \
  --model <this-repo-id> \
  --prompt "Summarize the following notes into 5 bullet points:\n<your text>" \
  --max-tokens 200 \
  --temp 0.0 \
  --seed 0
```
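
For interactive use, streaming avoids waiting on the full completion. A sketch using `stream_generate`; the yielded type is an assumption about recent mlx-lm releases (response objects with a `.text` field), where older releases yielded plain strings:

```python
# Sketch: stream tokens as they are generated.
from mlx_lm import load, stream_generate

model, tokenizer = load("<this-repo-id>")  # placeholder repo id

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain KV caching in two sentences."}],
    add_generation_prompt=True,
    tokenize=False,
)

for response in stream_generate(model, tokenizer, prompt, max_tokens=128):
    print(response.text, end="", flush=True)
print()
```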
|