Susant-Achary committed on
Commit 358eac8 · verified · 1 Parent(s): 4c05c40

Update README.md

Files changed (1)
  1. README.md +107 -8
README.md CHANGED
@@ -1,10 +1,109 @@
  ---
- license: apache-2.0
- library_name: mlx
- tags:
- - language
- - granite-4.0
- - mlx
- pipeline_tag: text-generation
- base_model: ibm-granite/granite-4.0-h-tiny
  ---
+ ---
+ model-index:
+ - name: >-
+     Granite-4.0-H-Tiny — MLX (Apple Silicon), 3-bit (plus guidance for
+     2/4/5/6-bit)
+   results: []
+ license: apache-2.0
+ language:
+ - en
+ tags:
+ - ibm
+ - granite
+ - mlx
+ - apple-silicon
+ - mamba2
+ - transformer
+ - hybrid
+ - moe
+ - long-context
+ - instruct
+ - quantized
+ pipeline_tag: text-generation
+ library_name: mlx
+ base_model:
+ - ibm-granite/granite-4.0-h-tiny
+ ---
+
+ # Granite-4.0-H-Tiny — **MLX 3-bit** (Apple Silicon)
+ **Maintainer / Publisher:** [**Susant Achary**](https://huggingface.co/Susant-Achary)
+
+ This repository provides an **Apple-Silicon-optimized MLX build** of **IBM Granite-4.0-H-Tiny** with **3-bit** weight quantization (plus usage guidance for 2/4/5/6-bit variants if RAM allows).
+ Granite 4.0 is IBM’s latest **hybrid Mamba-2/Transformer** family with selective **Mixture-of-Experts (MoE)**, designed for **long-context**, **hyper-efficient** inference and **enterprise** use.
+
  ---
+
+ ## 🔎 What’s Granite 4.0?
+ - **Architecture.** Hybrid **Mamba-2 + softmax attention**; *H* variants add **MoE** routing (sparse activation). Aims to keep expressivity while dramatically reducing memory footprint.
+ - **Efficiency claims.** Up to **~70% lower memory** and **~2× faster** inference vs. comparable models, especially for **multi-session** and **long-context** scenarios.
+ - **Context window.** **128k** tokens (Tiny/Base preview cards).
+ - **Licensing.** **Apache-2.0** for public/commercial use.
+
+ > This MLX build targets **Granite-4.0-H-Tiny** (≈ **7B total**, ≈ **1B active** parameters). For reference, the family also includes **H-Small (≈32B total / 9B active)** and **Micro/Micro-H (≈3B dense/hybrid)** tiers.
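+
+ Those parameter counts also explain the small footprint: only ~1B parameters are active per token, but all ~7B must sit in memory. A rough back-of-the-envelope check (not an official figure; assumes ~7e9 uniformly quantized weights and ignores the KV cache, activations, and the small per-group scales MLX stores):
+
+ ```bash
+ # Rough raw-weight footprint: params * bits / 8 bytes (assumption: ~7e9 params, uniform 3-bit).
+ awk 'BEGIN { params = 7e9; bits = 3;
+              printf "~%.1f GB of raw weights at %d-bit\n", params * bits / 8 / 1e9, bits }'
+ ```
+
+ At 3-bit this lands around ~2.6 GB of weights, which is consistent with the ~5–6 GB peak-RAM figures quoted later once runtime overhead and the KV cache are added.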
+
  ---
+
+ ## 📦 What’s in this repo (MLX format)
+ - `config.json` (MLX), `mlx_model*.safetensors` (3-bit shards), tokenizer files, and processor metadata (see the download sketch below).
+ - Ready for **macOS** on **M-series** chips via **Metal/MPS**.
+
+ > The upstream Hugging Face model cards for Granite 4.0 (Tiny/Small) provide additional training details, staged curricula, and the alignment workflow. Start here for Tiny: **ibm-granite/granite-4.0-h-tiny**.
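+
+ A minimal way to pull the files listed above onto disk (a sketch; `<this-repo-id>` stands for this repository's Hub id and the local directory name is arbitrary):
+
+ ```bash
+ # Download the MLX build locally; mlx-lm can also pull it straight from the Hub by repo id.
+ pip install -U "huggingface_hub[cli]"
+ huggingface-cli download <this-repo-id> --local-dir ./granite-4.0-h-tiny-mlx-3bit
+ ls ./granite-4.0-h-tiny-mlx-3bit    # expect config.json, *.safetensors shards, tokenizer files
+ ```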
+
+ ---
+
+ ## ✅ Intended use
+ - General **instruction-following** and **chat** with **long context** (128k).
+ - **Enterprise** assistant patterns (function calling, structured outputs) and **RAG** backends that benefit from efficient, large windows (a local-server sketch follows this list).
+ - **On-device** development on Macs (MLX): low-latency local prototyping and evaluation.
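+
+ For the server-style patterns above, recent `mlx-lm` releases include a small OpenAI-compatible HTTP server (`mlx_lm.server`). A hedged sketch, with `<this-repo-id>` as a placeholder; flag and endpoint names may shift between versions:
+
+ ```bash
+ # Terminal 1: serve the model locally (OpenAI-compatible API).
+ python -m mlx_lm.server --model <this-repo-id> --port 8080
+
+ # Terminal 2: request a structured (JSON-style) answer.
+ curl -s http://localhost:8080/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{"messages":[{"role":"user","content":"Return a JSON object with keys title and summary for: Granite 4.0 on Apple Silicon."}],"max_tokens":200,"temperature":0}'
+ ```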
+
+ ## ⚠️ Limitations
+ - As a quantized, decoder-only LM, it can produce **confident but wrong** outputs; review them for critical use.
+ - **2–4-bit** quantization may reduce precision on intricate tasks (math, code, fine-grained structured parsing); prefer higher bit-widths if RAM allows.
+ - Follow your organization’s safety/PII/guardrail policies (Granite is “open-weight,” not a full product).
+
+ ---
+
+ ## 🧠 Model family at a glance
+ | Tier | Arch | Params (total / active) | Notes |
+ |---|---|---:|---|
+ | **H-Small** | Hybrid + **MoE** | ~32B / 9B | Workhorse for enterprise agent tasks; strong function-calling & instruction following. |
+ | **H-Tiny** *(this repo)* | Hybrid + **MoE** | ~7B / **1B** | Long-context, efficiency-first; great for local dev. |
+ | **Micro / H-Micro** | Dense / Hybrid | ~3B | Edge/low-resource alternatives for when a hybrid runtime isn’t optimized. |
+
+ **Context Window:** up to **128k** tokens for the Tiny/Base preview lines (a capped-KV-cache sketch for long prompts follows below).
+ **License:** Apache-2.0.
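+
+ For very long prompts, the generate CLI can bound KV-cache growth with a rotating cache. A sketch (assumptions: `long_report.txt` is your own file, `<this-repo-id>` is this repo's Hub id, and `--max-kv-size` is available in recent `mlx-lm` releases; capping the cache trades some long-range fidelity for bounded memory):
+
+ ```bash
+ # Long-document prompt with a capped KV cache.
+ python -m mlx_lm.generate \
+   --model <this-repo-id> \
+   --prompt "List the action items in the following report: $(cat long_report.txt)" \
+   --max-tokens 300 \
+   --max-kv-size 8192
+ ```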
+
+ ---
+
+ ## 🧪 Observed on-device behavior (MLX)
+ Empirically on M-series Macs:
+ - **3-bit** often gives **crisp, direct** answers with good latency and modest RAM.
+ - **Higher bit-widths** (4/5/6-bit) improve faithfulness on **fine-grained** tasks (structured parsing, long-document extraction), at a higher memory cost.
+
+ > Performance varies by Mac model, prompt/output token lengths, and temperature; validate on your workload.
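+
+ A quick way to check throughput on your own workload (a sketch; `sample_task.txt` is a placeholder for a representative prompt, and recent `mlx-lm` releases also print prompt/generation tokens-per-second in their output):
+
+ ```bash
+ # Wall-clock timing for one representative generation.
+ time python -m mlx_lm.generate \
+   --model <this-repo-id> \
+   --prompt "$(cat sample_task.txt)" \
+   --max-tokens 256 \
+   --temp 0.0
+ ```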
+
+ ---
+
+ ## 🔢 Choosing a quantization level (Apple Silicon)
+ | Variant | Typical peak RAM (7B-class) | Relative speed | Typical behavior | When to choose |
+ |---|---:|:---:|---|---|
+ | **2-bit** | ~3–4 GB | 🔥🔥🔥🔥 | Smallest footprint; most lossy | Minimal-RAM devices / smoke tests |
+ | **3-bit** *(this build)* | ~5–6 GB | **🔥🔥🔥🔥** | **Direct, concise**, great latency | **Default** for local dev on M1/M2/M3/M4 |
+ | **4-bit** | ~6–7.5 GB | 🔥🔥🔥 | Better detail retention | When you need stronger faithfulness |
+ | **5-bit** | ~8–9 GB | 🔥🔥☆ | Higher fidelity | For heavy docs / structured outputs |
+ | **6-bit** | ~9.5–11 GB | 🔥🔥 | Max quality under MLX quant | If RAM headroom is ample |
+
+ > Figures are indicative for the **language-only** Tiny model (no vision) and will vary with context length and KV-cache size. A conversion sketch for other bit-widths follows.
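+
+ If you want one of the other bit-widths from the table, you can re-quantize the upstream weights yourself with the `mlx-lm` converter. A sketch (assumptions: flag names as in recent `mlx-lm` releases; the output directory name is arbitrary and group size 64 is a common default):
+
+ ```bash
+ # Re-quantize the base model at 4-bit (swap --q-bits for 2/5/6 as needed).
+ python -m mlx_lm.convert \
+   --hf-path ibm-granite/granite-4.0-h-tiny \
+   --mlx-path ./granite-4.0-h-tiny-mlx-4bit \
+   -q --q-bits 4 --q-group-size 64
+ ```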
+
+ ---
+
+ ## 🚀 Quickstart (CLI — MLX)
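+
+ Install the CLI first (a minimal setup sketch; assumes an Apple-Silicon Mac with a recent Python and the `mlx-lm` package from PyPI):
+
+ ```bash
+ pip install -U mlx-lm
+ ```
+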
+ ```bash
+ # Plain generation (deterministic)
+ # MLX runs on the Apple GPU via Metal automatically, so no device flag is needed.
+ python -m mlx_lm.generate \
+   --model <this-repo-id> \
+   --prompt "Summarize the following notes into 5 bullet points:\n<your text>" \
+   --max-tokens 200 \
+   --temp 0.0 \
+   --seed 0