---
model-index:
- name: >-
Granite-4.0-H-Tiny — MLX (Apple Silicon), 3-bit (plus guidance for
2/4/5/6-bit)
results: []
license: apache-2.0
language:
- en
tags:
- ibm
- granite
- mlx
- apple-silicon
- mamba2
- transformer
- hybrid
- moe
- long-context
- instruct
- quantized
pipeline_tag: text-generation
library_name: mlx
base_model:
- ibm-granite/granite-4.0-h-tiny
---
# Granite-4.0-H-Tiny — **MLX 3-bit** (Apple Silicon)
**Maintainer / Publisher:** [**Susant Achary**](https://huggingface.co/Susant-Achary)
This repository provides an **Apple-Silicon-optimized MLX build** of **IBM Granite-4.0-H-Tiny** with **3-bit** weight quantization (plus usage guidance for 2/4/5/6-bit variants if RAM allows).
Granite 4.0 is IBM’s latest **hybrid Mamba-2/Transformer** family with selective **Mixture-of-Experts (MoE)**, designed for **long-context**, **hyper-efficient** inference and **enterprise** use.
---
## 🔎 What’s Granite 4.0?
- **Architecture.** Hybrid **Mamba-2 + softmax attention**; *H* variants add **MoE** routing (sparse activation). Aims to keep expressivity while dramatically reducing memory footprint.
- **Efficiency claims.** Up to **~70% lower memory** and **~2× faster** inference vs. comparable models, especially for **multi-session** and **long-context** scenarios.
- **Context window.** **128k** tokens (Tiny/Base preview cards).
- **Licensing.** **Apache-2.0** for public/commercial use.
> This MLX build targets **Granite-4.0-H-Tiny** (≈ **7B total**, ≈ **1B active** parameters). For reference, the family also includes **H-Small (≈32B total / 9B active)** and **Micro/Micro-H (≈3B dense/hybrid)** tiers.
---
## 📦 What’s in this repo (MLX format)
- `config.json` (MLX), `mlx_model*.safetensors` (3-bit shards), tokenizer files, and processor metadata.
- Ready for **macOS** on **M-series** chips via **Metal/MPS**.
> The upstream Hugging Face model cards for Granite 4.0 (Tiny/Small) provide additional training details, staged curricula, and alignment workflow. Start here for Tiny: **ibm-granite/granite-4.0-h-tiny**.
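
To pull the shards, tokenizer, and config locally before running, here is a minimal sketch using `huggingface_hub` (the repo id placeholder and local directory are illustrative):

```python
from huggingface_hub import snapshot_download

# Downloads config.json, the quantized safetensors shards, and tokenizer files.
# Replace <this-repo-id> with this repository's id; local_dir is an arbitrary choice.
local_path = snapshot_download(
    repo_id="<this-repo-id>",
    local_dir="granite-4.0-h-tiny-mlx-3bit",
)
print(f"Model files downloaded to: {local_path}")
```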
---
## ✅ Intended use
- General **instruction-following** and **chat** with **long context** (128k); see the Python sketch after this list.
- **Enterprise** assistant patterns (function calling, structured outputs) and **RAG** backends that benefit from efficient, large windows.
- **On-device** development on Macs (MLX), low-latency local prototyping and evaluation.
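
For the chat/instruction-following pattern, here is a minimal sketch with the `mlx_lm` Python API, assuming `mlx-lm` is installed (`pip install mlx-lm`) and that the bundled tokenizer ships a chat template; the prompt content is illustrative:

```python
from mlx_lm import load, generate

# Load the 3-bit MLX checkpoint and its tokenizer (replace the placeholder id).
model, tokenizer = load("<this-repo-id>")

# Format a single chat turn via the tokenizer's chat template.
messages = [
    {"role": "user", "content": "Draft a three-bullet status update from these notes: <your text>"}
]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

# Generate a response; raise max_tokens for longer outputs.
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```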
## ⚠️ Limitations
- As a quantized, decoder-only LM, it can produce **confident but wrong** outputs—review for critical use.
- **2–4-bit** quantization may reduce precision on intricate tasks (math/code, tiny-text parsing); prefer higher bit-widths if RAM allows.
- Follow your organization’s safety/PII/guardrail policies (Granite is “open-weight,” not a full product).
---
## 🧠 Model family at a glance
| Tier | Arch | Params (total / active) | Notes |
|---|---|---:|---|
| **H-Small** | Hybrid + **MoE** | ~32B / 9B | Workhorse for enterprise agent tasks; strong function-calling & instruction following. |
| **H-Tiny** *(this repo)* | Hybrid + **MoE** | ~7B / **1B** | Long-context, efficiency-first; great for local dev. |
| **Micro / H-Micro** | Dense / Hybrid | ~3B | Edge/low-resource alternatives for setups where the hybrid runtime isn’t optimized. |
**Context Window:** up to **128k** tokens for the Tiny/Base preview lines.
**License:** Apache-2.0.
---
## 🧪 Observed on-device behavior (MLX)
Empirically on M-series Macs:
- **3-bit** often gives **crisp, direct** answers with good latency and modest RAM.
- **Higher bit-widths** (4/5/6-bit) improve faithfulness on **fine-grained** tasks (structured parsing, exact extraction), at higher memory cost.
> Performance varies by Mac model, prompt/output length, and temperature; validate on your workload.
---
## 🔢 Choosing a quantization level (Apple Silicon)
| Variant | Typical Peak RAM (7B-class) | Relative speed | Typical behavior | When to choose |
|---|---:|:---:|---|---|
| **2-bit** | ~3–4 GB | 🔥🔥🔥🔥 | Smallest footprint; most lossy | Minimal RAM devices / smoke tests |
| **3-bit** *(this build)* | ~5–6 GB | **🔥🔥🔥🔥** | **Direct, concise**, great latency | **Default** for local dev on M1/M2/M3/M4 |
| **4-bit** | ~6–7.5 GB | 🔥🔥🔥 | Better detail retention | When you need stronger faithfulness |
| **5-bit** | ~8–9 GB | 🔥🔥☆ | Higher fidelity | For heavy docs / structured outputs |
| **6-bit** | ~9.5–11 GB | 🔥🔥 | Max quality under MLX quant | If RAM headroom is ample |
> Figures are indicative for **language-only** Tiny (no vision), and will vary with context length and KV cache size.
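
To produce one of the other bit-widths yourself, here is a sketch using `mlx_lm`'s conversion utility (parameter names follow the `mlx_lm.convert` API; the output path is illustrative):

```python
from mlx_lm import convert

# Quantize the upstream weights to 4-bit; set q_bits to 2/5/6 for the other tiers.
convert(
    hf_path="ibm-granite/granite-4.0-h-tiny",
    mlx_path="granite-4.0-h-tiny-mlx-4bit",
    quantize=True,
    q_bits=4,
    q_group_size=64,  # mlx_lm's default group size
)
```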
---
## 🚀 Quickstart (CLI — MLX)
```bash
# Plain generation (deterministic)
# MLX uses the Metal GPU automatically; no device flag is needed.
python -m mlx_lm.generate \
  --model <this-repo-id> \
  --prompt "Summarize the following notes into 5 bullet points:\n<your text>" \
  --max-tokens 200 \
  --temp 0.0 \
  --seed 0
```
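
The same call can be driven from Python; a minimal sketch mirroring the CLI above (sampling is greedy by default, and temperature handling differs across `mlx-lm` versions, so it is omitted here):

```python
from mlx_lm import load, generate

# Load model + tokenizer from this repo (downloads on first use).
model, tokenizer = load("<this-repo-id>")

prompt = "Summarize the following notes into 5 bullet points:\n<your text>"

# verbose=True prints the generated text and basic stats as it runs.
text = generate(model, tokenizer, prompt=prompt, max_tokens=200, verbose=True)
```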