---
base_model: MiniMaxAI/MiniMax-M2.5
library_name: mlx
tags:
- mlx
- quantized
- 3bit
- minimax_m2
- text-generation
- conversational
- apple-silicon
license: other
license_name: modified-mit
license_link: https://huggingface.co/MiniMaxAI/MiniMax-M2.5/blob/main/LICENSE
pipeline_tag: text-generation
---

# MiniMax-M2.5 3-bit MLX

This is a 3-bit quantized [MLX](https://github.com/ml-explore/mlx) version of [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5), converted using [mlx-lm](https://github.com/ml-explore/mlx-lm) v0.30.7.

MiniMax-M2.5 is a 229B-parameter Mixture-of-Experts model (10B active parameters) that achieves 80.2% on SWE-Bench Verified and is state-of-the-art in coding, agentic tool use, and search tasks.

## Important: Quality Note

**This is an aggressive quantization.** Independent testing by [inferencerlabs](https://huggingface.co/inferencerlabs/MiniMax-M2.5-MLX-9bit) shows significant quality degradation below 4 bits for this model (q3.5 scored 43% token accuracy vs 91%+ at q4.5). This 3-bit quant was manually tested on coding and reasoning tasks and produced coherent output, but expect noticeable quality loss compared to 4-bit and above.

**If you have 256GB+ of unified memory, use the [4-bit quant](https://huggingface.co/mlx-community/MiniMax-M2.5-4bit) instead.** This 3-bit version is primarily useful for machines with 192GB of unified memory, where the 4-bit version won't fit.

## Requirements

- Apple Silicon Mac (M2 Ultra or later)
- At least 192GB of unified memory

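If you hit Metal out-of-memory errors (for example with very long prompts), one common workaround on Apple Silicon Macs (a general MLX tip, not something from the original card; verify it applies to your macOS version) is to temporarily raise the GPU wired-memory limit so more of the unified memory is available for inference:

```bash
# Allow the GPU to wire up to ~170 GB on a 192 GB machine (resets on reboot).
sudo sysctl iogpu.wired_limit_mb=170000
```
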
## Quick Start

Install mlx-lm:

```bash
pip install -U mlx-lm
```

### CLI

```bash
mlx_lm.generate \
  --model ahoybrotherbear/MiniMax-M2.5-3bit-MLX \
  --prompt "Hello, how are you?" \
  --max-tokens 256 \
  --temp 0.7
```

### Python

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("ahoybrotherbear/MiniMax-M2.5-3bit-MLX")

messages = [{"role": "user", "content": "Hello, how are you?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Recent mlx-lm releases take sampling parameters via a sampler object
# rather than a bare `temp` keyword on generate().
response = generate(
    model, tokenizer,
    prompt=prompt,
    max_tokens=256,
    sampler=make_sampler(temp=0.7),
    verbose=True,
)
print(response)
```

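### Server (optional)

mlx-lm also ships an OpenAI-compatible HTTP server, which can be a convenient way to reach this model from other tools. A minimal sketch using the standard `mlx_lm.server` options (nothing here is specific to this repository):

```bash
mlx_lm.server \
  --model ahoybrotherbear/MiniMax-M2.5-3bit-MLX \
  --port 8080
```

Once it is running, clients can send chat requests to `http://localhost:8080/v1/chat/completions`.
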
## Conversion Details

- **Source model**: [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) (FP8)
- **Converted with**: mlx-lm v0.30.7
- **Quantization**: 3-bit (3.501 average bits per weight)
- **Original parameters**: 229B total / 10B active (MoE)
- **Peak memory during inference**: ~100GB
- **Generation speed**: ~54 tokens/sec on M3 Ultra

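A similar quantization can be reproduced with mlx-lm's standard conversion entry point. The exact invocation used for this upload is not recorded here, but 3-bit weights plus fp16 group scales and biases at the default group size of 64 work out to roughly 3.5 bits per weight, which matches the figure above. A sketch:

```bash
# Quantize the source checkpoint to 3-bit MLX weights.
mlx_lm.convert \
  --hf-path MiniMaxAI/MiniMax-M2.5 \
  --mlx-path MiniMax-M2.5-3bit-MLX \
  -q --q-bits 3 --q-group-size 64
```
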
## Original Model

MiniMax-M2.5 was created by [MiniMaxAI](https://huggingface.co/MiniMaxAI). See the [original model card](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) for full details on capabilities, benchmarks, and license terms.