---
base_model: MiniMaxAI/MiniMax-M2.5
library_name: mlx
tags:
- mlx
- quantized
- 3bit
- minimax_m2
- text-generation
- conversational
- apple-silicon
license: other
license_name: modified-mit
license_link: https://huggingface.co/MiniMaxAI/MiniMax-M2.5/blob/main/LICENSE
pipeline_tag: text-generation
---
# MiniMax-M2.5 3-bit MLX
**⚠️ UPLOAD IN PROGRESS: model files are still uploading and the repo is not yet ready for use.**
This is a 3-bit quantized [MLX](https://github.com/ml-explore/mlx) version of [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5), converted using [mlx-lm](https://github.com/ml-explore/mlx-lm) v0.30.7.
MiniMax-M2.5 is a 229B-parameter Mixture-of-Experts model (10B active parameters) that scores 80.2% on SWE-Bench Verified and achieves state-of-the-art results in coding, agentic tool use, and search tasks.
## Important: Quality Note
**This is an aggressive quantization.** Independent testing by [inferencerlabs](https://huggingface.co/inferencerlabs/MiniMax-M2.5-MLX-9bit) shows significant quality degradation below 4 bits for this model (q3.5 scored 43% token accuracy vs 91%+ at q4.5). This 3-bit quant was manually tested on coding and reasoning tasks and produced coherent output, but expect noticeable quality loss compared to 4-bit and above.
**If you have 256GB+ of RAM, use the [4-bit quant](https://huggingface.co/mlx-community/MiniMax-M2.5-4bit) instead.** This 3-bit version is primarily useful for machines with 192GB of unified memory where the 4-bit version won't fit.
## Requirements
- Apple Silicon Mac (M2 Ultra or later)
- At least 192GB of unified memory (a quick check is sketched below)
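
To confirm your machine has enough unified memory before downloading, you can query MLX directly. This is a minimal sketch, assuming the `mx.metal.device_info()` helper and its `memory_size` field available in recent MLX releases:

```python
import mlx.core as mx

# Report total unified memory as seen by MLX (field names may differ across MLX versions).
info = mx.metal.device_info()
total_gib = info["memory_size"] / 1024**3
print(f"Unified memory: {total_gib:.0f} GiB")

if total_gib < 192:
    print("Warning: this 3-bit quant expects roughly 192 GiB of unified memory.")
```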
## Quick Start
Install mlx-lm:
```bash
pip install -U mlx-lm
```
### CLI
```bash
mlx_lm.generate \
  --model ahoybrotherbear/MiniMax-M2.5-3bit-MLX \
  --prompt "Hello, how are you?" \
  --max-tokens 256 \
  --temp 0.7
```
### Python
```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("ahoybrotherbear/MiniMax-M2.5-3bit-MLX")

messages = [{"role": "user", "content": "Hello, how are you?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Recent mlx-lm releases take sampling settings via a sampler rather than a `temp` kwarg.
response = generate(
    model, tokenizer,
    prompt=prompt,
    max_tokens=256,
    sampler=make_sampler(temp=0.7),
    verbose=True,
)
print(response)
```
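
For longer generations it is often nicer to stream tokens as they are produced instead of waiting for the full response. A minimal sketch using `stream_generate`, assuming it yields per-token responses with a `.text` field as in recent mlx-lm releases:

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("ahoybrotherbear/MiniMax-M2.5-3bit-MLX")

messages = [{"role": "user", "content": "Write a haiku about unified memory."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Print each chunk of text as soon as it is generated.
for response in stream_generate(model, tokenizer, prompt=prompt, max_tokens=256):
    print(response.text, end="", flush=True)
print()
```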
## Conversion Details
- **Source model**: [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) (FP8)
- **Converted with**: mlx-lm v0.30.7 (an approximate reproduction sketch follows this list)
- **Quantization**: 3-bit (3.501 average bits per weight)
- **Original parameters**: 229B total / 10B active (MoE)
- **Peak memory during inference**: ~100GB
- **Generation speed**: ~54 tokens/sec on M3 Ultra
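
The exact conversion invocation is not included in this card, but a quantization like this one can typically be reproduced with mlx-lm's converter. This is a minimal sketch, assuming the Python `convert` API with its `quantize` and `q_bits` options; the settings actually used here (e.g. group size or per-layer bit widths) may differ:

```python
from mlx_lm import convert

# Quantize the FP8 source weights to a 3-bit MLX checkpoint
# (group size left at the mlx-lm default).
convert(
    "MiniMaxAI/MiniMax-M2.5",
    mlx_path="MiniMax-M2.5-3bit-MLX",
    quantize=True,
    q_bits=3,
)
```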
## Original Model
MiniMax-M2.5 was created by [MiniMaxAI](https://huggingface.co/MiniMaxAI). See the [original model card](https://huggingface.co/MiniMaxAI/MiniMax-M2.5) for full details on capabilities, benchmarks, and license terms.