MiniMax-M2-GPTQ-Int4
This repository contains a 4-bit quantized version of the MiniMax-M2 model.
Quantization Details
The quantization was performed using GPTQModel with an experimental modification that feeds the entire calibration dataset to every expert, improving quantization quality for this MoE model.
Calibration Dataset: 1536 samples in total: c4/en (1024), arc (164), gsm8k (164), humaneval (164), and alpaca (20).
Hardware & Performance: This model is verified to run with Tensor Parallel (TP) on 8x NVIDIA RTX 3090 GPUs with a context window of 192,500 tokens.
Quick Start
To serve the model with vLLM, use the following branch, which includes fixes for loading this model: https://github.com/avtc/vllm/tree/feature/fix-gptq-m2-load-gemini
Sample run command (8x 3090):
export VLLM_ATTENTION_BACKEND="FLASHINFER"
export TORCH_CUDA_ARCH_LIST="8.6"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export VLLM_MARLIN_USE_ATOMIC_ADD=1
export SAFETENSORS_FAST_GPU=1
vllm serve avtc/MiniMax-M2-GPTQMODEL-W4A16 \
-tp 8 \
--port 8000 \
--host 0.0.0.0 \
--uvicorn-log-level info \
--trust-remote-code \
--gpu-memory-utilization 0.925 \
--max-num-seqs 1 \
--dtype=float16 \
--seed 1234 \
--max-model-len 192500 \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--enable-auto-tool-choice \
--enable-sleep-mode \
--compilation-config '{"level": 3, "cudagraph_capture_sizes": [1], "cudagraph_mode": "PIECEWISE"}'
Recommended Sampling Parameters:
{
"top_p": 0.95,
"temperature": 1.0,
"repetition_penalty": 1.05,
"top_k": 40,
"min_p": 0.0
}
For some tasks, a temperature of 0.6 works better.
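As a usage sketch, the parameters above can be passed through the OpenAI-compatible API that vLLM serves; top_k, min_p, and repetition_penalty are vLLM extensions and travel via extra_body. The prompt here is only an example.

# Query the server started above with the recommended sampling parameters.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="avtc/MiniMax-M2-GPTQMODEL-W4A16",
    messages=[{"role": "user", "content": "Write a haiku about aquariums."}],
    temperature=1.0,   # try 0.6 for tasks where it works better
    top_p=0.95,
    # vLLM-specific sampling knobs go through extra_body:
    extra_body={"top_k": 40, "min_p": 0.0, "repetition_penalty": 1.05},
)
print(response.choices[0].message.content)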
Example Output
Prompt:
Make an html animation of fishes in an aquarium. The aquarium is pretty, the fishes vary in colors and sizes and swim realistically. You can left click to place a piece of fish food in aquarium. Each fish chases a food piece closest to it, trying to eat it. Once there are no more food pieces, fishes resume swimming as usual.
Result: The model generated a working artifact using Kilo Code in Code mode. View the Result on JSFiddle
Acknowledgments
Special thanks to the GPTQModel team for the quantization tools and support.
✨ Original Model Highlights
Meet MiniMax-M2
Today, we release and open source MiniMax-M2, a Mini model built for Max coding & agentic workflows.
MiniMax-M2 redefines efficiency for agents. It's a compact, fast, and cost-effective MoE model (230 billion total parameters with 10 billion active parameters) built for elite performance in coding and agentic tasks, all while maintaining powerful general intelligence. With just 10 billion activated parameters, MiniMax-M2 provides the sophisticated, end-to-end tool use performance expected from today's leading models, but in a streamlined form factor that makes deployment and scaling easier than ever.
Highlights
Superior Intelligence. According to benchmarks from Artificial Analysis, MiniMax-M2 demonstrates highly competitive general intelligence across mathematics, science, instruction following, coding, and agentic tool use. Its composite score ranks #1 among open-source models globally.
Advanced Coding. Engineered for end-to-end developer workflows, MiniMax-M2 excels at multi-file edits, code-run-fix loops, and test-validated repairs. Strong performance on Terminal-Bench and (Multi-)SWE-Bench–style tasks demonstrates practical effectiveness in terminals, IDEs, and CI across languages.
Agent Performance. MiniMax-M2 plans and executes complex, long-horizon toolchains across shell, browser, retrieval, and code runners. In BrowseComp-style evaluations, it consistently locates hard-to-surface sources, keeps evidence traceable, and gracefully recovers from flaky steps.
Efficient Design. With 10 billion activated parameters (230 billion in total), MiniMax-M2 delivers lower latency, lower cost, and higher throughput for interactive agents and batched sampling—perfectly aligned with the shift toward highly deployable models that still shine on coding and agentic tasks.
Why activation size matters
By maintaining activations around 10B, the plan → act → verify loop in the agentic workflow is streamlined, improving responsiveness and reducing compute overhead:
Faster feedback cycles in compile-run-test and browse-retrieve-cite chains.
More concurrent runs on the same budget for regression suites and multi-seed explorations.
Simpler capacity planning with smaller per-request memory and steadier tail latency.
In short: 10B activations = responsive agent loops + better unit economics.
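As a rough illustration, assuming the common approximation of about 2 FLOPs per active parameter per decoded token, the gap between activated and total parameters maps directly onto per-token compute:

# Back-of-envelope decode cost (assumption: ~2 FLOPs per active param/token).
active_params = 10e9    # activated parameters per token
total_params = 230e9    # total parameters
print(f"MoE decode:        ~{2 * active_params / 1e9:.0f} GFLOPs/token")  # ~20
print(f"Dense-230B decode: ~{2 * total_params / 1e9:.0f} GFLOPs/token")   # ~460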
At a glance
If you need frontier-style coding and agents without frontier-scale costs, MiniMax-M2 hits the sweet spot: fast inference speeds, robust tool-use capabilities, and a deployment-friendly footprint.
We look forward to your feedback and to collaborating with developers and researchers to bring the future of intelligent collaboration one step closer.
Tool Calling Guide
Please refer to our Tool Calling Guide.
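As a quick sketch, once the server is launched with --tool-call-parser minimax_m2 and --enable-auto-tool-choice (as in the command above), tool calls flow through the standard OpenAI tools interface. The get_weather tool below is hypothetical.

# Hypothetical tool definition to exercise auto tool choice.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative tool, not part of the model
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="avtc/MiniMax-M2-GPTQMODEL-W4A16",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)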
Contact Us
Contact us at model@minimax.io | WeChat.