MiniMax-M2.5-Hybrid-AWQ-W4A16G128-Attn-fp8_e4m3-KV-fp8_e4m3
A hybrid quantization of MiniMaxAI/MiniMax-M2.5 (~229B parameters, 256 experts per layer): AWQ int4 experts, fp8 attention, and an fp8 KV cache. It fits on 4x RTX A6000 (192 GB total, Ampere) with ~370,000 tokens of KV cache (more than double the bf16 capacity). This currently requires vLLM patches to run!
What makes this special
This is not a straightforward quantization; it is tuned for extreme VRAM efficiency:
- Expert MLP layers (w1/w2/w3 — 224.7B params, 98.3% of the model) are quantized to AWQ int4 with group_size=128. These are the bulk of the parameters and benefit most from compression.
- Attention layers (q/k/v/o_proj — 2.7B params) are kept in their original fp8_e4m3fn with per-block scales (128x128 blocks). These are quality-sensitive and small enough that keeping higher precision is worthwhile. vLLM requires a patch to accept this.
- KV cache uses fp8_e4m3 with per-layer calibrated scales derived during the AWQ calibration pass. This halves the per-token KV cache footprint compared to bf16; on our 4x A6000 deployment it increased KV cache capacity from ~160,000 to ~370,000 tokens. vLLM auto-detects the fp8 KV cache from `kv_cache_scheme` in the checkpoint config — no `--kv-cache-dtype` flag is needed. A vLLM patch is required to support this (see below).
- Embeddings, LM head, norms, and MoE gates stay in their original bf16/fp32 precision. These are tiny but highly sensitive to quantization.
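For intuition on the capacity numbers above, here is a back-of-envelope KV sizing sketch. The 62 layers and the 1024-wide k_proj/v_proj outputs come from the architecture dump below; treating each cached element as exactly 1 byte at fp8 (2 bytes at bf16) is a simplification that ignores scales and allocator overhead.

```python
# Per-token KV cache footprint, assuming 62 decoder layers and K/V projections
# of width 1024 per token (see the architecture section below).
layers, kv_width = 62, 1024
per_token_fp8  = 2 * layers * kv_width * 1   # K + V, 1 byte per element
per_token_bf16 = 2 * layers * kv_width * 2   # K + V, 2 bytes per element

print(f"fp8 : {per_token_fp8 / 1024:.0f} KiB/token")                    # ~124 KiB
print(f"bf16: {per_token_bf16 / 1024:.0f} KiB/token")                   # ~248 KiB
print(f"370k tokens at fp8 ~= {370_000 * per_token_fp8 / 1e9:.0f} GB")  # ~47 GB
```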
AWQ calibration
The AWQ calibration was run with these settings:
- 128 calibration samples from a curated dataset (40 code, 40 multilingual, 24 reasoning, 24 diverse)
- Sequence length 1024
- n_grid=20 (AWQ grid search resolution)
- Used this fork of llm-compressor with MiniMax-M2 support, plus custom patches.
Total quantization time was ~14 hours on a single A6000 GPU.
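For orientation, below is a minimal sketch of what such a calibration call looks like with upstream llm-compressor's `oneshot` API. It is not the script used here: the actual run used the MiniMax-M2 fork mentioned above, the curated 128-sample mix rather than the placeholder dataset shown, and fork-specific handling of the ignore list, `n_grid`, and KV cache scale calibration.

```python
# Sketch only, following upstream llm-compressor AWQ examples. Dataset, model
# path, and scheme string are placeholders/assumptions, not the actual recipe.
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

# Placeholder calibration data; the real run used a curated 128-sample mix
# (40 code, 40 multilingual, 24 reasoning, 24 diverse).
ds = load_dataset("mit-han-lab/pile-val-backup", split="validation[:128]")

recipe = [
    AWQModifier(
        targets=["Linear"],
        ignore=["lm_head"],   # the real run also kept attention and MoE gates out of AWQ
        scheme="W4A16",       # int4 weights, group_size=128
    ),
]

oneshot(
    model="path/to/MiniMax-M2.5-bf16",   # Phase A output: fp8 dequantized to bf16
    dataset=ds,
    recipe=recipe,
    max_seq_length=1024,
    num_calibration_samples=128,
)
```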
Per-component breakdown
| Component | Params | Dtype |
|---|---|---|
| Expert w1/w2/w3 | 224.7B (98.3%) | AWQ int4, group_size=128 |
| Attention q/k/v/o_proj | 2.7B (1.2%) | fp8_e4m3fn, block scales [128,128] |
| Embeddings (embed_tokens) | 0.6B | bf16 |
| LM head | 0.6B | bf16 |
| MoE gates | 0.05B | fp32 |
| Norms, biases | <0.01B | bf16/fp32 |
| KV cache | runtime | fp8_e4m3, calibrated scales |
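As a rough cross-check of why this fits in 192 GB, the table above implies a weight footprint of roughly 120 GB. The group-scale overhead figure below (one fp16 scale per 128 int4 weights) is an assumption, not a measurement.

```python
# Back-of-envelope weight memory from the per-component table above.
GB = 1e9

experts   = 224.7e9 * 0.5        # int4: 0.5 bytes per weight
scales    = 224.7e9 / 128 * 2    # assumed: one fp16 scale per group of 128
attention = 2.7e9 * 1.0          # fp8: 1 byte per weight (block scales are tiny)
bf16_rest = (0.6e9 + 0.6e9) * 2  # embeddings + lm_head; gates/norms negligible

total = experts + scales + attention + bf16_rest
print(f"~{total / GB:.0f} GB of weights")   # ~121 GB on a 192 GB pool
# The remaining ~70 GB (minus activations and overheads) holds the fp8 KV
# cache, consistent with the reported ~370,000-token capacity (~47 GB).
```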
Software requirements
This checkpoint uses features that are not yet in stable vllm releases. You need:
| Package | Minimum version | Notes |
|---|---|---|
| vllm | 0.16.0rc2.dev207 or newer nightly | Block fp8 support (PR #33280). Install from `--extra-index-url https://wheels.vllm.ai/nightly` |
| compressed-tensors | 0.13.1a20260212 | Alpha release, ships with the vllm nightly |
| torch | 2.10.0+cu128 | Pulled in by the vllm nightly |
| flashinfer | 0.6.3 | Attention backend for Ampere GPUs |
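After installing from the nightly wheel index, a quick sanity check against the table is to print the installed versions. The modules below are the importable package names; `flashinfer.__version__` following the usual convention is an assumption.

```python
# Print installed versions to compare against the minimums above.
import vllm, compressed_tensors, torch, flashinfer

print("vllm              :", vllm.__version__)
print("compressed-tensors:", compressed_tensors.__version__)
print("torch             :", torch.__version__)
print("flashinfer        :", flashinfer.__version__)
```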
Required patches (2 bugs in vllm as of 0.16.0rc2.dev250)
These are bugs in vllm's compressed-tensors integration that affect mixed int4+fp8 checkpoints with calibrated KV cache scales. Both must be applied to the vllm installation before serving this checkpoint.
Patch 1: Block-quantized fp8 weight loading crash
File: vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a16_fp8.py
The `is_static_input_scheme` field is `None` (not `False`) when no input quantization is configured. Since `None is False` evaluates to `False` in Python, the `assert self.is_static_input_scheme is False` check fails when loading block-quantized fp8 attention weights.
```diff
--- a/compressed_tensors_w8a16_fp8.py
+++ b/compressed_tensors_w8a16_fp8.py
@@ -111,7 +111,7 @@
         size_k_first = True
         # TODO(rob): refactor block quant into separate class.
         if self.strategy == QuantizationStrategy.BLOCK:
-            assert self.is_static_input_scheme is False
+            assert not self.is_static_input_scheme
             size_k_first = False
             weight, weight_scale = process_fp8_weight_block_strategy(
                 weight, weight_scale
```
Patch 2: KV cache calibrated scales silently ignored
File: vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
Without this patch, the calibrated k_scale/v_scale values are loaded into tensor attributes but never copied to the float fields that FlashInfer actually reads during attention computation. The KV cache silently uses scale=1.0 instead of the calibrated values, producing subtly wrong results.
```diff
--- a/compressed_tensors.py
+++ b/compressed_tensors.py
@@ -1089,6 +1089,16 @@
         layer._v_scale = layer.v_scale
         layer._q_scale = layer.q_scale
+        # Also set the float scales used by FlashInfer for cache reads.
+        # Without this, _k_scale_float/_v_scale_float stay at the default 1.0
+        # from set_default_quant_scales(), causing incorrect dequantization
+        # during attention computation.
+        if layer.k_scale.numel() == 1:
+            layer._k_scale_float = layer.k_scale.item()
+        if layer.v_scale.numel() == 1:
+            layer._v_scale_float = layer.v_scale.item()
+        if layer.q_scale.numel() == 1:
+            layer._q_scale_float = layer.q_scale.item()
+
         # Discard all placeholders.
         del layer.k_scale
         del layer.v_scale
```
You can apply these patches by editing the files directly in your vllm site-packages directory.
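If it helps, a small snippet to print where those two files live in the active environment (the relative paths are taken verbatim from the patch descriptions above):

```python
# Locate the two files to patch inside the installed vllm package.
import pathlib
import vllm

pkg = pathlib.Path(vllm.__file__).parent
print(pkg / "model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a16_fp8.py")
print(pkg / "model_executor/layers/quantization/compressed_tensors/compressed_tensors.py")
```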
Model architecture
```
MiniMaxM2ForCausalLM (~229B params, 62 decoder layers)
  embed_tokens [200064, 3072] bf16
  layers (x62)
    input_layernorm [3072] bf16
    self_attn
      q_proj [3072 -> 6144] fp8_e4m3fn + block scale [48, 24]
      k_proj [3072 -> 1024] fp8_e4m3fn + block scale [8, 24]
      v_proj [3072 -> 1024] fp8_e4m3fn + block scale [8, 24]
      o_proj [6144 -> 3072] fp8_e4m3fn + block scale [24, 48]
    post_attention_layernorm [3072] bf16
    block_sparse_moe
      gate [3072 -> 256] fp32
      experts (x256)
        w1 (gate_proj) [3072 -> 1536] AWQ int4 g128
        w2 (down_proj) [1536 -> 3072] AWQ int4 g128
        w3 (up_proj) [3072 -> 1536] AWQ int4 g128
  norm [3072] bf16
  lm_head [3072 -> 200064] bf16
```
Expert MLP: output = w2(silu(w1(x)) * w3(x)) (SwiGLU), 8 of 256 experts activated per token.
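In plain PyTorch, the per-expert computation (ignoring quantization, router weights, and the top-8 expert selection) looks roughly like this:

```python
import torch
import torch.nn.functional as F

def expert_mlp(x: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor, w3: torch.Tensor) -> torch.Tensor:
    # x: [tokens, 3072]; w1/w3: [1536, 3072]; w2: [3072, 1536]
    # SwiGLU: w2(silu(w1(x)) * w3(x))
    return (F.silu(x @ w1.T) * (x @ w3.T)) @ w2.T
```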
Quantization method
- Phase A: Dequantize original fp8 weights to bf16 on disk (AWQ requires float arithmetic that fp8 cannot perform).
- Phase B: Run AWQ calibration on the expert MLP layers with a `kv_cache_scheme` to simultaneously calibrate the KV cache scales.
- Phase C: Swap the bf16 attention weights back to the original fp8 tensors with proper compressed-tensors naming.
The compressed-tensors format in `config.json` defines two quantization groups: pack-quantized (int4 AWQ for experts) and float-quantized (fp8 block scales for attention), plus a `kv_cache_scheme` for fp8 KV cache with calibrated per-layer scales.
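As a usage sketch (not the author's exact deployment), offline inference with vLLM's Python API on 4 GPUs could look like this; whether `trust_remote_code` is needed for this architecture is an assumption.

```python
# Minimal offline-inference sketch on 4 GPUs. The fp8 KV cache is picked up
# automatically from kv_cache_scheme in config.json, so no kv_cache_dtype
# override is passed. Requires the patched vllm nightly described above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="EliasOenal/MiniMax-M2.5-Hybrid-AWQ-W4A16G128-Attn-fp8_e4m3-KV-fp8_e4m3",
    tensor_parallel_size=4,      # 4x RTX A6000 in the reference deployment
    trust_remote_code=True,      # assumption: may not be required
)

out = llm.generate(["Explain mixture-of-experts in one sentence."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```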