MiniMax-M2.5-Hybrid-AWQ-W4A16G128-Attn-fp8_e4m3-KV-fp8_e4m3

A hybrid AWQ int4 + fp8 attention + fp8 KV cache quantization of MiniMaxAI/MiniMax-M2.5 (~229B parameters, 256 experts per layer) that fits on 4x RTX A6000 (192 GB total, Ampere) with ~370,000 tokens of KV cache (more than double the bf16 capacity). This currently requires vLLM patches to run!

What makes this special

This is not a straightforward quantization; it is tuned for extreme VRAM efficiency:

  1. Expert MLP layers (w1/w2/w3 — 224.7B params, 98.3% of the model) are quantized to AWQ int4 with group_size=128. These are the bulk of the parameters and benefit most from compression.

  2. Attention layers (q/k/v/o_proj — 2.7B params) are kept in their original fp8_e4m3fn with per-block scales (128x128 blocks). These are quality-sensitive and small enough that keeping higher precision is worthwhile. vLLM requires a patch to accept this.

  3. KV cache uses fp8_e4m3 with per-layer calibrated scales derived during the AWQ calibration pass. This doubles the KV cache token capacity compared to bf16: on our 4x A6000 deployment, it went from ~160,000 tokens to ~370,000 tokens. vLLM auto-detects the fp8 KV cache from kv_cache_scheme in the checkpoint config, so no --kv-cache-dtype flag is needed (a minimal serving sketch follows this list). A vLLM patch is needed to support this.

  4. Embeddings, LM head, norms, and MoE gates stay in their original bf16/fp32 precision. These are tiny but highly sensitive to quantization.
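
For illustration, loading this checkpoint with vLLM's offline Python API might look like the sketch below, assuming the patches from the "Required patches" section are already applied. kv_cache_dtype is left at its default: the fp8 KV cache is picked up from kv_cache_scheme in the checkpoint config.

```python
# Minimal sketch: serve this checkpoint across 4 GPUs with vLLM's offline LLM API.
# Assumes an already-patched vLLM nightly (see "Required patches" below).
from vllm import LLM, SamplingParams

llm = LLM(
    model="EliasOenal/MiniMax-M2.5-Hybrid-AWQ-W4A16G128-Attn-fp8_e4m3-KV-fp8_e4m3",
    tensor_parallel_size=4,  # one shard per RTX A6000
    # kv_cache_dtype stays at "auto": the fp8_e4m3 KV cache and its calibrated
    # scales are detected from kv_cache_scheme in the checkpoint config.
)

outputs = llm.generate(["Explain SwiGLU in one sentence."], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```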

AWQ calibration

The AWQ calibration was run with these settings:

  • 128 calibration samples from a curated dataset (40 code, 40 multilingual, 24 reasoning, 24 diverse)
  • Sequence length 1024
  • n_grid=20 (AWQ grid search resolution)
  • Used this fork of llm-compressor with MiniMax-M2 support, plus custom patches.

Total quantization time was ~14 hours on a single A6000 GPU.
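
For orientation, a calibration pass with these settings might be sketched as below. This is not the exact script used for this checkpoint: the imports follow the upstream llm-compressor layout, the ignore patterns and dataset are placeholders, and n_grid=20 is assumed to be set inside the fork's AWQ grid search rather than passed here.

```python
# Rough sketch of the AWQ calibration pass (illustrative; see the linked
# llm-compressor fork for the actual MiniMax-M2 support and patches).
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

recipe = AWQModifier(
    targets=["Linear"],
    scheme="W4A16",  # int4 weights, group_size=128
    # Keep attention projections, lm_head, and MoE gates out of the int4 pass
    # (placeholder regexes -- the real module names depend on the model code).
    ignore=["lm_head", "re:.*self_attn.*", "re:.*gate$"],
)
# The real run also calibrated fp8 KV cache scales in the same pass
# (see the kv_cache_scheme discussion above).

oneshot(
    model="MiniMaxAI/MiniMax-M2.5",  # base checkpoint (dequantized to bf16 first)
    dataset="open_platypus",         # stand-in; the real run used a curated 128-sample mix
    recipe=recipe,
    max_seq_length=1024,
    num_calibration_samples=128,
)
```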

Per-component breakdown

| Component | Params | Dtype |
|---|---|---|
| Expert w1/w2/w3 | 224.7B (98.3%) | AWQ int4, group_size=128 |
| Attention q/k/v/o_proj | 2.7B (1.2%) | fp8_e4m3fn, block scales [128, 128] |
| Embeddings (embed_tokens) | 0.6B | bf16 |
| LM head | 0.6B | bf16 |
| MoE gates | 0.05B | fp32 |
| Norms, biases | <0.01B | bf16/fp32 |
| KV cache | runtime | fp8_e4m3, calibrated scales |

Software requirements

This checkpoint uses features that are not yet in stable vLLM releases. You need:

| Package | Minimum version | Notes |
|---|---|---|
| vllm | 0.16.0rc2.dev207 or newer nightly | Block fp8 support (PR #33280). Install from --extra-index-url https://wheels.vllm.ai/nightly |
| compressed-tensors | 0.13.1a20260212 | Alpha release, ships with the vLLM nightly |
| torch | 2.10.0+cu128 | Pulled in by the vLLM nightly |
| flashinfer | 0.6.3 | Attention backend for Ampere GPUs |

Required patches (two bugs in vLLM as of 0.16.0rc2.dev250)

These are bugs in vLLM's compressed-tensors integration that affect mixed int4+fp8 checkpoints with calibrated KV cache scales. Both must be applied to the vLLM installation before serving this checkpoint.

Patch 1: Block-quantized fp8 weight loading crash

File: vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a16_fp8.py

The is_static_input_scheme field is None (not False) when no input quantization is configured. Because None is False evaluates to False, the assertion fails when loading block-quantized fp8 attention weights.
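
For illustration, the difference between the failing check and the patched one when the field is None:

```python
is_static_input_scheme = None  # value when no input quantization is configured

print(is_static_input_scheme is False)  # False -> `assert ... is False` raises AssertionError
print(not is_static_input_scheme)       # True  -> `assert not ...` passes
```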

--- a/compressed_tensors_w8a16_fp8.py
+++ b/compressed_tensors_w8a16_fp8.py
@@ -111,7 +111,7 @@
         size_k_first = True
         # TODO(rob): refactor block quant into separate class.
         if self.strategy == QuantizationStrategy.BLOCK:
-            assert self.is_static_input_scheme is False
+            assert not self.is_static_input_scheme
             size_k_first = False
             weight, weight_scale = process_fp8_weight_block_strategy(
                 weight, weight_scale

Patch 2: KV cache calibrated scales silently ignored

File: vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py

Without this patch, the calibrated k_scale/v_scale values are loaded into tensor attributes but never copied to the float fields that FlashInfer actually reads during attention computation. The KV cache silently uses scale=1.0 instead of the calibrated values, producing subtly wrong results.

--- a/compressed_tensors.py
+++ b/compressed_tensors.py
@@ -1089,6 +1089,16 @@
         layer._v_scale = layer.v_scale
         layer._q_scale = layer.q_scale
 
+        # Also set the float scales used by FlashInfer for cache reads.
+        # Without this, _k_scale_float/_v_scale_float stay at the default 1.0
+        # from set_default_quant_scales(), causing incorrect dequantization
+        # during attention computation.
+        if layer.k_scale.numel() == 1:
+            layer._k_scale_float = layer.k_scale.item()
+        if layer.v_scale.numel() == 1:
+            layer._v_scale_float = layer.v_scale.item()
+        if layer.q_scale.numel() == 1:
+            layer._q_scale_float = layer.q_scale.item()
+
         # Discard all placeholders.
         del layer.k_scale
         del layer.v_scale

You can apply these patches by editing the files directly in your vLLM site-packages directory.
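
For example, this snippet prints the paths of the two files in the installed package (a convenience helper, not part of the patches):

```python
# Locate the two vLLM files that need patching in the current environment.
from pathlib import Path
import vllm

ct_dir = Path(vllm.__file__).parent / "model_executor/layers/quantization/compressed_tensors"
print(ct_dir / "schemes/compressed_tensors_w8a16_fp8.py")
print(ct_dir / "compressed_tensors.py")
```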

Model architecture

MiniMaxM2ForCausalLM (~229B params, 62 decoder layers)
  embed_tokens                              [200064, 3072]  bf16
  layers (x62)
    input_layernorm                         [3072]          bf16
    self_attn
      q_proj                                [3072 -> 6144]  fp8_e4m3fn + block scale [48, 24]
      k_proj                                [3072 -> 1024]  fp8_e4m3fn + block scale [8, 24]
      v_proj                                [3072 -> 1024]  fp8_e4m3fn + block scale [8, 24]
      o_proj                                [6144 -> 3072]  fp8_e4m3fn + block scale [24, 48]
    post_attention_layernorm                [3072]          bf16
    block_sparse_moe
      gate                                  [3072 -> 256]   fp32
      experts (x256)
        w1 (gate_proj)                      [3072 -> 1536]  AWQ int4 g128
        w2 (down_proj)                      [1536 -> 3072]  AWQ int4 g128
        w3 (up_proj)                        [3072 -> 1536]  AWQ int4 g128
  norm                                      [3072]          bf16
  lm_head                                   [3072 -> 200064] bf16

Expert MLP: output = w2(silu(w1(x)) * w3(x)) (SwiGLU), 8 of 256 experts activated per token.
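
For reference, a single expert's forward pass in plain PyTorch (bf16/fp32 math for clarity; in the checkpoint these three weights are stored as AWQ int4):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertMLP(nn.Module):
    """One MoE expert: SwiGLU with hidden size 3072 and intermediate size 1536."""
    def __init__(self, hidden: int = 3072, inter: int = 1536):
        super().__init__()
        self.w1 = nn.Linear(hidden, inter, bias=False)  # gate_proj
        self.w3 = nn.Linear(hidden, inter, bias=False)  # up_proj
        self.w2 = nn.Linear(inter, hidden, bias=False)  # down_proj

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # output = w2(silu(w1(x)) * w3(x)); the router activates 8 of 256 such experts per token
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

print(ExpertMLP()(torch.randn(1, 3072)).shape)  # torch.Size([1, 3072])
```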

Quantization method

  1. Phase A: Dequantize original fp8 weights to bf16 on disk (AWQ requires float arithmetic that fp8 cannot perform).
  2. Phase B: Run AWQ calibration on expert MLP layers with kv_cache_scheme to simultaneously calibrate KV cache scales.
  3. Phase C: Swap bf16 attention weights back to the original fp8 tensors with proper compressed-tensors naming.

The compressed-tensors format in config.json defines two quantization groups: pack-quantized (int4 AWQ for experts) and float-quantized (fp8 block scales for attention), plus a kv_cache_scheme for fp8 KV cache with calibrated per-layer scales.
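
Schematically, the layout looks roughly like the following (an abbreviated illustration written as a Python dict; the target regexes and some field names are assumptions, and the checkpoint's own config.json is authoritative):

```python
# Abbreviated, illustrative sketch of the quantization_config structure.
quantization_config = {
    "quant_method": "compressed-tensors",
    "config_groups": {
        "group_0": {  # expert MLPs: AWQ int4, group_size=128
            "targets": ["re:.*experts.*"],  # assumed pattern
            "weights": {"num_bits": 4, "type": "int", "strategy": "group", "group_size": 128},
            "format": "pack-quantized",
        },
        "group_1": {  # attention projections: fp8 with 128x128 block scales
            "targets": ["re:.*self_attn.*"],  # assumed pattern
            "weights": {"num_bits": 8, "type": "float", "strategy": "block", "block_structure": [128, 128]},
            "format": "float-quantized",
        },
    },
    "kv_cache_scheme": {"num_bits": 8, "type": "float", "strategy": "tensor", "dynamic": False},
}
```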
