modelopt NVFP4 quantized MiniMax-M2

Instructions from another user, running on RTX Pro 6000 Blackwell:

# Use the FlashInfer attention backend and enable FlashInfer MoE kernels
export VLLM_ATTENTION_BACKEND=FLASHINFER
export VLLM_FLASHINFER_MOE_BACKEND=throughput
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_MOE_FP8=1
export VLLM_USE_FLASHINFER_MOE_FP4=1
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1

# Run on 2 GPUs with tensor parallelism
CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-0,1} /opt/vllm/bin/vllm serve lukealonso/MiniMax-M2-NVFP4 \
  --host 0.0.0.0 \
  --port 8345 \
  --served-model-name default-model lukealonso/MiniMax-M2-NVFP4 \
  --trust-remote-code \
  --gpu-memory-utilization 0.95 \
  --pipeline-parallel-size 1 \
  --enable-expert-parallel \
  --tensor-parallel-size 2 \
  --max-model-len 196608 \
  --max-num-seqs 32 \
  --enable-auto-tool-choice \
  --reasoning-parser minimax_m2_append_think \
  --tool-call-parser minimax_m2 \
  --all2all-backend pplx \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-batched-tokens 16384 \
  --dtype auto \
  --kv-cache-dtype fp8
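
Once the server is up, it exposes vLLM's OpenAI-compatible API on the host/port above. A minimal smoke test (a sketch, assuming the default-model alias set via --served-model-name) could look like:

# List the served model names, then send a short chat completion
curl http://localhost:8345/v1/models

curl http://localhost:8345/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "default-model",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'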

Environment:

  Python: 3.12.3
  vLLM: 0.11.2.dev360+g8e7a89160
  PyTorch: 2.9.0+cu130
  CUDA: 13.0
  GPU 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
  GPU 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
  Triton: 3.5.0
  FlashInfer: 0.5.3
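
These versions can be checked inside the serving environment with something along these lines (a sketch; assumes the packages are importable under these names):

# Print library versions and the visible GPUs
python -c "import vllm, torch, triton, flashinfer; print(vllm.__version__, torch.__version__, triton.__version__, flashinfer.__version__)"
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader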

Tested (but not extensively validated) on 2x RTX Pro 6000 Blackwell via the Docker Compose service below (note: these instructions no longer work, as a nightly vLLM change broke NVFP4 support):

services:
  inference:
    image: vllm/vllm-openai:nightly
    container_name: inference
    ports:
      - "0.0.0.0:8000:8000"
    gpus: "all"
    shm_size: "32g"
    ipc: "host"
    ulimits:
      memlock: -1
      nofile: 1048576
    environment:
      - NCCL_IB_DISABLE=1
      - NCCL_NVLS_ENABLE=0
      - NCCL_P2P_DISABLE=0
      - NCCL_SHM_DISABLE=0
      - VLLM_USE_V1=1
      - VLLM_USE_FLASHINFER_MOE_FP4=1
      - OMP_NUM_THREADS=8
      - SAFETENSORS_FAST_GPU=1
    volumes:
      - /dev/shm:/dev/shm
    command:
      - lukealonso/MiniMax-M2-NVFP4
      - --enable-auto-tool-choice
      - --tool-call-parser
      - minimax_m2
      - --reasoning-parser
      - minimax_m2_append_think
      - --all2all-backend
      - pplx
      - --enable-expert-parallel
      - --enable-prefix-caching
      - --enable-chunked-prefill
      - --served-model-name
      - "MiniMax-M2"
      - --tensor-parallel-size
      - "2"
      - --gpu-memory-utilization
      - "0.95"
      - --max-num-batched-tokens
      - "16384"
      - --dtype
      - "auto"
      - --max-num-seqs
      - "8"
      - --kv-cache-dtype
      - fp8
      - --host
      - "0.0.0.0"
      - --port
      - "8000"

Model size: 115B params (Safetensors tensor types: BF16, F32, F8_E4M3, U8)