modelopt NVFP4-quantized MiniMax-M2
Instructions contributed by another user, running on 2x RTX Pro 6000 Blackwell:
export VLLM_ATTENTION_BACKEND=FLASHINFER
export VLLM_FLASHINFER_MOE_BACKEND=throughput
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_MOE_FP8=1
export VLLM_USE_FLASHINFER_MOE_FP4=1
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
# Run on 2 GPUs with tensor parallelism
CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-0,1} /opt/vllm/bin/vllm serve lukealonso/MiniMax-M2-NVFP4 \
--host 0.0.0.0 \
--port 8345 \
--served-model-name default-model lukealonso/MiniMax-M2-NVFP4 \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--pipeline-parallel-size 1 \
--enable-expert-parallel \
--tensor-parallel-size 2 \
--max-model-len 196608 \
--max-num-seqs 32 \
--enable-auto-tool-choice \
--reasoning-parser minimax_m2_append_think \
--tool-call-parser minimax_m2 \
--all2all-backend pplx \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-batched-tokens 16384 \
--dtype auto \
--kv-cache-dtype fp8
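Once the server reports it is ready, you can sanity-check it with an OpenAI-compatible chat completion request. This is a minimal sketch; it assumes the server is reachable on localhost:8345 and that you use the default-model alias set via --served-model-name above:
# quick smoke test against the OpenAI-compatible endpoint
curl http://localhost:8345/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "default-model",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 64
      }'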
Environment:
Python: 3.12.3
vLLM: 0.11.2.dev360+g8e7a89160
PyTorch: 2.9.0+cu130
CUDA: 13.0
GPU 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
GPU 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
Triton: 3.5.0
FlashInfer: 0.5.3
Tested (but not extensively validated) on 2x RTX Pro 6000 Blackwell via the Docker Compose service below. Note that these instructions no longer work, because a nightly vLLM build broke NVFP4 support:
inference:
  image: vllm/vllm-openai:nightly
  container_name: inference
  ports:
    - "0.0.0.0:8000:8000"
  gpus: "all"
  shm_size: "32g"
  ipc: "host"
  ulimits:
    memlock: -1
    nofile: 1048576
  environment:
    - NCCL_IB_DISABLE=1
    - NCCL_NVLS_ENABLE=0
    - NCCL_P2P_DISABLE=0
    - NCCL_SHM_DISABLE=0
    - VLLM_USE_V1=1
    - VLLM_USE_FLASHINFER_MOE_FP4=1
    - OMP_NUM_THREADS=8
    - SAFETENSORS_FAST_GPU=1
  volumes:
    - /dev/shm:/dev/shm
  command:
    - lukealonso/MiniMax-M2-NVFP4
    - --enable-auto-tool-choice
    - --tool-call-parser
    - minimax_m2
    - --reasoning-parser
    - minimax_m2_append_think
    - --all2all-backend
    - pplx
    - --enable-expert-parallel
    - --enable-prefix-caching
    - --enable-chunked-prefill
    - --served-model-name
    - "MiniMax-M2"
    - --tensor-parallel-size
    - "2"
    - --gpu-memory-utilization
    - "0.95"
    - --max-num-batched-tokens
    - "16384"
    - --dtype
    - "auto"
    - --max-num-seqs
    - "8"
    - --kv-cache-dtype
    - fp8
    - --host
    - "0.0.0.0"
    - --port
    - "8000"
Base model: MiniMaxAI/MiniMax-M2