vLLM deployment on 4× H100 (following the official guide) fails with torch.AcceleratorError: CUDA-capable device(s) is/are busy or unavailable during engine startup


Hi MiniMax team,

I am trying to follow the official vLLM deployment guide for MiniMaxAI/MiniMax-M2.1 on a 4× H100 node, but the engine fails to start with a CUDA error, even though all GPUs are healthy and visible.

Environment

  • GPUs: 4× NVIDIA H100 PCIe 80GB (same node)
  • nvidia-smi: all 4 GPUs visible, idle, almost no memory used, no running processes
  • PyTorch CUDA view:
    • torch.cuda.device_count() == 4
    • torch.cuda.is_available() == True
    • Devices: 0, 1, 2, 3 → all NVIDIA H100 PCIe
  • CUDA runtime: 12.8
  • Driver: 570.195.03
  • Python: 3.12
  • vLLM: installed from wheels.vllm.ai/nightly as recommended in the guide
  • Model: MiniMaxAI/MiniMax-M2.1
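
For reference, the torch-side check behind the numbers above is roughly this (plain PyTorch, no vLLM involved):

import torch

# Confirm the CUDA runtime sees all four H100s before vLLM gets involved.
print(torch.cuda.is_available())        # True on this node
print(torch.cuda.device_count())        # 4
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))  # each reports "NVIDIA H100 PCIe"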

Command (from your vLLM guide)

SAFETENSORS_FAST_GPU=1 vllm serve \
  MiniMaxAI/MiniMax-M2.1 --trust-remote-code \
  --tensor-parallel-size 4 \
  --enable-auto-tool-choice --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think

I also tried adding the suggested flag from the guide:

--compilation-config "{\"cudagraph_mode\": \"PIECEWISE\"}"

but the behavior did not change.

Error

The engine starts initializing, logs the expected config (model, TP=4, bf16, fp8 quant, etc.), then fails during worker startup with:

torch.AcceleratorError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
...
WorkerProc initialization failed due to an exception in a background process.
Engine core initialization failed. See root cause above.

This happens inside torch.cuda.set_device(device) in vllm/platforms/cuda.py, i.e., when vLLM is assigning devices to workers. There are no other processes on the GPUs at that time (verified via nvidia-smi), and TP=4 should be valid because all four H100s are visible and empty.
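
To take vLLM out of the picture, here is a rough standalone sketch of what the workers do at that point: spawn one process per GPU and bind each process to its rank. The script is my own test, not from the guide, and only uses standard torch APIs:

import torch
import torch.multiprocessing as mp

def worker(rank: int):
    # Mimics the per-worker torch.cuda.set_device call where the error is raised.
    torch.cuda.set_device(rank)
    x = torch.ones(1, device=f"cuda:{rank}")  # forces context creation on that GPU
    print(f"rank {rank}: ok on {torch.cuda.get_device_name(rank)}")

if __name__ == "__main__":
    # One process per GPU, as a TP=4 launch would use.
    mp.spawn(worker, nprocs=4)

If a test like this fails the same way, the problem would sit below vLLM; if it passes, it points more specifically at the nightly build or the model's startup path.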

What I already tried

  • Verified GPUs and visibility:
    • nvidia-smi: 4 GPUs, no processes
    • Python check: torch.cuda.device_count() == 4, names printed correctly
    • CUDA_VISIBLE_DEVICES is unset (all GPUs available)
  • Tried:
    • VLLM_SKIP_CUDA_P2P_CHECK=1 / VLLM_SKIP_NCCL_P2P_CHECK=1
    • tensor-parallel-size 1 (TP=1 works; TP=4 fails with the error above; a torch-only peer-access check is sketched after this list)
    • Fresh virtualenv with nightly vLLM only, no older vLLM versions
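
The torch-only peer-access check mentioned above is a minimal sketch like the following (standard torch.cuda calls, nothing vLLM-specific); I can run it and post the output if that helps:

import torch

# Report peer (P2P) access between each GPU pair; relevant to the P2P skip flags tried above.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'no'}")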

Given that:

  • The command line matches your guide,
  • Hardware matches your “4× high-end GPUs” recommendation,
  • TP=1 works but TP>1 fails at device init on otherwise healthy H100s,

this looks like a specific interaction between MiniMax-M2.1’s fp8/bf16 setup, the vLLM nightly recommended in the docs, and H100 on CUDA 12.8.

Questions

  1. Is there a known-good vLLM version / nightly build (commit or version pin) that you recommend for running MiniMaxAI/MiniMax-M2.1 on 4× H100 with TP=4?
  2. Are there any additional flags you recommend (e.g. disabling custom all-reduce, specific compilation-config, env vars) for H100 deployments beyond what’s in the vLLM guide?
  3. Could you provide an example of a full, working vllm serve invocation + environment spec (driver, CUDA, torch, vLLM) that you have verified on 4× H100 with M2.1?

Thanks in advance for any guidance, and for publishing this model and the vLLM deployment recipe.

By the way, the same setup works very well on 4× NVIDIA H100 NVL with CUDA 12.8.1; the failure above is on the H100 PCIe node.
