vLLM deployment on 4× H100 (following the official guide) fails with torch.AcceleratorError: CUDA-capable device(s) is/are busy or unavailable during engine startup


Hi MiniMax team,

I am trying to follow the official vLLM deployment guide for MiniMaxAI/MiniMax-M2.1 on a 4× H100 node, but the engine fails to start with a CUDA error, even though all GPUs are healthy and visible.

Environment

  • GPUs: 4× NVIDIA H100 PCIe 80GB (same node)
  • nvidia-smi: all 4 GPUs visible, idle, almost no memory used, no running processes
  • PyTorch CUDA view:
    • torch.cuda.device_count() == 4
    • torch.cuda.is_available() == True
    • Devices: 0, 1, 2, 3 → all NVIDIA H100 PCIe
  • CUDA runtime: 12.8
  • Driver: 570.195.03
  • Python: 3.12
  • vLLM: installed from wheels.vllm.ai/nightly as recommended in the guide
  • Model: MiniMaxAI/MiniMax-M2.1
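
For reference, the torch-side check behind the numbers above is roughly this (plain PyTorch, no vLLM involved):

import torch

# Confirm the CUDA runtime sees all four H100s before vLLM gets involved.
print(torch.cuda.is_available())        # True on this node
print(torch.cuda.device_count())        # 4
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))  # each reports "NVIDIA H100 PCIe"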

Command (from your vLLM guide)

SAFETENSORS_FAST_GPU=1 vllm serve \
  MiniMaxAI/MiniMax-M2.1 --trust-remote-code \
  --tensor-parallel-size 4 \
  --enable-auto-tool-choice --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think

I also tried adding the suggested flag from the guide:

--compilation-config "{\"cudagraph_mode\": \"PIECEWISE\"}"

but the behavior did not change.

Error

The engine starts initializing, logs the expected config (model, TP=4, bf16, fp8 quant, etc.), then fails during worker startup with:

torch.AcceleratorError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
...
WorkerProc initialization failed due to an exception in a background process.
Engine core initialization failed. See root cause above.

This happens inside torch.cuda.set_device(device) in vllm/platforms/cuda.py, i.e., when vLLM is assigning devices to workers. There are no other processes on the GPUs at that time (verified via nvidia-smi), and TP=4 should be valid because all four H100s are visible and empty.
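
To take vLLM out of the picture, here is a rough standalone sketch of what the workers do at that point: spawn one process per GPU and bind each process to its rank. The script is my own test, not from the guide, and only uses standard torch APIs:

import torch
import torch.multiprocessing as mp

def worker(rank: int):
    # Mimics the per-worker torch.cuda.set_device call where the error is raised.
    torch.cuda.set_device(rank)
    x = torch.ones(1, device=f"cuda:{rank}")  # forces context creation on that GPU
    print(f"rank {rank}: ok on {torch.cuda.get_device_name(rank)}")

if __name__ == "__main__":
    # One process per GPU, as a TP=4 launch would use.
    mp.spawn(worker, nprocs=4)

If a test like this fails the same way, the problem would sit below vLLM; if it passes, it points more specifically at the nightly build or the model's startup path.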

What I already tried

  • Verified GPUs and visibility:
    • nvidia-smi: 4 GPUs, no processes
    • Python check: torch.cuda.device_count() == 4, names printed correctly
    • CUDA_VISIBLE_DEVICES is unset (all GPUs available)
  • Tried:
    • VLLM_SKIP_CUDA_P2P_CHECK=1 / VLLM_SKIP_NCCL_P2P_CHECK=1
    • tensor-parallel-size 1 (TP=1 works; TP=4 fails with the error above; a torch-only peer-access check is sketched after this list)
    • Fresh virtualenv with nightly vLLM only, no older vLLM versions
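
The torch-only peer-access check mentioned above is a minimal sketch like the following (standard torch.cuda calls, nothing vLLM-specific); I can run it and post the output if that helps:

import torch

# Report peer (P2P) access between each GPU pair; relevant to the P2P skip flags tried above.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'no'}")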

Given that:

  • The command line matches your guide,
  • Hardware matches your “4× high-end GPUs” recommendation,
  • TP=1 works but TP>1 fails at device init on otherwise healthy H100s,

this looks like a specific interaction between MiniMax-M2.1’s fp8/bf16 setup, the vLLM nightly recommended in the docs, and H100 on CUDA 12.8.

Questions

  1. Is there a known-good vLLM version / nightly build (commit or version pin) that you recommend for running MiniMaxAI/MiniMax-M2.1 on 4× H100 with TP=4?
  2. Are there any additional flags you recommend (e.g. disabling custom all-reduce, specific compilation-config, env vars) for H100 deployments beyond what’s in the vLLM guide?
  3. Could you provide an example of a full, working vllm serve invocation + environment spec (driver, CUDA, torch, vLLM) that you have verified on 4× H100 with M2.1?

Thanks in advance for any guidance, and for publishing this model and the vLLM deployment recipe.

By the way, the same setup works very well on 4× NVIDIA H100 NVL with CUDA 12.8.1; the failure above is on the H100 PCIe node.
