vLLM deployment OOM

#22
by Chris2me - opened

With 8x H100, it OOMs.
Command:
vllm serve Qwen/Qwen3.5-397B-A17B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --language-model-only

Of course it will: 640 GB of VRAM can't fit an 800 GB model.

Wait for the FP8 version.
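
A rough back-of-envelope check backs this up (a minimal sketch in Python; the 397B parameter count is read off the model name, and KV cache / activation overhead are ignored):

# Approximate weight footprint, assuming ~397B parameters as in the model name
params = 397e9

bf16_gb = params * 2 / 1e9   # 2 bytes/param in BF16 -> ~794 GB
fp8_gb  = params * 1 / 1e9   # 1 byte/param in FP8  -> ~397 GB

total_vram_gb = 8 * 80       # 8x H100 80GB = 640 GB

print(f"BF16 weights: {bf16_gb:.0f} GB vs {total_vram_gb} GB total VRAM")  # does not fit
print(f"FP8 weights:  {fp8_gb:.0f} GB vs {total_vram_gb} GB total VRAM")   # fits, with room left for KV cache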

I have the same trouble:
Issue: Cannot start inference server for Qwen3.5-397B-A17B on 8xH100 (80GB) with Python 3.12

Command used:

python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-397B-A17B \
  --host 0.0.0.0 \
  --port 8000 \
  --tp-size 8 \
  --ep-size 8 \
  --context-length 8192 \
  --model-impl sglang \
  --mem-fraction-static 0.60

Error:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU 4 has a total capacity of 79.19 GiB of which 104.25 MiB is free. Including non-PyTorch memory, this process has 79.08 GiB memory in use. Of the allocated memory 77.28 GiB is allocated by PyTorch, and 16.43 MiB is reserved by PyTorch but unallocated.

What I've tried:

· Reducing context length (--context-length 8192)

· Lowering memory fraction (--mem-fraction-static 0.60)

Problem: The server still crashes with OOM on startup. It seems like the model is trying to use almost all available memory (~79 GiB per GPU), leaving no room for overhead or allocations.
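
A quick way to confirm how much memory is actually free on each GPU before launching is a small Python check with torch.cuda.mem_get_info (just a diagnostic sketch, not part of the launch command):

import torch

# Print free vs. total memory per GPU before starting the server
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")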

System:

· 8x NVIDIA H100 80GB

· Python 3.12

· PyTorch with CUDA support

· sglang for inference

Question: How can I successfully run this model on 8xH100? Are there additional settings or optimizations I should try?

It's because the BF16 weights are already 800 GB, so it is impossible to load them on 8x H100 80GB (640 GB total), even if you set the context length to zero!
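
To put that in per-GPU terms (a rough sketch assuming the weights are sharded evenly across the 8 tensor-parallel ranks, ignoring KV cache and activations):

params = 397e9
tp = 8
h100_gb = 80

per_gpu_bf16 = params * 2 / tp / 1e9  # ~99 GB per rank in BF16, already over the 80 GB card capacity
per_gpu_fp8  = params * 1 / tp / 1e9  # ~50 GB per rank in FP8, leaving headroom for KV cache

print(f"BF16: {per_gpu_bf16:.0f} GB/GPU, FP8: {per_gpu_fp8:.0f} GB/GPU, capacity: {h100_gb} GB/GPU")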
