Can't get it to work on 8x RTX 3090

#1
by maglat - opened

I can't get vLLM to start up with M2.5; I always get a CUDA out of memory error:

"RuntimeError: Worker failed with error 'CUDA out of memory. Tried to allocate 108.00 MiB. GPU 7 has a total capacity of 23.56 GiB of which 95.25 MiB is free. Including non-PyTorch memory, this process has 23.18 GiB memory in use. Of the allocated memory 22.22 GiB is allocated by PyTorch, and 506.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)', please check the stack trace above for the root cause"

I start it up in Docker with the following command:

docker run -d \
  --name vllm-minimax-m2_5 \
  --restart unless-stopped \
  -p 8788:8000 \
  -v /mnt/extra/models:/root/.cache/huggingface \
  --gpus '"device=0,1,2,8,4,5,6,7"' \
  -e CUDA_DEVICE_ORDER=PCI_BUS_ID \
  --ipc=host \
  vllm/vllm-openai:latest-cu130 \
  mratsim/Minimax-M2.5-BF16-INT4-AWQ \
  --tensor-parallel-size 8 \
  --max-num-seqs 2 \
  --max-model-len 196608 \
  --gpu-memory-utilization 0.96 \
  --override-generation-config '{"temperature": 1, "top_p": 0.95, "top_k": 40, "repetition_penalty": 1.1, "frequency_penalty": 0.40}' \
  --kv-cache-dtype fp8 \
  --reasoning-parser minimax_m2 \
  --tool-call-parser minimax_m2 \
  --served-model-name minimax-m2.5 \
  --enable-auto-tool-choice \
  --disable-custom-all-reduce \
  --trust-remote-code
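
Side note: the expandable_segments hint from the error message can be passed into the container as an extra environment variable, e.g. by adding

  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \

to the docker run flags above.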

The same setup worked well with M2.1, by the way; for M2.1 I used cyankiwi/MiniMax-M2.1-AWQ-4bit.
Any idea what's going on this time?


It should work on 8x 3090; see https://huggingface.co/mratsim/MiniMax-M2.1-BF16-INT4-AWQ/discussions/1

However, can you try with less context first, to rule out issues other than VRAM?

My model uses mixed precision: self-attention is left unquantized, so it's roughly 3 GB bigger than cyankiwi's weights. Given that you run with gpu-memory-utilization 0.96, that might push you over the edge.
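
A rough back-of-envelope, assuming the extra weight is split evenly across the 8 GPUs by tensor parallelism:

  ~3 GiB extra weights / 8 GPUs ≈ 384 MiB per GPU
  free memory reported for GPU 7 in your error ≈ 95 MiB

so the extra per-GPU weight alone can exceed the headroom left at 0.96 utilization.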

Try --gpu-memory-utilization 0.9 and --max-model-len auto.

Thank you. I got it to start up with:

docker run -d \
  --name vllm-minimax-m2_5 \
  --restart unless-stopped \
  -p 8788:8000 \
  -v /mnt/extra/models:/root/.cache/huggingface \
  --gpus '"device=0,1,2,8,4,5,6,7"' \
  -e CUDA_DEVICE_ORDER=PCI_BUS_ID \
  --ipc=host \
  vllm/vllm-openai:latest-cu130 \
  mratsim/Minimax-M2.5-BF16-INT4-AWQ \
  --tensor-parallel-size 8 \
  --max-num-seqs 2 \
  --max-model-len auto \
  --gpu-memory-utilization 0.9 \
  --override-generation-config '{"temperature": 1, "top_p": 0.95, "top_k": 40, "repetition_penalty": 1.1, "frequency_penalty": 0.40}' \
  --kv-cache-dtype fp8 \
  --reasoning-parser minimax_m2 \
  --tool-call-parser minimax_m2 \
  --served-model-name minimax-m2.5 \
  --enable-auto-tool-choice \
  --disable-custom-all-reduce \
  --trust-remote-code

Even with

--max-model-len 196608 \

lowering the GPU memory utilization to

--gpu-memory-utilization 0.9 \

did the trick. I thought bigger is better :D
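
For reference, the model the server actually exposes (and, depending on the vLLM version, the resolved max_model_len) can be checked once it is up, e.g.:

  curl -s http://localhost:8788/v1/models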

Can you tell me your TG (token generation) speed, please? I want to know how much I am losing by using 4 PCs with 2x 3090 each, vs. all 8 in the same machine. I am getting 64 t/s with one request and 110 t/s with 2 parallel requests.
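
A rough way to measure it (just a sketch, assuming the server from the commands above on port 8788 with served name minimax-m2.5; t/s ≈ completion tokens divided by wall time):

  # time one request and read the completion token count from the usage field
  time curl -s http://localhost:8788/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "minimax-m2.5", "messages": [{"role": "user", "content": "Write a 500-word story."}], "max_tokens": 1024}' \
    | python3 -c 'import json,sys; print(json.load(sys.stdin)["usage"]["completion_tokens"], "completion tokens")'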

Following accuracy degradation concerns after using the new batch_size=32 feature in LLM Compressor, I have re-uploaded the quants with batch_size=1 to ensure my calibration dataset is passed as-is and not truncated to the shortest sequence in the batch. Please re-download for the highest quality! (See the thread at https://huggingface.co/mratsim/MiniMax-M2.5-BF16-INT4-AWQ/discussions/4.)
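
One way to refresh the local copy (a sketch, assuming the HF cache is the /mnt/extra/models directory mounted into the container above):

  # the container maps /mnt/extra/models to /root/.cache/huggingface, so point HF_HOME there on the host
  HF_HOME=/mnt/extra/models huggingface-cli download mratsim/MiniMax-M2.5-BF16-INT4-AWQ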

Thank you. I got it to start up with
--kv-cache-dtype fp8 \

Although we have a pretty similar setup (8x RTX 3090), for me --kv-cache-dtype fp8 crashes vLLM with:
"(EngineCore_DP0 pid=224079) ERROR 02-20 17:47:45 [core.py:946] RuntimeError: Worker failed with error 'float8 types are not supported by dlpack', please check the stack trace above for the root cause"
