Not able to deploy gpt-oss-20b model in A100s

#124
by saiadityavzure - opened

Not able to deploy the gpt-oss-20b model on A100s (40GB * 2).
Any details on how to deploy?

Hey Community,

I have two A100 GPUs (40GB each) and I’m trying to deploy the GPT OSS 20B model. However, I’m encountering an FA3 (FlashAttention 3) error both with NVIDIA NIM and with other providers.

I’ll post the exact error details below for reference. Any guidance, troubleshooting tips, or insights would be greatly appreciated.

Thanks in advance for your support!

This is the error we are getting, shown in the screenshot below. We are using vLLM as the serving platform.
[screenshot: error output]

@saiadityavzure check out the Triton kernel attention backend. The issue is that the A100 is Ampere architecture, which does not support MXFP4 natively. Here is the link from the vLLM recipes:
https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#quickstart
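For example, a minimal sketch of that setup (assuming the openai/gpt-oss-20b checkpoint, two GPUs, and port 8000; adjust these to your environment):

# Force the Triton attention backend, which works on Ampere (A100)
export VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1
vllm serve openai/gpt-oss-20b --tensor-parallel-size 2 --port 8000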

Quick follow-up question: did you eventually manage to get gpt-oss-20b running reliably on your A100s with vLLM?

If yes, could you please share:

- the vLLM version you ended up using,
- the VLLM_ATTENTION_BACKEND value (if you set one),
- and the full vllm serve ... command or deployment/Helm config that worked for you?

I’m trying to self-host gpt-oss-20b on A100 as well (Ampere, MXFP4), so a concrete working example would be super helpful. Thanks!

I use vLLM v0.10.2 for the gpt-oss models.

# https://github.com/vllm-project/vllm/issues/22331#issuecomment-3167520881
# A100FIX="--env VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1"
# FLASHINFER, XFORMERS, FLEX_ATTN, FLASH_ATTN_V2, TORCH_SDPA did not work :'(
# FLASH_ATTN_V2 --> TypeError: FlashAttentionImpl.__init__() got an unexpected keyword argument 'sinks'
# TORCH_SDPA --> is not supported by the V1 Engine. Falling back to V0. We recommend to remove VLLM_ATTENTION_BACKEND=TORCH_SDPA from your config in favor of the V1 Engine.
# ^ the above issue seems to be solved with 0.10.2
# However, 0.11+ has issues with gpt-oss:
# https://github.com/vllm-project/vllm/issues/29641
# Max tokens are not being honoured in Chat Completions for the gpt-oss model. Setting max_tokens (even to a large value like 2048) sometimes makes the response blank, even with low reasoning effort!
# A100FIX="--env VLLM_USE_FLASHINFER_SAMPLER=0" # https://github.com/vllm-project/vllm/issues/26480 <-- setting it did not help for v0.11+ 

# https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#a100
ASYNC_SCHEDULE="--async-scheduling"

MAX_MODEL_LEN="--max-model-len 131072"
# https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#function-calling
FUNCTION_CALL="--tool-call-parser openai --enable-auto-tool-choice " # --reasoning-parser openai_gptoss

# https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#quickstart says vllm/vllm-openai:v0.10.1
# https://github.com/vllm-project/vllm/issues/22308#issuecomment-3324560851 suggests to use v0.10.2 for --tool-call-parser openai
IMAGE_NAME=vllm/vllm-openai:v0.10.2
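
# Placeholder values for variables used by the run command below (assumptions; adjust to your environment):
MODEL=openai/gpt-oss-20b   # model to serve
NUM_CORES=2                # tensor-parallel size (one shard per A100)
LOGIN_PORT=8000            # port for the OpenAI-compatible server
# HF_TOKEN is expected to be exported already (it is passed in as HUGGING_FACE_HUB_TOKEN)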

RUN="podman run --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d --device nvidia.com/gpu=all --network=host -v $HOME/.cache/huggingface:/root/.cache/huggingface --env HUGGING_FACE_HUB_TOKEN=$HF_TOKEN --env VLLM_USE_V1=1 --env VLLM_ENABLE_V1_MULTIPROCESSING=1 -p ${LOGIN_PORT}:${LOGIN_PORT} --ipc=host $IMAGE_NAME --model $MODEL --tensor-parallel-size $NUM_CORES --host localhost --port ${LOGIN_PORT} --seed 42 --root-path /${LOGIN_PORT} $ASYNC_SCHEDULE $MAX_MODEL_LEN $FUNCTION_CALL"
