Empty Tokens Generated with Llama 4 Maverick NVFP4
#1 by bennorris12345 - opened
Hi,
I have deployed this model with vLLM (deployment command below). The model loads successfully; however, when I send it a few prompts, it simply streams empty tokens with no content (a minimal repro of the request is sketched after the config). Is there anything blatantly wrong with my deployment command? I have tried tensor parallel sizes of 2 and 4 across NVIDIA B200 GPUs. Has anyone else experienced the same issue?
- name: vllm
  image: vllm/vllm-openai:v0.11.2
  command:
    - vllm
    - serve
    - RedHatAI/Llama-4-Maverick-17B-128E-Instruct-NVFP4
    - '--host'
    - 0.0.0.0
    - '--port'
    - '8000'
    - '--tensor-parallel-size'
    - '2'
    - '--limit-mm-per-prompt.image'
    - '10'
    - '--max-model-len'
    - '4096'
    - '--tool-call-parser'
    - llama4_json
  ports:
    - name: container-port
      containerPort: 8000
      protocol: TCP
    - name: zmq-port
      containerPort: 55555
      protocol: TCP
    - name: ucx-port
      containerPort: 9999
      protocol: TCP
  env:
    - name: HF_HOME
      value: /data
    - name: POD_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    - name: LMCACHE_LOG_LEVEL
      value: DEBUG
    - name: PROMETHEUS_MULTIPROC_DIR
      value: /tmp
    - name: VLLM_API_KEY
      valueFrom:
        secretKeyRef:
          name: vllm-prod-stack-v3-secrets
          key: vllmApiKey
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          name: hf-token-secret
          key: HF_TOKEN
    - name: VLLM_LOGGING_LEVEL
      value: DEBUG
    - name: DO_NOT_TRACK
      value: '1'
    - name: VLLM_USE_FLASHINFER_MOE_FP4
      value: '1'
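
For reference, here is a minimal sketch of the kind of request that comes back empty. It is illustrative only: the base URL and API key are placeholders for my cluster's service endpoint and the VLLM_API_KEY secret, and the prompt is arbitrary.

from openai import OpenAI

# Placeholders: substitute the actual service address and API key for your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="PLACEHOLDER")

stream = client.chat.completions.create(
    model="RedHatAI/Llama-4-Maverick-17B-128E-Instruct-NVFP4",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
    stream=True,
)

# Chunks stream back normally, but delta.content is always empty (or None),
# so this prints only '' / None for every chunk.
for chunk in stream:
    print(repr(chunk.choices[0].delta.content))

The request completes without errors on the server side; the only symptom is that every streamed delta carries no content.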