Empty Tokens Generated with Llama 4 Maverick NVFP4

#1 by bennorris12345 - opened

Hi,

I have deployed this model with vLLM (deployment command below). The model loads successfully; however, when I send a few prompts, it simply streams empty tokens with no content (a minimal request that reproduces this is sketched after the config). Is there anything blatantly wrong with my deployment command? I have tried tensor-parallel sizes of 2 and 4 across NVIDIA B200 GPUs. Has anyone else experienced the same issue?

    - name: vllm
      image: vllm/vllm-openai:v0.11.2
      command:
        - vllm
        - serve
        - RedHatAI/Llama-4-Maverick-17B-128E-Instruct-NVFP4
        - '--host'
        - 0.0.0.0
        - '--port'
        - '8000'
        - '--tensor-parallel-size'
        - '2'
        - '--limit-mm-per-prompt.image'
        - '10'
        - '--max-model-len'
        - '4096'
        - '--tool-call-parser'
        - llama4_json
      ports:
        - name: container-port
          containerPort: 8000
          protocol: TCP
        - name: zmq-port
          containerPort: 55555
          protocol: TCP
        - name: ucx-port
          containerPort: 9999
          protocol: TCP
      env:
        - name: HF_HOME
          value: /data
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: LMCACHE_LOG_LEVEL
          value: DEBUG
        - name: PROMETHEUS_MULTIPROC_DIR
          value: /tmp
        - name: VLLM_API_KEY
          valueFrom:
            secretKeyRef:
              name: vllm-prod-stack-v3-secrets
              key: vllmApiKey
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: HF_TOKEN
        - name: VLLM_LOGGING_LEVEL
          value: DEBUG
        - name: DO_NOT_TRACK
          value: '1'
        - name: VLLM_USE_FLASHINFER_MOE_FP4
          value: '1'
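
For reference, this is roughly how I am sending prompts (the base URL, API key placeholder, and prompt are illustrative; the model name matches the deployment above). The response content comes back empty:

    from openai import OpenAI

    # Point the standard OpenAI client at the vLLM server
    # (illustrative address; in-cluster this goes through the service).
    client = OpenAI(
        base_url="http://localhost:8000/v1",
        api_key="<VLLM_API_KEY>",  # the key from the vllm-prod-stack-v3-secrets secret
    )

    response = client.chat.completions.create(
        model="RedHatAI/Llama-4-Maverick-17B-128E-Instruct-NVFP4",
        messages=[{"role": "user", "content": "What is the capital of France?"}],
        max_tokens=128,
    )

    # Expected: a normal answer. Observed: an empty string.
    print(repr(response.choices[0].message.content))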