Inference speed issue with local deployment on H800

#44
by jiutu - opened

I followed the official vLLM deployment commands to set up a local deployment on 8x H800 GPUs. However, the inference speed is only around 20 tokens/s. Is this normal?

If you deploy using the official SGLang deployment commands, the inference speed should be around 98 tokens/s.
At least, that's what I get on my server.
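For what it's worth, one way to compare the two backends on the same footing is to time the streamed output from the OpenAI-compatible endpoint that both vLLM and SGLang expose. The sketch below is only an approximation of decode speed; the port, model id, and prompt are assumptions, so adjust them to your deployment.

```python
# Rough sketch: measure decode tokens/s against an OpenAI-compatible endpoint.
# Assumptions: server at localhost:8000 (vLLM default; SGLang usually uses 30000)
# and model id "MiniMaxAI/MiniMax-M2" -- replace with what your server reports.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2",
    messages=[{"role": "user", "content": "Write a short essay about GPUs."}],
    max_tokens=512,
    stream=True,
)

first_token_time = None
chunks = 0
for chunk in stream:
    # Each streamed chunk carries roughly one token of generated text.
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.perf_counter()
        chunks += 1
end = time.perf_counter()

# Excludes prefill / time-to-first-token, so this approximates pure decode speed.
if first_token_time is not None and chunks > 1:
    print(f"~{(chunks - 1) / (end - first_token_time):.1f} tokens/s decode")
```

Running this against both servers with the same prompt and max_tokens should make it clear whether the gap is really 20 vs. 98 tokens/s or a difference in how the numbers were measured.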
