Inference speed issue with local deployment on H800
#44
by jiutu · opened
I followed the official vLLM deployment commands to set up a local deployment on 8x H800 GPUs. However, the inference speed is only around 20 tokens/s. Is this normal?
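For reference, a minimal single-request throughput check can be run with vLLM's offline API along these lines (a sketch only; the model path and prompt are placeholders, not the actual deployment values):

```python
# Rough decode-throughput check with vLLM's offline API (sketch; the model
# path and prompt below are placeholders, not the real deployment settings).
import time

from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/model", tensor_parallel_size=8)  # hypothetical path
params = SamplingParams(temperature=0, max_tokens=256)

start = time.time()
outputs = llm.generate(["Explain the attention mechanism in one paragraph."], params)
elapsed = time.time() - start

# Count only the generated tokens for a single-request decode rate.
generated = len(outputs[0].outputs[0].token_ids)
print(f"{generated / elapsed:.1f} tokens/s")
```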
If you deploy with the official SGLang deployment commands instead, the inference speed should be 98 tokens/s. At least, that is what I see on my server.
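For comparison, a similar single-request check can be run with SGLang's offline Engine API, roughly like this (a sketch under assumptions: the model path, prompt, and the meta_info key are not taken from this thread):

```python
# Rough decode-throughput check with SGLang's offline Engine API (sketch;
# model path and prompt are placeholders, and the meta_info key is assumed).
import time

import sglang as sgl

llm = sgl.Engine(model_path="/path/to/model", tp_size=8)  # hypothetical path

prompts = ["Explain the attention mechanism in one paragraph."]
sampling_params = {"temperature": 0, "max_new_tokens": 256}

start = time.time()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.time() - start

# meta_info carries per-request stats; completion_tokens is assumed here.
tokens = outputs[0]["meta_info"]["completion_tokens"]
print(f"{tokens / elapsed:.1f} tokens/s")
```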