Inference speed issue with local deployment on H800

#44
by jiutu - opened

I followed the official vLLM deployment commands to set up a local deployment on 8x H800 GPUs. However, the inference speed is only around 20 tokens/s. Is this normal?

If you deploy using the official SGLang deployment commands, the inference speed should be around 98 tokens/s.
At least, that's what I get on my server.
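For what it's worth, one way to compare the two backends on the same footing is to time the streamed output from the OpenAI-compatible endpoint that both vLLM and SGLang expose. The sketch below is only an approximation of decode speed; the port, model id, and prompt are assumptions, so adjust them to your deployment.

```python
# Rough sketch: measure decode tokens/s against an OpenAI-compatible endpoint.
# Assumptions: server at localhost:8000 (vLLM default; SGLang usually uses 30000)
# and model id "MiniMaxAI/MiniMax-M2" -- replace with what your server reports.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2",
    messages=[{"role": "user", "content": "Write a short essay about GPUs."}],
    max_tokens=512,
    stream=True,
)

first_token_time = None
chunks = 0
for chunk in stream:
    # Each streamed chunk carries roughly one token of generated text.
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.perf_counter()
        chunks += 1
end = time.perf_counter()

# Excludes prefill / time-to-first-token, so this approximates pure decode speed.
if first_token_time is not None and chunks > 1:
    print(f"~{(chunks - 1) / (end - first_token_time):.1f} tokens/s decode")
```

Running this against both servers with the same prompt and max_tokens should make it clear whether the gap is really 20 vs. 98 tokens/s or a difference in how the numbers were measured.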
