8 tps on NVIDIA H200
#17, opened by svilen333
Hi, I am testing the model on 1 x NVIDIA H200 with the latest vLLM. Is it normal to get 8 tps with a 128K context, or am I doing something wrong?
Hi
That is definitely not normal. How many concurrent requests are you sending?
Only one request. Using the BF16 version.
Yeah, then something is wrong. The auto calibrator might not have picked up the top_k and top parameters. What are your input and output lengths in the test?
Input length is 15 tokens; output is over 1000. I just gave it a simple HTML+JS coding task.
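For anyone comparing numbers in this thread, here is a minimal sketch of how a single-request tps figure can be computed so that everyone measures the same thing. It is plain arithmetic and makes no vLLM calls. The function names and the time-to-first-token argument are hypothetical, and the timings in the usage lines are illustrative, not measurements from this thread.

```python
def decode_tps(output_tokens: int, total_seconds: float,
               time_to_first_token: float = 0.0) -> float:
    """Tokens per second over the decode phase only.

    Assumption: total_seconds is wall-clock time for the whole request and
    time_to_first_token approximates the prefill cost, so it is subtracted.
    """
    decode_seconds = total_seconds - time_to_first_token
    if decode_seconds <= 0:
        raise ValueError("decode time must be positive")
    return output_tokens / decode_seconds


def seconds_for(output_tokens: int, tps: float) -> float:
    """Wall-clock seconds needed to generate output_tokens at a given tps."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    return output_tokens / tps


# Illustrative numbers only: 1000 generated tokens in 125 s is 8 tps.
rate = decode_tps(output_tokens=1000, total_seconds=125.0)
wait = seconds_for(output_tokens=1000, tps=8.0)
```

With a 15-token prompt the prefill time is negligible, so the reported 8 tps is essentially the decode rate, and a 1000-token answer takes about two minutes at that speed.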