sglang inference

#9
by owao - opened

I'm super mega duper hyped to experiment with your model, as I already found the previous iteration to be unique!

Just sharing the sglang serve command I use to run the model in BF16:

  "Nanbeige4.1-3B_sglang_131K":
    cmd: |
      sglang serve
      --model-path Nanbeige/Nanbeige4.1-3B
      --host 0.0.0.0
      --trust-remote-code
      --enable-torch-compile
      --tp-size 1
      # --disable-cuda-graph <-- disabling CUDA graphs makes the model load faster, but you'll pay for it with lower throughput; I recommend leaving them enabled
      --reasoning-parser qwen3
      --tool-call-parser qwen
      --context-length 131072
      --port ${PORT}

On an RTX 3090 (just a bit undervolted), this gives very good throughput even deep in the context window, so it's perfect for a model with very long CoTs like this one :)
It starts at ~110 t/s at 50 tokens of context and only falls to ~100 t/s at 32,000 tokens, versus starting at ~81 t/s through llama.cpp (also in BF16). But the major benefit is the prompt processing speed!
It takes ~21.5 GB of VRAM.
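
In case it's useful, here's a quick way to sanity-check the endpoint once it's up. It's just a sketch: adjust the port to whatever ${PORT} resolves to in your setup (or to the llama-swap proxy port), and with llama-swap the "model" field should match the key from the config, since that's what triggers the swap.

  # Example request against the OpenAI-compatible API that sglang exposes.
  # Port 8080 is an assumption; replace it with your llama-swap/sglang port.
  curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Nanbeige4.1-3B_sglang_131K",
      "messages": [{"role": "user", "content": "Give me a one-line summary of CUDA graphs."}],
      "max_tokens": 256
    }'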

PS: as I always recommend, give llama-swap a try to manage all your models; it handles any inference engine (llama.cpp, sglang, vllm, etc.). You just have one config file and that's it, and the servers restart automatically when you edit that file. The snippet above comes straight from my config file, which is why it doesn't have the trailing \ before each newline that the bash command would need if run directly. Do yourself a favor and ditch ollama...!
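
To give an idea of the structure, the surrounding file looks roughly like this (a minimal sketch from memory; check the llama-swap README for the exact key names and defaults):

  # llama-swap config sketch: each entry under "models" maps a model name to
  # the command that serves it; llama-swap starts/stops the servers on demand.
  healthCheckTimeout: 300   # sglang + torch.compile can take a while to come up
  models:
    "Nanbeige4.1-3B_sglang_131K":
      cmd: |
        sglang serve --model-path Nanbeige/Nanbeige4.1-3B ... --port ${PORT}
      ttl: 600                # optional: unload the model after 10 min idle
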
If anyone is interested, I'll be more than happy to share my config file and llama-swap cmd as a starting point example!
Just ask!

Hey!
I assume this won't be supported by vLLM out of the box, right? The base model isn't a derivative of one of the common players, right?

Hey @skhadloya, I didn't try it, but yes, it should be supported by vLLM since the architecture is the good old, well-supported LlamaForCausalLM.

You can find it in config.json ;)

  "architectures": [
    "LlamaForCausalLM"
  ],
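
Untested on my side, but something along these lines should be enough to try it with a recent vLLM (treat it as a starting point, not a verified command):

  # Rough equivalent of the sglang setup above; adjust flags to taste.
  vllm serve Nanbeige/Nanbeige4.1-3B \
    --trust-remote-code \
    --max-model-len 131072 \
    --port 8000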

Oh nice, I missed this, thanks!!
