sglang inference
I'm super mega duper hyped to experiment with your model, as I already found the previous iteration to be unique!
Just sharing the sglang serve command I use to run the model in BF16:
"Nanbeige4.1-3B_sglang_131K":
  cmd: |
    sglang serve
    --model-path Nanbeige/Nanbeige4.1-3B
    --host 0.0.0.0
    --trust-remote-code
    --enable-torch-compile
    --tp-size 1
    # --disable-cuda-graph  <-- disabling CUDA graphs makes the model load faster, but you pay for it with lower throughput; I recommend leaving them enabled (i.e. keep this flag commented out)
    --reasoning-parser qwen3
    --tool-call-parser qwen
    --context-length 131072
    --port ${PORT}
On an RTX 3090 (just a bit undervolted), this gives very good throughput even deep in the context window, which is perfect for a model with very long CoTs like this :)
It starts at ~110 t/s at 50 tokens and only falls to ~100 t/s at 32,000 tokens, versus starting at ~81 t/s through llama.cpp, also in BF16. But the major benefit is the prompt processing speed!
Takes ~21.5 GB of VRAM
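If you want to reproduce those numbers, a rough check is to hit the OpenAI-compatible endpoint and divide completion_tokens by the wall time. The port and model name below are assumptions from my llama-swap setup (if you hit sglang directly, use whatever it reports under /v1/models), and jq is just for readability:

time curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Nanbeige4.1-3B_sglang_131K",
        "messages": [{"role": "user", "content": "Explain the KV cache in one paragraph."}],
        "max_tokens": 512
      }' | jq '.usage'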
PS: as I always recommend, give llama-swap a try to manage all your models; it handles any inference engine (llama-cpp, sglang, vllm, etc.). You just have one config file and that's all, and the servers restart automatically when you edit that file. The snippet above is straight from my config file, which is why it doesn't contain the \ before each newline that the bash command would need if run directly. Do yourself a favor and ditch ollama...!
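To give a rough idea before sharing the full thing: the snippet above just sits under the top-level models: key of llama-swap's config.yaml, and you point the binary at that file. The keys and flags below are from memory of the README, so double-check them:

# config.yaml -- rough shape only, not my full file
models:
  "Nanbeige4.1-3B_sglang_131K":
    cmd: |
      sglang serve --model-path Nanbeige/Nanbeige4.1-3B --port ${PORT}
      # ...plus the rest of the flags from the snippet above

# start llama-swap and let it spawn/swap the backends on demand
llama-swap --config config.yaml --listen 0.0.0.0:8080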
If anyone is interested, I'll be more than happy to share my config file and llama-swap cmd as a starting point example!
Just ask!
hey!
I assume this won't be supported by vLLM out of the box, right? The base model isn't derived from one of the common architectures, right?
Hey
@skhadloya
I didn't try it, but yes, it should be supported by vLLM since the architecture is the good old, well-supported LlamaForCausalLM.
You can find it in config.json ;)
"architectures": [
"LlamaForCausalLM"
],
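Something along these lines should work with a recent vLLM (untested on my side; I'm leaving out the reasoning/tool-call parser flags since I haven't checked which vLLM parsers match this model's template):

# basic vLLM serve command, adjust port/context to your setup
vllm serve Nanbeige/Nanbeige4.1-3B \
  --trust-remote-code \
  --max-model-len 131072 \
  --port 8000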
Oh nice - Missed this, thanks!!