Quantization was performed using exllamav3 v0.0.20.
Note: exllamav3 v0.0.21 included fixes to the Qwen3-Next inference pipeline. These quants still work with older versions, but v0.0.21 or later is recommended for best results.
Update: The KL-divergence anomaly observed with exllamav3 v0.0.20 (where 8.0bpw showed higher KL-divergence than 5.0bpw) has been resolved in v0.0.22. The measurements have been re-run, and the first table below reflects the corrected values: all quantization levels now show the expected monotonic decrease in KL-divergence as bpw increases.
## Measurements table for exllamav3 v0.0.22
| Quant | Size (GB) | KL-div (quant, orig) | KL-div (orig, quant) | Perplexity | Top-K K=1 | Top-K K=2 | Top-K K=3 | Top-K K=4 | Top-K K=5 |
|---|---|---|---|---|---|---|---|---|---|
| 2.0bpw | 20 | 0.41110482 | 0.45863510 | 8.85093561 | 0.7669 | 0.4276 | 0.1963 | 0.0789 | 0.0296 |
| 3.0bpw | 29 | 0.16125607 | 0.16561898 | 8.06653676 | 0.8536 | 0.5947 | 0.3567 | 0.1923 | 0.0960 |
| 4.0bpw | 38 | 0.05995151 | 0.06084711 | 7.72232643 | 0.9079 | 0.7220 | 0.5146 | 0.3383 | 0.2098 |
| 5.0bpw | 47 | 0.02719813 | 0.02733112 | 7.68017339 | 0.9376 | 0.8013 | 0.6315 | 0.4682 | 0.3279 |
| 6.0bpw | 57 | 0.01553572 | 0.01543948 | 7.72972846 | 0.9522 | 0.8440 | 0.7019 | 0.5538 | 0.4183 |
| 7.0bpw | 66 | 0.01088568 | 0.01090296 | 7.71071654 | 0.9611 | 0.8696 | 0.7452 | 0.6116 | 0.4822 |
| 8.0bpw | 75 | 0.00899026 | 0.00897780 | 7.70958606 | 0.9652 | 0.8816 | 0.7673 | 0.6398 | 0.5159 |
| original | 148 | - | - | 7.70773351 | - | - | - | - | - |
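The two KL-div columns are directional: KL divergence is asymmetric, so both KL(quant, orig) and KL(orig, quant) are reported, averaged over test-set token positions. A minimal sketch of the per-position computation on toy next-token distributions (the real measurement averages over many positions using full-vocabulary softmax outputs; the numbers here are illustrative only):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Toy next-token distributions over a 4-token vocabulary
orig  = [0.70, 0.20, 0.05, 0.05]
quant = [0.60, 0.25, 0.10, 0.05]

kl_qo = kl_divergence(quant, orig)  # corresponds to the KL-div (quant, orig) column
kl_oq = kl_divergence(orig, quant)  # corresponds to the KL-div (orig, quant) column
print(round(kl_qo, 4), round(kl_oq, 4))  # the two directions differ slightly
```

Both directions trend together in the table, but the asymmetry is why they are listed separately.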
## Measurements table for exllamav3 v0.0.20 (deprecated)
| Quant | Size (GB) | KL-div (quant, orig) | KL-div (orig, quant) | Perplexity | Top-K K=1 | Top-K K=2 | Top-K K=3 | Top-K K=4 | Top-K K=5 |
|---|---|---|---|---|---|---|---|---|---|
| 2.0bpw | 20 | 0.52142615 | 0.52278535 | 23.73415073 | 0.6961 | 0.3484 | 0.1402 | 0.0498 | 0.0167 |
| 3.0bpw | 29 | 0.24568403 | 0.24622221 | 20.58547252 | 0.7866 | 0.4894 | 0.2579 | 0.1190 | 0.0513 |
| 4.0bpw | 38 | 0.15672405 | 0.15667850 | 19.63543922 | 0.8338 | 0.5783 | 0.3511 | 0.1923 | 0.0990 |
| 5.0bpw | 47 | 0.12297954 | 0.12280908 | 19.81022066 | 0.8562 | 0.6287 | 0.4088 | 0.2463 | 0.1388 |
| 6.0bpw | 57 | 0.10448053 | 0.10464503 | 19.88056610 | 0.8707 | 0.6590 | 0.4502 | 0.2848 | 0.1704 |
| 7.0bpw | 66 | 0.10106506 | 0.10081614 | 19.61846442 | 0.8730 | 0.6666 | 0.4614 | 0.2983 | 0.1821 |
| 8.0bpw | 75 | 0.13291914 | 0.13419860 | 19.85572412 | 0.8631 | 0.6503 | 0.4468 | 0.2885 | 0.1771 |
| original | 148 | - | - | 19.78538866 | - | - | - | - | - |
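For context on the Top-K columns: a plausible reading (the exact criterion is defined by exllamav3's eval scripts) is the fraction of token positions at which the quantized model ranks the same top-K tokens, in the same order, as the original model. A toy sketch under that assumption:

```python
# A sketch, assuming "Top-K" means the fraction of positions where the quantized
# and original models produce the same top-K token ranking. This is an assumed
# reading of the metric, not exllamav3's exact implementation.

def top_k_tokens(logits, k):
    """Indices of the k highest logits, highest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def top_k_agreement(orig_logits, quant_logits, k):
    """Fraction of positions where both models rank the same top-k tokens identically."""
    hits = sum(
        top_k_tokens(o, k) == top_k_tokens(q, k)
        for o, q in zip(orig_logits, quant_logits)
    )
    return hits / len(orig_logits)

# Toy logits for 3 positions over a 4-token vocabulary
orig  = [[3.0, 1.0, 0.5, 0.1], [0.2, 2.0, 1.5, 0.0], [1.0, 0.9, 0.8, 0.7]]
quant = [[2.8, 1.2, 0.4, 0.2], [0.3, 1.9, 1.6, 0.1], [0.9, 1.0, 0.8, 0.7]]

print(top_k_agreement(orig, quant, 1))  # the top token differs only at position 3
```

This also explains why the column values fall quickly as K grows: matching a full top-5 ranking is much stricter than matching just the top token.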
## Tool Calls Support for Qwen/GLM Models
The official tabbyAPI doesn't support tool calls for Qwen and GLM models yet.
If you're using Qwen-Code, OpenClaw, or similar software that needs tool-call support, you can use my fork's tools-support branch:
Clone directly:

```shell
git clone -b tools-support https://github.com/NeuroSenko/tabbyAPI
```

Or add to an existing tabbyAPI installation:

```shell
git remote add neurosenko https://github.com/NeuroSenko/tabbyAPI
git fetch neurosenko
git checkout -b tools-support neurosenko/tools-support
```
This branch includes native tool calling support for Qwen and GLM model families.
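Once the server is running, tool calls go through the standard OpenAI-style chat completions endpoint. A minimal sketch of a request payload, assuming tabbyAPI's default local address (`http://127.0.0.1:5000/v1/chat/completions`), a placeholder API key, and a hypothetical `read_file` tool (all three are assumptions for illustration, not part of this repo):

```python
import json

# Hypothetical OpenAI-style tool schema for the model to call
tools = [{
    "type": "function",
    "function": {
        "name": "read_file",  # hypothetical tool name
        "description": "Read a file from the workspace",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

payload = {
    "model": "Qwen3-Coder-Next-exl3",
    "messages": [{"role": "user", "content": "Open README.md"}],
    "tools": tools,
}

# Sending the request (needs the `requests` package and a running tabbyAPI server):
# import requests
# r = requests.post("http://127.0.0.1:5000/v1/chat/completions",
#                   headers={"Authorization": "Bearer YOUR_API_KEY"},
#                   json=payload, timeout=120)
# Tool invocations, if any, appear under choices[0].message.tool_calls:
# print(r.json()["choices"][0]["message"].get("tool_calls"))

print(json.dumps(payload, indent=2)[:40])
```

Clients like Qwen-Code build this payload for you; the sketch just shows the shape of what the tools-support branch handles server-side.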
## Model tree for NeuroSenko/Qwen3-Coder-Next-exl3

Base model: Qwen/Qwen3-Coder-Next