I have a desktop system with 96GB VRAM and 192GB of RAM.
My system sits right in that gap: plenty of headroom after smol-IQ2_KS, but not enough for IQ3_KS. Is there any benefit to a quant in that middle ground?
Heya, I forget: is your 96GB in a single GPU or spread across multiple GPUs? (-sm graph works for 2 or more GPUs, but not on this model yet afaik.)
So you have 288GiB total RAM+VRAM, let's see...
There might be a good spot there; let's check what is available from https://huggingface.co/AesSedai/GLM-5-GGUF ... no, I believe he needs to get more public repo storage before he can keep uploading quants at the moment, but I like his recipes for mainline.
Here is where I am currently in my testing:
I might release 2 more quants then: something a bit bigger, and something fitting that gap you mention.
Perhaps an IQ2_KL would do the trick. Keep an eye out and see if I get something out by tomorrow!
Thanks!
Okay, cooked up an interesting mix of iq2_kl for ffn_(gate|up)_exps and iq3_ks for ffn_down_exps, giving us a 255.84 GiB model size, roughly 3bpw overall. Given GLM-5 uses MLA, the kv-cache is very efficient for context, so it might be just right assuming you're not running much else on there.
I'm testing perplexity now to see how it holds up!
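In case anyone wants to roll their own mix along these lines, here's a rough sketch of how that kind of split can be expressed with ik_llama.cpp's llama-quantize and --custom-q. Only the gate/up vs down expert split matches what I described above; the attention/embedding types, imatrix file, and paths are placeholder assumptions, not the exact recipe I'm releasing:
# hypothetical custom-q recipe sketch: iq2_kl gate/up experts, iq3_ks down experts
# (attention/embedding types and all paths below are placeholder assumptions)
custom="
blk\..*\.ffn_down_exps\.weight=iq3_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl
blk\..*\.attn_.*\.weight=iq5_ks
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
# collapse the multi-line list into the comma-separated form --custom-q expects
custom=$(echo "$custom" | grep -v '^#' | sed -Ez 's:\n+:,:g;s:,$::;s:^,::')
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /path/to/imatrix-GLM-5.dat \
    /path/to/GLM-5-BF16.gguf \
    /path/to/GLM-5-IQ2_KL.gguf \
    IQ2_KL \
    24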
The IQ2_KL's perplexity looks good enough for me to release it; keep us posted on how it works out for you!
https://huggingface.co/ubergarm/GLM-5-GGUF/resolve/main/images/perplexity.png
Thanks! I will give it a try!
Tight is right! 93088MiB VRAM and 179336MiB RAM. I'm still experimenting with launch parameters, but after an hour of testing it feels different from the other quants in a way I can't quite describe. I really hope the M5Ultra is everything the benchmarks predict, because I think I will really like this model for chatting, but I read faster than my current setup can output.
Oh, I get ~7 tok/s TG, pretty respectable.
Sweet, glad it fits! I'm gonna try running that same IQ2_KL on level1tech Wendell's GPU rig:
AMD 7965WX 24-core, 8x32GiB DDR5@4800, dual RTX A6000 (96GB total VRAM), Driver: 580.105.08, CUDA: 13.0
If I get it going, I'll try to make a llama-sweep-bench graph to see how PP/TG fall off with longer context
Oof, I knew this GLM-5 had more active weights than many other MoEs in its size class, but yeesh...
Maybe there is a better way I could be running it? I had some problems starting up with --n-cpu-moe XX for some reason on this specific rig, so I've mostly run it CPU-only, hah... I didn't try quantizing the kv-cache as it is pretty efficient already: 128k context at 11232.00 MiB. It would probably run better on a single 96GB GPU like a 6000 PRO, but I'm not sure how much better without more DRAM memory bandwidth.
Anyway, here is what I tried:
# example api server
model=/mnt/raid/models/ubergarm/GLM-5-GGUF/GLM-5-IQ2_KL-00001-of-00007.gguf
./build/bin/llama-server \
--model "$model" \
--alias ubergarm/GLM-5 \
--ctx-size 131072 \
-ctk f16 \
-ger \
--merge-qkv \
-mla 3 -amb 1024 \
-ot "blk\.(3|4|5|6|7|8|9|10|11|11)\.ffn_(gate|up|down)_exps.*=CUDA0" \
-ot "blk\.(64|65|66|67|68|69|70|71|72|73|74|75|76|77)\.ffn_(gate|up|down)_exps.*=CUDA1" \
--cpu-moe \
-ub 4096 -b 4096 \
--threads 24 \
--host 127.0.0.1 \
--port 8080 \
--no-mmap \
--jinja
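(And once the server is up with those flags, it exposes the usual OpenAI-compatible endpoint, so a quick sanity check looks something like the following; the prompt and sampler values are just examples.)
# quick smoke test against the OpenAI-compatible chat endpoint started above
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ubergarm/GLM-5",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 64,
        "temperature": 0.6
      }'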
# llama-sweep-bench used for image below
./build/bin/llama-sweep-bench \
--model "$model" \
--ctx-size 131072 \
-ctk f16 \
-ger \
--merge-qkv \
-mla 3 -amb 1024 \
-ot "blk\.(3|4|5|6|7|8|9|10|11|11)\.ffn_(gate|up|down)_exps.*=CUDA0" \
-ot "blk\.(64|65|66|67|68|69|70|71|72|73|74|75|76|77)\.ffn_(gate|up|down)_exps.*=CUDA1" \
--cpu-moe \
-ub 4096 -b 4096 \
--threads 24 \
--no-mmap \
--warmup-batch \
--n-predict 64
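To turn that run into a graph, I just tee the output to a log (append 2>&1 | tee sweep-GLM-5-IQ2_KL-gpu.log to the command above) and then pull out the table rows for plotting. This is just a sketch: the log filename is made up and it assumes the usual markdown-style table that sweep-bench prints.
# keep only the markdown table rows (PP, TG, N_KV, S_PP t/s, S_TG t/s) from the teed log
grep -E '^\|' sweep-GLM-5-IQ2_KL-gpu.log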
# CPU sweep bench
model=/mnt/raid/hf/GLM-5-GGUF/IQ2_KL/GLM-5-IQ2_KL-00001-of-00007.gguf
numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-sweep-bench \
--model "$model"\
--ctx-size 131072 \
-ctk q8_0 \
-ger \
--merge-qkv \
-mla 3 \
--threads 92 \
--threads-batch 128 \
-ub 4096 -b 4096 \
--no-mmap \
--numa numactl \
--warmup-batch \
--n-predict 64
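One note on the CPU run: ${SOCKET} is just the NUMA node I pin to. If you want to replicate it, the standard tools show what nodes/cores are available on your box (generic commands, not part of the run above):
# inspect NUMA topology before picking ${SOCKET} (e.g. SOCKET=0)
numactl --hardware        # nodes, their CPUs, and per-node memory
lscpu | grep -i numa      # NUMA node count and CPU ranges per node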
I might be running it non-optimally; it might actually go faster with fewer of the sparse experts offloaded onto the GPUs. I need to do some research when I have a spare sec: https://github.com/ikawrakow/ik_llama.cpp/pull/1288#issuecomment-3929312142

