---
license: mit
base_model:
- deepseek-ai/DeepSeek-V3.2
tags:
- ktransformers
- amx
- amxint4
- cpu
- numa2
---

## `ktransformers` CPU NUMA2 AMXINT4 quantizations of DeepSeek-V3.2

Quantized with `ktransformers` (commit 06982524842b20590bbb4c36204b7333ab760448):

```bash
kt-kernel/scripts/convert_cpu_weights.py \
  --input-path /DeepSeek-V3.2/ \
  --input-type fp8 \
  --output /DeepSeek-V3.2-CPU-NUMA2-AMXINT4/ \
  --quant-method int4 \
  --cpuinfer-threads 56 \
  --threadpool-count 2
```

## Running on sm120 (RTX 6000 Pro Blackwell)

Differences from the [official instructions](https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/kt-kernel/deepseek-v3.2-sglang-tutorial.md):

- DeepGEMM does not support sm_120 yet, so it must be disabled via `SGLANG_ENABLE_JIT_DEEPGEMM=false` (see https://github.com/deepseek-ai/DeepGEMM/issues/236).
- FP8 KV cache could not be made to work, so BF16 is used instead (`--kv-cache-dtype bf16`).
- The triton attention backend crashes with an out-of-shared-memory error, so flashinfer is used (`--attention-backend flashinfer`).

```bash
SGLANG_ENABLE_JIT_DEEPGEMM=false \
CUDA_VISIBLE_DEVICES=0 \
uv run -m \
  sglang.launch_server \
  --host 0.0.0.0 --port 60000 \
  --model /DeepSeek-V3.2/ \
  --kt-weight-path /DeepSeek-V3.2-CPU-NUMA2-AMXINT4/ \
  --kt-cpuinfer 56 --kt-threadpool-count 2 --kt-num-gpu-experts 24 --kt-method AMXINT4 \
  --attention-backend flashinfer \
  --trust-remote-code \
  --mem-fraction-static 0.98 \
  --chunked-prefill-size 4096 \
  --max-running-requests 32 \
  --max-total-tokens 131072 \
  --enable-mixed-chunk \
  --tensor-parallel-size 1 \
  --enable-p2p-check \
  --disable-shared-experts-fusion \
  --tool-call-parser deepseekv32 \
  --reasoning-parser deepseek-v3 \
  --kv-cache-dtype bf16
```
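The thread and threadpool values (`--cpuinfer-threads 56` / `--threadpool-count 2` for conversion, `--kt-cpuinfer 56` / `--kt-threadpool-count 2` for serving) assume a host with two NUMA nodes and at least 56 physical cores. A quick sanity check of your own topology, using standard Linux tools rather than anything from ktransformers:

```bash
# Number of NUMA nodes; should be 2 to match --threadpool-count 2 / the NUMA2 build
lscpu | grep 'NUMA node(s)'

# Number of physical cores (unique core/socket pairs, hyperthreads excluded);
# --cpuinfer-threads / --kt-cpuinfer should not exceed this
lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l
```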
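Once the server is up, it can be smoke-tested through sglang's OpenAI-compatible API on the port configured above. A minimal sketch; the `model` value here is a placeholder and, depending on the sglang version, may be ignored or need to match the served model path:

```bash
# Liveness check
curl -s http://localhost:60000/health

# One short chat completion
curl -s http://localhost:60000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "DeepSeek-V3.2",
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 64
  }'
```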