ktransformers CPU NUMA2 AMXINT4 quantizations of DeepSeek-V3.2
Quantized with ktransformers (commit 06982524842b20590bbb4c36204b7333ab760448) using:
kt-kernel/scripts/convert_cpu_weights.py \
--input-path /DeepSeek-V3.2/ \
--input-type fp8 \
--output /DeepSeek-V3.2-CPU-NUMA2-AMXINT4/ \
--quant-method int4 \
--cpuinfer-threads 56 \
--threadpool-count 2
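The `--cpuinfer-threads 56 --threadpool-count 2` pair is presumably sized to the host's layout (two NUMA nodes, hence "NUMA2"). A quick topology check before converting, as a sketch; `numactl` being installed is an assumption, `lscpu` ships with util-linux on most distros:

```bash
# Sketch: inspect the host topology that the thread settings above are sized against.
lscpu | grep -E 'NUMA node|^CPU\(s\)'   # expect 2 NUMA nodes for a NUMA2 conversion
numactl --hardware                      # per-node CPU lists and memory sizes (numactl assumed installed)
```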
Running on sm_120 (RTX 6000 Pro Blackwell)
Differences from the official instructions:
- DeepGEMM does not yet support sm_120, so it must be disabled; see https://github.com/deepseek-ai/DeepGEMM/issues/236 (a quick capability check is sketched after this list).
- Could not get the FP8 KV cache to work, so BF16 is used instead.
- Triton crashes with an out-of-shared-memory error, so the FlashInfer attention backend is used.
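For reference, a minimal sketch of confirming that the card reports sm_120 (compute capability 12.0), assuming a working PyTorch install in the same environment:

```bash
# Sketch: verify the GPU's compute capability; RTX 6000 Pro Blackwell should report (12, 0).
python -c "import torch; print(torch.cuda.get_device_capability(0))"
```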
SGLANG_ENABLE_JIT_DEEPGEMM=false \
CUDA_VISIBLE_DEVICES=0 \
uv run -m \
sglang.launch_server \
--host 0.0.0.0 --port 60000 \
--model /DeepSeek-V3.2/ \
--kt-weight-path /DeepSeek-V3.2-CPU-NUMA2-AMXINT4/ \
--kt-cpuinfer 56 --kt-threadpool-count 2 --kt-num-gpu-experts 24 --kt-method AMXINT4 \
--attention-backend flashinfer \
--trust-remote-code \
--mem-fraction-static 0.98 \
--chunked-prefill-size 4096 \
--max-running-requests 32 \
--max-total-tokens 131072 \
--enable-mixed-chunk \
--tensor-parallel-size 1 \
--enable-p2p-check \
--disable-shared-experts-fusion \
--tool-call-parser deepseekv32 \
--reasoning-parser deepseek-v3 \
--kv-cache-dtype bf16
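Once the server is up, sglang exposes the usual OpenAI-compatible endpoints, so a minimal smoke test looks like the sketch below. The `model` value is an assumption and may need to match whatever id the server reports via `/v1/models`:

```bash
# Sketch: minimal chat-completion request against the server started above.
curl -s http://localhost:60000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/DeepSeek-V3.2/",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```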
Model tree for Kebob/DeepSeek-V3.2-CPU-NUMA2-AMXINT4:
- Base model: deepseek-ai/DeepSeek-V3.2-Exp-Base
- Finetuned from the base model: deepseek-ai/DeepSeek-V3.2