ktransformers CPU NUMA2 AMXINT4 quantizations of DeepSeek-V3.2
Quantized with ktransformers (commit 06982524842b20590bbb4c36204b7333ab760448) using:
kt-kernel/scripts/convert_cpu_weights.py \
--input-path /DeepSeek-V3.2/ \
--input-type fp8 \
--output /DeepSeek-V3.2-CPU-NUMA2-AMXINT4/ \
--quant-method int4 \
--cpuinfer-threads 56 \
--threadpool-count 2
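The `--cpuinfer-threads 56 --threadpool-count 2` pair is presumably sized to the host's layout (two NUMA nodes, hence "NUMA2"). A quick topology check before converting, as a sketch; `numactl` being installed is an assumption, `lscpu` ships with util-linux on most distros:

```bash
# Sketch: inspect the host topology that the thread settings above are sized against.
lscpu | grep -E 'NUMA node|^CPU\(s\)'   # expect 2 NUMA nodes for a NUMA2 conversion
numactl --hardware                      # per-node CPU lists and memory sizes (numactl assumed installed)
```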
Running on sm_120 (RTX 6000 Pro Blackwell)
Differences from the official instructions:
- DeepGEMM does not yet support sm_120, so it must be disabled; see https://github.com/deepseek-ai/DeepGEMM/issues/236 (a quick capability check is sketched after this list).
- Could not get the FP8 KV cache to work, so BF16 is used instead.
- Triton crashes with an out-of-shared-memory error, so the FlashInfer attention backend is used.
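For reference, a minimal sketch of confirming that the card reports sm_120 (compute capability 12.0), assuming a working PyTorch install in the same environment:

```bash
# Sketch: verify the GPU's compute capability; RTX 6000 Pro Blackwell should report (12, 0).
python -c "import torch; print(torch.cuda.get_device_capability(0))"
```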
SGLANG_ENABLE_JIT_DEEPGEMM=false \
CUDA_VISIBLE_DEVICES=0 \
uv run -m \
sglang.launch_server \
--host 0.0.0.0 --port 60000 \
--model /DeepSeek-V3.2/ \
--kt-weight-path /DeepSeek-V3.2-CPU-NUMA2-AMXINT4/ \
--kt-cpuinfer 56 --kt-threadpool-count 2 --kt-num-gpu-experts 24 --kt-method AMXINT4 \
--attention-backend flashinfer \
--trust-remote-code \
--mem-fraction-static 0.98 \
--chunked-prefill-size 4096 \
--max-running-requests 32 \
--max-total-tokens 131072 \
--enable-mixed-chunk \
--tensor-parallel-size 1 \
--enable-p2p-check \
--disable-shared-experts-fusion \
--tool-call-parser deepseekv32 \
--reasoning-parser deepseek-v3 \
--kv-cache-dtype bf16
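Once the server is up, sglang exposes the usual OpenAI-compatible endpoints, so a minimal smoke test looks like the sketch below. The `model` value is an assumption and may need to match whatever id the server reports via `/v1/models`:

```bash
# Sketch: minimal chat-completion request against the server started above.
curl -s http://localhost:60000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/DeepSeek-V3.2/",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```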
Model tree for Kebob/DeepSeek-V3.2-CPU-NUMA2-AMXINT4:
- Base model: deepseek-ai/DeepSeek-V3.2-Exp-Base
- Finetuned from the base model: deepseek-ai/DeepSeek-V3.2