---
license: mit
base_model:
- deepseek-ai/DeepSeek-V3.2
tags:
- ktransformers
- amx
- amxint4
- cpu
- numa2
---

## `ktransformers` CPU NUMA2 AMXINT4 quantizations of DeepSeek-V3.2

Quantized with `ktransformers` (commit 06982524842b20590bbb4c36204b7333ab760448):

```bash
kt-kernel/scripts/convert_cpu_weights.py \
  --input-path /DeepSeek-V3.2/ \
  --input-type fp8 \
  --output /DeepSeek-V3.2-CPU-NUMA2-AMXINT4/ \
  --quant-method int4 \
  --cpuinfer-threads 56 \
  --threadpool-count 2
```

## Running on sm120 (RTX 6000 Pro Blackwell)

Differences from the [official instructions](https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/kt-kernel/deepseek-v3.2-sglang-tutorial.md):

- DeepGEMM does not support sm_120 yet, so it must be disabled via `SGLANG_ENABLE_JIT_DEEPGEMM=false` (see https://github.com/deepseek-ai/DeepGEMM/issues/236).
- FP8 KV cache could not be made to work, so BF16 is used instead (`--kv-cache-dtype bf16`).
- The triton attention backend crashes with an out-of-shared-memory error, so flashinfer is used (`--attention-backend flashinfer`).

```bash
SGLANG_ENABLE_JIT_DEEPGEMM=false \
CUDA_VISIBLE_DEVICES=0 \
uv run -m \
  sglang.launch_server \
  --host 0.0.0.0 --port 60000 \
  --model /DeepSeek-V3.2/ \
  --kt-weight-path /DeepSeek-V3.2-CPU-NUMA2-AMXINT4/ \
  --kt-cpuinfer 56 --kt-threadpool-count 2 --kt-num-gpu-experts 24 --kt-method AMXINT4 \
  --attention-backend flashinfer \
  --trust-remote-code \
  --mem-fraction-static 0.98 \
  --chunked-prefill-size 4096 \
  --max-running-requests 32 \
  --max-total-tokens 131072 \
  --enable-mixed-chunk \
  --tensor-parallel-size 1 \
  --enable-p2p-check \
  --disable-shared-experts-fusion \
  --tool-call-parser deepseekv32 \
  --reasoning-parser deepseek-v3 \
  --kv-cache-dtype bf16
```
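The thread and threadpool values (`--cpuinfer-threads 56` / `--threadpool-count 2` for conversion, `--kt-cpuinfer 56` / `--kt-threadpool-count 2` for serving) assume a host with two NUMA nodes and at least 56 physical cores. A quick sanity check of your own topology, using standard Linux tools rather than anything from ktransformers:

```bash
# Number of NUMA nodes; should be 2 to match --threadpool-count 2 / the NUMA2 build
lscpu | grep 'NUMA node(s)'

# Number of physical cores (unique core/socket pairs, hyperthreads excluded);
# --cpuinfer-threads / --kt-cpuinfer should not exceed this
lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l
```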
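Once the server is up, it can be smoke-tested through sglang's OpenAI-compatible API on the port configured above. A minimal sketch; the `model` value here is a placeholder and, depending on the sglang version, may be ignored or need to match the served model path:

```bash
# Liveness check
curl -s http://localhost:60000/health

# One short chat completion
curl -s http://localhost:60000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "DeepSeek-V3.2",
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 64
  }'
```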