---
license: mit
base_model:
- deepseek-ai/DeepSeek-R1
---


# Model Overview

- **Model Architecture:** DeepSeek-R1
  - **Input:** Text
  - **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm**: 7.0
- **PyTorch**: 2.8.0
- **Transformers**: 4.53.0
- **Operating System(s):** Linux
- **Inference Engine:** [SGLang](https://docs.sglang.ai/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (V0.10)
  - **Weight quantization:** OCP MXFP4, Static
  - **Activation quantization:** OCP MXFP4, Dynamic
  - **KV cache**: OCP FP8, Static
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model was built from the deepseek-ai DeepSeek-R1 model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for MXFP4 quantization.

# Model Quantization

The model was quantized from [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). Both weights and activations were quantized to the MXFP4 format, and the KV cache to FP8.

**Preprocessing requirement:**

Before executing the quantization script below, the original FP8 model must first be dequantized to BFloat16.
You can either perform the dequantization manually using this [conversion script](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py), or use the pre-converted BFloat16 model available at [unsloth/DeepSeek-R1-BF16](https://huggingface.co/unsloth/DeepSeek-R1-BF16).
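
A minimal sketch of the manual route using the conversion script above; the argument names and local paths are assumptions taken from the upstream script and should be verified before running:
```
# Dequantize the original FP8 checkpoint to BFloat16 with the DeepSeek-V3 conversion script.
# The paths below are placeholders for your local checkpoints.
git clone https://github.com/deepseek-ai/DeepSeek-V3.git
python3 DeepSeek-V3/inference/fp8_cast_bf16.py \
    --input-fp8-hf-path /path/to/DeepSeek-R1 \
    --output-bf16-hf-path /path/to/DeepSeek-R1-BF16
```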

**Quantization scripts:**
```
cd Quark/examples/torch/language_modeling/llm_ptq/
python3 quantize_quark.py --model_dir $MODEL_DIR \
                          --quant_scheme w_mxfp4_a_mxfp4 \
                          --group_size 32 \
                          --num_calib_data 128 \
                          --exclude_layers "lm_head" \
                          --skip_evaluation \
                          --multi_device  \
                          --model_export hf_format \
                          --output_dir amd/DeepSeek-R1-MXFP4-Preview
```
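
Here `$MODEL_DIR` should point to the BFloat16 checkpoint produced in the preprocessing step (or to the pre-converted [unsloth/DeepSeek-R1-BF16](https://huggingface.co/unsloth/DeepSeek-R1-BF16) model).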

# Deployment
## Use with SGLang

This model can be deployed efficiently using the [SGLang](https://docs.sglang.ai/) backend.
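
A minimal serving sketch is shown below; the launch flags mirror the evaluation setup later in this card, and the GPU count and example request are assumptions rather than requirements:
```
# Launch an OpenAI-compatible server with SGLang
python3 -m sglang.launch_server \
    --model amd/DeepSeek-R1-MXFP4-Preview \
    --tp 8 \
    --trust-remote-code

# Query the server once it is up (SGLang listens on port 30000 by default)
curl http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "amd/DeepSeek-R1-MXFP4-Preview", "messages": [{"role": "user", "content": "Explain MXFP4 quantization in one sentence."}], "max_tokens": 256}'
```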

# Evaluation

The model was evaluated using [SGLang](https://docs.sglang.ai/) and [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) frameworks. 

## Accuracy

| Benchmark | DeepSeek-R1 | DeepSeek-R1-MXFP4-Preview (this model) | Recovery |
|-----------|-------------|----------------------------------------|----------|
| AIME24    | 78.0        | 69.57                                  | 89.19%   |
| GSM8K     | 95.81       | 93.95                                  | 98.05%   |
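
Recovery is the quantized model's score divided by the baseline score, e.g. 69.57 / 78.0 ≈ 89.19% for AIME24.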


## Reproduction

The AIME24 result was obtained using [SGLang](https://docs.sglang.ai/), while the GSM8K result was obtained using [vLLM](https://docs.vllm.ai/en/latest/). Both evaluations were run with a forked [lm-evaluation-harness](https://github.com/BowenBao/lm-evaluation-harness/tree/cot).

### AIME24
```
# Launching server
python3 -m sglang.launch_server \
    --model amd/DeepSeek-R1-MXFP4-Preview \
    --tp 8  \
    --trust-remote-code  \
    --n-share-experts-fusion 8 \
    --disable-radix-cache

# Evaluating
lm_eval --model local-completions \
    --model_args model=amd/DeepSeek-R1-MXFP4-Preview,base_url=http://localhost:30000/v1/completions,num_concurrent=999999,timeout=999999,tokenized_requests=False,max_length=32000,temperature=0.6,top_p=0.95 \
    --tasks aime24 \
    --num_fewshot 0 \
    --gen_kwargs "do_sample=True,temperature=0.6,top_p=0.95,max_tokens=32000" \
    --batch_size auto \
    --log_samples \
    --output_path output_data/aime24 2>&1 | tee logs/aime24.log
```

### GSM8K
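As noted above, the GSM8K result was obtained with vLLM, so an OpenAI-compatible server must be listening at the `base_url` used by the harness. A minimal launch sketch (the flags, including the port, are assumptions and may need adjusting for your setup):
```
# Serve the model with vLLM's OpenAI-compatible server on the port the harness expects
vllm serve amd/DeepSeek-R1-MXFP4-Preview \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --port 30000
```
With the server up, run the harness: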
```
lm_eval --model local-completions \
    --model_args model=amd/DeepSeek-R1-MXFP4-Preview,base_url=http://localhost:30000/v1/completions,num_concurrent=999999,timeout=999999,tokenized_requests=False,max_length=8096 \
    --tasks gsm8k \
    --num_fewshot 5 \
    --batch_size auto \
    --log_samples \
    --output_path output_data/gsm8k 2>&1 | tee logs/gsm8k.log
```

# License
Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.