alexmarques committed on
Commit
f846768
·
verified ·
1 Parent(s): a18578d

Create README.md

Files changed (1)
  1. README.md +461 -0
README.md ADDED
@@ -0,0 +1,461 @@
---
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
base_model:
- Qwen/Qwen3-1.7B
tags:
- neuralmagic
- redhat
- llmcompressor
- quantized
- INT4
---

# Qwen3-1.7B-quantized.w4a16

## Model Overview
- **Model Architecture:** Qwen3ForCausalLM
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** INT4
- **Intended Use Cases:**
  - Reasoning.
  - Function calling.
  - Subject matter experts via fine-tuning.
  - Multilingual instruction following.
  - Translation.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- **Release Date:** 05/05/2025
- **Version:** 1.0
- **Model Developers:** RedHat (Neural Magic)

### Model Optimizations

This model was obtained by quantizing the weights of [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) to the INT4 data type.
This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements of the quantized weights by approximately 75%.

Only the weights of the linear operators within transformer blocks are quantized.
Weights are quantized using an asymmetric per-group scheme, with group size 64.
The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.

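To make the scheme concrete, here is a minimal sketch of asymmetric per-group INT4 quantization using plain round-to-nearest. It is not GPTQ (which additionally compensates rounding error using calibration data) and it is not llm-compressor's implementation. All function names are illustrative, and the storage cost assumed in `effective_bits_per_weight` (one 16-bit scale and one 4-bit zero point per group) is an assumption rather than something stated in this model card.

```python
# Illustrative sketch only: round-to-nearest, not GPTQ, and not llm-compressor code.

def quantize_group(weights, num_bits=4):
    """Asymmetric quantization of one group: map [min, max] onto [0, 2**num_bits - 1]."""
    qmax = 2 ** num_bits - 1
    w_min, w_max = min(weights), max(weights)
    # Crude guard: a constant group would otherwise produce a zero scale.
    scale = (w_max - w_min) / qmax or 1.0
    zero_point = round(-w_min / scale)
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate floating-point weights."""
    return [(q - zero_point) * scale for q in codes]


def quantize_per_group(row, group_size=64, num_bits=4):
    """Split one weight row into consecutive groups, each with its own scale and zero point."""
    return [
        quantize_group(row[i:i + group_size], num_bits)
        for i in range(0, len(row), group_size)
    ]


def effective_bits_per_weight(num_bits=4, group_size=64, scale_bits=16, zero_point_bits=4):
    # Assumption: one 16-bit scale and one 4-bit zero point are stored per group.
    return num_bits + (scale_bits + zero_point_bits) / group_size


# A deterministic toy row of 128 weights gives two groups of 64.
toy_row = [((i * 37) % 101 - 50) / 1000.0 for i in range(128)]
toy_groups = quantize_per_group(toy_row)
print(len(toy_groups), "groups,", effective_bits_per_weight(), "bits per weight")
```

Under these assumptions the per-group metadata adds a fraction of a bit per weight on top of the 4-bit codes, which is why the saving on the quantized layers is close to, but not exactly, 75%.
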
## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Qwen3-1.7B-quantized.w4a16"
number_gpus = 1
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, min_p=0, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build a single-turn conversation and render it with the model's chat template
messages = [{"role": "user", "content": "Give me a short introduction to large language model."}]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.

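As a rough sketch of the OpenAI-compatible route, the snippet below builds a chat-completions request using only the Python standard library. It assumes a server was started separately (for example with `vllm serve RedHatAI/Qwen3-1.7B-quantized.w4a16`) and is listening on vLLM's default `http://localhost:8000`. The helper names `build_chat_request` and `send_chat_request` are hypothetical, and passing `top_k` and `min_p` in the request body relies on vLLM's extra sampling parameters rather than the standard OpenAI schema.

```python
import json
import urllib.request

MODEL_ID = "RedHatAI/Qwen3-1.7B-quantized.w4a16"
BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM's default host and port


def build_chat_request(prompt, max_tokens=256):
    """Return the JSON payload for the /chat/completions endpoint."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,   # vLLM-specific extra parameter
        "min_p": 0.0,  # vLLM-specific extra parameter
        "max_tokens": max_tokens,
    }


def send_chat_request(payload, base_url=BASE_URL):
    """POST the payload and return the generated text (requires a running server)."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_chat_request("Give me a short introduction to large language model.")
print(json.dumps(payload, indent=2))
# With a server running, uncomment the next line to generate text:
# print(send_chat_request(payload))
```

The sampling values mirror the `SamplingParams` used in the offline example, so both routes should sample in a comparable way.
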
## Creation

<details>
<summary>Creation details</summary>

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

```python
from datasets import load_dataset
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model_stub = "Qwen/Qwen3-1.7B"
model_name = model_stub.split("/")[-1]

num_samples = 1024
max_seq_len = 8192

model = AutoModelForCausalLM.from_pretrained(model_stub)

tokenizer = AutoTokenizer.from_pretrained(model_stub)

# Render each calibration conversation into a single string with the chat template
def preprocess_fn(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}

ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.map(preprocess_fn)

# Configure the quantization algorithm and scheme
recipe = GPTQModifier(
    ignore=["lm_head"],
    sequential_targets=["Qwen3DecoderLayer"],
    targets="Linear",
    dampening_frac=0.01,
    config_groups={
        "group0": {
            "targets": ["Linear"],
            "weights": {
                "num_bits": 4,
                "type": "int",
                "strategy": "group",
                "group_size": 64,
                "symmetric": False,
                "actorder": "weight",
                "observer": "mse",
            },
        },
    },
)

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-quantized.w4a16"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```
</details>

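For readers unfamiliar with `dampening_frac`, the sketch below shows the diagonal dampening step as it appears in the GPTQ reference implementation: a fraction of the mean Hessian diagonal is added to every diagonal entry before the matrix is inverted, which keeps it well conditioned. This is a pure-Python illustration under the assumption that llm-compressor follows the same convention; `dampen_hessian` is a hypothetical name, not a library function.

```python
def dampen_hessian(hessian, dampening_frac=0.01):
    """Return a copy of a square matrix with dampening_frac * mean(diag) added to its diagonal."""
    n = len(hessian)
    mean_diag = sum(hessian[i][i] for i in range(n)) / n
    damp = dampening_frac * mean_diag
    return [
        [value + damp if i == j else value for j, value in enumerate(row)]
        for i, row in enumerate(hessian)
    ]


# Toy 2x2 example: the second diagonal entry is zero, so the matrix is singular
# until dampening makes every diagonal entry strictly positive.
toy_hessian = [[4.0, 0.0], [0.0, 0.0]]
print(dampen_hessian(toy_hessian))
```
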
## Evaluation

The model was evaluated on the OpenLLM leaderboard tasks (versions 1 and 2), using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), and on reasoning tasks using [lighteval](https://github.com/neuralmagic/lighteval/tree/reasoning).
[vLLM](https://docs.vllm.ai/en/stable/) was used for all evaluations.

<details>
<summary>Evaluation details</summary>

**lm-evaluation-harness**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Qwen3-1.7B-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunk_prefill=True,tensor_parallel_size=1 \
  --tasks openllm \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Qwen3-1.7B-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunk_prefill=True,tensor_parallel_size=1 \
  --tasks mgsm \
  --apply_chat_template \
  --batch_size auto
```

```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Qwen3-1.7B-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.5,max_model_len=16384,enable_chunk_prefill=True,tensor_parallel_size=1 \
  --tasks leaderboard \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**lighteval**

lighteval_model_arguments.yaml
```yaml
model_parameters:
  model_name: RedHatAI/Qwen3-1.7B-quantized.w4a16
  dtype: auto
  gpu_memory_utilization: 0.9
  max_model_length: 40960
  generation_parameters:
    temperature: 0.6
    top_k: 20
    min_p: 0.0
    top_p: 0.95
    max_new_tokens: 32768
```

```
lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "lighteval|aime24|0|0" \
  --use_chat_template = true
```

```
lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "lighteval|aime25|0|0" \
  --use_chat_template = true
```

```
lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "lighteval|math_500|0|0" \
  --use_chat_template = true
```

```
lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "lighteval|gpqa:diamond|0|0" \
  --use_chat_template = true
```

```
lighteval vllm \
  --model_args lighteval_model_arguments.yaml \
  --tasks "extended|lcb:codegeneration" \
  --use_chat_template = true
```

</details>

### Accuracy

<table>
  <tr>
    <th>Category</th>
    <th>Benchmark</th>
    <th>Qwen3-1.7B</th>
    <th>Qwen3-1.7B-quantized.w4a16<br>(this model)</th>
    <th>Recovery</th>
  </tr>
  <tr>
    <td rowspan="7"><strong>OpenLLM v1</strong></td>
    <td>MMLU (5-shot)</td>
    <td>56.82</td>
    <td>55.13</td>
    <td>97.0%</td>
  </tr>
  <tr>
    <td>ARC Challenge (25-shot)</td>
    <td>43.00</td>
    <td>41.38</td>
    <td>96.2%</td>
  </tr>
  <tr>
    <td>GSM-8K (5-shot, strict-match)</td>
    <td>43.67</td>
    <td>30.63</td>
    <td>70.1%</td>
  </tr>
  <tr>
    <td>Hellaswag (10-shot)</td>
    <td>48.08</td>
    <td>46.07</td>
    <td>95.8%</td>
  </tr>
  <tr>
    <td>Winogrande (5-shot)</td>
    <td>58.01</td>
    <td>55.80</td>
    <td>96.2%</td>
  </tr>
  <tr>
    <td>TruthfulQA (0-shot, mc2)</td>
    <td>49.35</td>
    <td>51.91</td>
    <td>105.2%</td>
  </tr>
  <tr>
    <td><strong>Average</strong></td>
    <td><strong>49.82</strong></td>
    <td><strong>46.82</strong></td>
    <td><strong>94.0%</strong></td>
  </tr>
  <tr>
    <td rowspan="7"><strong>OpenLLM v2</strong></td>
    <td>MMLU-Pro (5-shot)</td>
    <td>23.45</td>
    <td>20.09</td>
    <td>85.7%</td>
  </tr>
  <tr>
    <td>IFEval (0-shot)</td>
    <td>71.08</td>
    <td>68.19</td>
    <td>95.9%</td>
  </tr>
  <tr>
    <td>BBH (3-shot)</td>
    <td>7.13</td>
    <td>5.71</td>
    <td>---</td>
  </tr>
  <tr>
    <td>Math-lvl-5 (4-shot)</td>
    <td>35.91</td>
    <td>30.97</td>
    <td>86.2%</td>
  </tr>
  <tr>
    <td>GPQA (0-shot)</td>
    <td>0.11</td>
    <td>0.00</td>
    <td>---</td>
  </tr>
  <tr>
    <td>MuSR (0-shot)</td>
    <td>7.97</td>
    <td>9.20</td>
    <td>---</td>
  </tr>
  <tr>
    <td><strong>Average</strong></td>
    <td><strong>24.28</strong></td>
    <td><strong>22.36</strong></td>
    <td><strong>92.1%</strong></td>
  </tr>
  <tr>
    <td><strong>Multilingual</strong></td>
    <td>MGSM (0-shot)</td>
    <td>22.10</td>
    <td>13.10</td>
    <td>59.3%</td>
  </tr>
  <tr>
    <td rowspan="5"><strong>Reasoning<br>(generation)</strong></td>
    <td>AIME 2024</td>
    <td>43.96</td>
    <td>32.08</td>
    <td>73.0%</td>
  </tr>
  <tr>
    <td>AIME 2025</td>
    <td>32.29</td>
    <td>28.23</td>
    <td>87.4%</td>
  </tr>
  <tr>
    <td>GPQA diamond</td>
    <td>38.38</td>
    <td>34.85</td>
    <td>90.8%</td>
  </tr>
  <tr>
    <td>Math-lvl-5</td>
    <td>89.00</td>
    <td>89.40</td>
    <td>100.5%</td>
  </tr>
  <tr>
    <td>LiveCodeBench</td>
    <td>33.44</td>
    <td>26.40</td>
    <td>79.0%</td>
  </tr>
</table>
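
The Recovery column appears to be the quantized score expressed as a percentage of the baseline score. The sketch below reproduces a few rows under that assumption. The function name `recovery` is illustrative, and small last-digit differences from the table are possible where the published figures were computed from unrounded scores.

```python
def recovery(baseline, quantized):
    """Quantized score as a percentage of the baseline score, rounded to one decimal."""
    return round(100.0 * quantized / baseline, 1)


# (benchmark, baseline score, quantized score) copied from the accuracy table
rows = [
    ("MMLU (5-shot)", 56.82, 55.13),
    ("GSM-8K (5-shot, strict-match)", 43.67, 30.63),
    ("MGSM (0-shot)", 22.10, 13.10),
]
for name, base, quant in rows:
    print(f"{name}: {recovery(base, quant)}%")
```

Values above 100% (as for TruthfulQA) simply mean the quantized model scored higher than the baseline on that benchmark.
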