pcuenq (HF Staff) committed
Commit b6c61db · verified · 1 Parent(s): 351dc77

Files changed (1)
  1. README.md +243 -1
README.md CHANGED

@@ -2,8 +2,250 @@
  library_name: mlx
  license: apache-2.0
  license_link: https://huggingface.co/Qwen/Qwen3.5-397B-A17B/blob/main/LICENSE
- pipeline_tag: text-generation
  base_model: Qwen/Qwen3.5-397B-A17B
+ pipeline_tag: text-generation
  tags:
  - mlx
+ - 4bit
+ - quantized
+ - qwen3_5_moe
+ - moe
+ - mixture-of-experts
+ - text-generation
+ - conversational
+ - apple-silicon
+ language:
+ - multilingual
  ---

The remaining added lines form the new model card body:

# Qwen3.5-397B-A17B-4bit (MLX)

4-bit [MLX](https://github.com/ml-explore/mlx) quantized version of the **text** model from [Qwen/Qwen3.5-397B-A17B](https://huggingface.co/Qwen/Qwen3.5-397B-A17B).

Portions of this card were copied or adapted from the original model card, authored by the Qwen team.

## Model Overview

Qwen3.5-397B-A17B is Alibaba's latest flagship language model. It uses a hybrid architecture that interleaves Gated DeltaNet (linear attention) with full attention and routes each token through a sparse Mixture-of-Experts FFN for high-throughput inference. Although the model has 397B parameters in total, only ~17B are activated per token, making it remarkably efficient for its capability level.

This conversion provides a **text-only** 4-bit quantized version optimized for local inference on Apple Silicon Macs via the MLX framework. The vision encoder from the original multimodal model is not included; for image/video understanding, refer to the original [Qwen/Qwen3.5-397B-A17B](https://huggingface.co/Qwen/Qwen3.5-397B-A17B).

### Key Capabilities

- **201 languages and dialects** with deep cultural and regional understanding
- **262K native context** (extensible to 1M+ with YaRN)
- **Thinking mode** with chain-of-thought reasoning (`<think>...</think>`)
- **Tool use and agentic workflows** (MCP, function calling)
- **Competitive benchmarks**: MMLU-Pro 87.8, SuperGPQA 70.4, C-Eval 93.0

## Architecture

| Parameter | Value |
|---|---|
| Total Parameters | 397B |
| Active Parameters | ~17B |
| Hidden Size | 4,096 |
| Layers | 60 |
| Layer Layout | 15 × (3 × Gated DeltaNet + 1 × Full Attention), all with MoE FFN |
| Total Experts | 512 |
| Active Experts per Token | 10 routed + 1 shared |
| Expert Intermediate Size | 1,024 |
| Full Attention Heads | 32 Q / 2 KV (GQA), head dim 256 |
| Linear Attention Heads | 16 QK / 64 V, head dim 128 |
| Context Length | 262,144 tokens |
| Vocab Size | 248,320 |

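As a rough sanity check on the sparsity implied by this table, the arithmetic below estimates how much of the expert weights a single token actually touches. It is an illustrative sketch that assumes each expert is a standard gated MLP (three projections between hidden size and expert-intermediate size) and that the shared expert has the same shape as a routed expert.

```python
# Back-of-the-envelope sketch using the table above; illustrative numbers only.
total_experts = 512
active_routed = 10            # routed experts selected per token
shared_experts = 1            # always-active shared expert (assumed same shape)
hidden_size = 4096
expert_intermediate = 1024
num_layers = 60

# Assumption: each expert is a gated MLP with gate/up/down projections.
params_per_expert = 3 * hidden_size * expert_intermediate

total_expert_params = num_layers * total_experts * params_per_expert
active_expert_params = num_layers * (active_routed + shared_experts) * params_per_expert

print(f"expert params, total:  {total_expert_params / 1e9:.1f}B")
print(f"expert params, active: {active_expert_params / 1e9:.1f}B "
      f"({active_expert_params / total_expert_params:.1%} per token)")
```

Under these assumptions the experts account for the bulk of the 397B parameters, while a token activates only about 2% of them, which is consistent with the ~17B active figure once attention layers, embeddings, and norms are added back in.
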
## Quantization Details

| Parameter | Value |
|---|---|
| Method | Affine quantization |
| Bits | 4-bit (weights) |
| Group Size | 64 |
| MoE Router Gates | 8-bit (preserved at higher precision) |
| Model Size on Disk | ~223 GB |

The MoE router gates (`mlp.gate` and `mlp.shared_expert_gate` for all 60 layers) are kept at 8-bit precision to preserve routing accuracy, which is critical for Mixture-of-Experts models.

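For reference, a similar conversion can be produced with `mlx-lm`'s `convert` utility. The sketch below is illustrative only: it assumes the `quant_predicate` hook available in recent `mlx-lm` releases for overriding the precision of individual layers, and the exact callback signature should be checked against your installed version.

```python
# Illustrative sketch, not the exact recipe used for this repo.
# Assumes mlx-lm's convert() and its quant_predicate hook for per-layer overrides.
from mlx_lm import convert

def keep_router_gates_8bit(path, module, config):
    # Keep the MoE routing gates at 8-bit; everything else uses the defaults below.
    if path.endswith("mlp.gate") or path.endswith("mlp.shared_expert_gate"):
        return {"bits": 8, "group_size": 64}
    return True

convert(
    hf_path="Qwen/Qwen3.5-397B-A17B",
    mlx_path="Qwen3.5-397B-A17B-4bit",
    quantize=True,
    q_bits=4,
    q_group_size=64,
    quant_predicate=keep_router_gates_8bit,
)
```
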
## Requirements

- Apple Silicon Mac with **at least 256 GB unified memory** (e.g., Mac Studio M2/M3/M4 Ultra with 256 GB+)
- Python 3.10+
- [`mlx-lm`](https://github.com/ml-explore/mlx-lm) v0.30.7 or later

> **Note**: Although only ~17B parameters are active per token, all 397B parameters (~223 GB quantized) must be loaded into unified memory.

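The ~223 GB figure is consistent with simple arithmetic: with affine 4-bit quantization at group size 64, each group of 64 weights also stores a (presumably 16-bit) scale and bias, which works out to roughly 4.5 bits per weight. This is a rough estimate that ignores the 8-bit router gates and any tensors left unquantized.

```python
# Rough size estimate for the 4-bit weights (illustrative arithmetic only).
total_params = 397e9
bits_per_weight = 4
group_size = 64
scale_and_bias_bits = 16 + 16                  # assumed fp16 scale + fp16 bias per group

effective_bits = bits_per_weight + scale_and_bias_bits / group_size   # 4.5 bits/weight
size_gb = total_params * effective_bits / 8 / 1e9
print(f"~{size_gb:.0f} GB")                    # ≈ 223 GB, matching the table above
```
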
## Installation

```bash
pip install mlx-lm
```

## Usage

### Quick Start — Python API

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("mlx-community/Qwen3.5-397B-A17B-4bit")

messages = [{"role": "user", "content": "Explain the Riemann hypothesis in simple terms."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Recent mlx-lm versions take sampling settings via a sampler object
# rather than temp/top_p keyword arguments on generate().
sampler = make_sampler(temp=0.6, top_p=0.95)

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=4096,
    sampler=sampler,
    verbose=True,
)
```

### Thinking Mode (Default)

The model defaults to thinking mode, producing chain-of-thought reasoning inside `<think>...</think>` tags before the final answer:

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("mlx-community/Qwen3.5-397B-A17B-4bit")

messages = [
    {"role": "user", "content": "How many r's are in the word 'strawberry'?"}
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

sampler = make_sampler(temp=0.6, top_p=0.95)

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=8192,
    sampler=sampler,
    verbose=True,
)
```

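To separate the reasoning trace from the final answer programmatically, a plain string split on the closing tag is enough. This is a minimal sketch that assumes the output contains at most one `<think>...</think>` block, as produced in thinking mode:

```python
# Minimal sketch: split generated text into the reasoning trace and the final answer.
def split_thinking(text: str) -> tuple[str, str]:
    open_tag, close_tag = "<think>", "</think>"
    if close_tag in text:
        reasoning, answer = text.split(close_tag, 1)
        return reasoning.replace(open_tag, "").strip(), answer.strip()
    return "", text.strip()

reasoning, answer = split_thinking(response)
print("Final answer:", answer)
```
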
### Non-Thinking Mode

For faster, more direct responses without chain-of-thought reasoning:

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("mlx-community/Qwen3.5-397B-A17B-4bit")

messages = [
    {"role": "user", "content": "Write a haiku about machine learning."}
]
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=False,
)

sampler = make_sampler(temp=0.7, top_p=0.8)

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=2048,
    sampler=sampler,
    verbose=True,
)
```

### Command Line

```bash
# Thinking mode (default)
mlx_lm.generate \
  --model mlx-community/Qwen3.5-397B-A17B-4bit \
  --prompt "What are the key differences between TCP and UDP?" \
  --max-tokens 4096 \
  --temp 0.6 \
  --top-p 0.95

# Start a local chat server (OpenAI-compatible)
mlx_lm.server --model mlx-community/Qwen3.5-397B-A17B-4bit
```

### Local OpenAI-Compatible Server

Start the server:

```bash
mlx_lm.server --model mlx-community/Qwen3.5-397B-A17B-4bit --port 8080
```

Then query it with any OpenAI-compatible client:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

response = client.chat.completions.create(
    model="mlx-community/Qwen3.5-397B-A17B-4bit",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a Python function to find all prime numbers up to n using the Sieve of Eratosthenes."},
    ],
    max_tokens=4096,
    temperature=0.6,
    top_p=0.95,
)
print(response.choices[0].message.content)
```

Or with `curl`:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3.5-397B-A17B-4bit",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 512,
    "temperature": 0.6
  }'
```

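Streaming works through the same endpoint via the standard OpenAI streaming interface; a brief sketch (assuming your installed `mlx-lm` server version supports streamed responses):

```python
# Sketch: stream tokens from the local server as they are generated.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

stream = client.chat.completions.create(
    model="mlx-community/Qwen3.5-397B-A17B-4bit",
    messages=[{"role": "user", "content": "Explain grouped-query attention in two sentences."}],
    max_tokens=512,
    temperature=0.6,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```
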
## Recommended Generation Parameters

| Parameter | Thinking Mode | Non-Thinking Mode |
|---|---|---|
| `temperature` | 0.6 | 0.7 |
| `top_p` | 0.95 | 0.8 |
| `top_k` | 20 | 20 |
| `presence_penalty` | 0.0 | 1.5 |
| `repetition_penalty` | 1.0 | 1.0 |
| `max_tokens` (general) | 32,768 | 32,768 |
| `max_tokens` (math/code) | 81,920 | — |

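In recent `mlx-lm` versions, the sampling rows of this table map onto `make_sampler` and `make_logits_processors`. The sketch below is illustrative; `presence_penalty` is not covered here, and the exact helper signatures should be checked against your installed version.

```python
# Sketch: applying the thinking-mode row of the table with mlx-lm's sampling helpers.
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

model, tokenizer = load("mlx-community/Qwen3.5-397B-A17B-4bit")

sampler = make_sampler(temp=0.6, top_p=0.95, top_k=20)
logits_processors = make_logits_processors(repetition_penalty=1.0)

messages = [{"role": "user", "content": "Prove that the square root of 2 is irrational."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=32768,
    sampler=sampler,
    logits_processors=logits_processors,
    verbose=True,
)
```
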
## Tips

- **Thinking mode** is best for complex reasoning, math, and coding tasks. The model produces its internal reasoning before answering.
- **Non-thinking mode** is better for straightforward Q&A, creative writing, and conversational use where latency matters.
- For **math problems**, append: *"Please reason step by step, and put your final answer within \boxed{}."*
- For **multi-turn conversations**, the default chat template automatically strips thinking content from prior turns.
- If you run into **memory pressure**, consider closing other applications to free unified memory.

## Original Model

This is a quantized version of [Qwen/Qwen3.5-397B-A17B](https://huggingface.co/Qwen/Qwen3.5-397B-A17B). Refer to the original model card for full benchmark results, training details, and the technical report.

## Citation

```bibtex
@misc{qwen3.5,
  title  = {{Qwen3.5}: Towards Native Multimodal Agents},
  author = {{Qwen Team}},
  month  = {February},
  year   = {2026},
  url    = {https://qwen.ai/blog?id=qwen3.5}
}
```