RyuichiLT committed on
Commit 0c5f4f0 · verified · 1 Parent(s): c787b50

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +53 -0
README.md ADDED
@@ -0,0 +1,53 @@
---
base_model: zai-org/GLM-5
library_name: mlx
license: mit
tags:
- mlx
- safetensors
- glm_moe_dsa
- conversational
- text-generation
- mxfp4
- quantized
language:
- en
- zh
---

# mlx-community/GLM-5-MXFP4-Q8

This model was converted to MLX format from [`zai-org/GLM-5`](https://huggingface.co/zai-org/GLM-5) using a custom MXFP4-Q8 quantization scheme.

GLM-5 is a 744B-parameter (40B active) Mixture-of-Experts model developed by Z.ai, targeting complex systems engineering and long-horizon agentic tasks. It uses Multi-Head Latent Attention (MLA) with 47 transformer layers, 64 routed experts (4 active per token), and 1 shared expert.
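
To make the routing described above concrete, here is a minimal NumPy sketch of top-4 routing over 64 experts plus an always-active shared expert. It is illustrative only: `router_w`, `experts`, and `shared_expert` are hypothetical stand-ins, and the real model implements this with fused, optimized kernels.

```python
import numpy as np

def moe_layer(x, router_w, experts, shared_expert, k=4):
    # Router scores: one logit per (token, expert) pair.
    logits = x @ router_w                              # (n_tokens, 64)
    # Pick the 4 highest-scoring experts for each token.
    topk = np.argsort(logits, axis=-1)[:, -k:]         # (n_tokens, 4)
    gates = np.take_along_axis(logits, topk, axis=-1)
    gates = np.exp(gates - gates.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)         # softmax over the selected 4
    # The shared expert processes every token unconditionally.
    out = shared_expert(x)
    # Each token additionally gets a gate-weighted sum of its 4 routed experts.
    for t in range(x.shape[0]):
        for j in range(k):
            out[t] += gates[t, j] * experts[topk[t, j]](x[t])
    return out

# Tiny smoke test with random linear experts (dimensions are made up).
d = 8
x = np.random.randn(3, d)
router_w = np.random.randn(d, 64)
experts = [lambda v, W=np.random.randn(d, d): v @ W for _ in range(64)]
shared = lambda v, W=np.random.randn(d, d): v @ W
print(moe_layer(x, router_w, experts, shared).shape)   # (3, 8)
```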

## Quantization

This model uses mixed-precision quantization: the expert weights, which account for the bulk of the parameters, are quantized to 4-bit MXFP4, while everything else is kept at 8-bit affine precision.

| Component | Mode | Bits | Group Size |
|---|---|---|---|
| Expert weights (switch_mlp) | MXFP4 | 4 | 32 |
| Attention, embeddings, shared expert, dense MLP, lm_head | Affine | 8 | 64 |
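
For intuition, MXFP4 stores each group of 32 weights as 4-bit FP4 (E2M1) values that share one power-of-two scale, which is where the group size of 32 in the table comes from. Below is a minimal NumPy sketch of the round-trip for a single group; the actual MLX kernels operate on packed bit layouts and may differ in scale selection and rounding details.

```python
import numpy as np

# The eight magnitudes representable in FP4 E2M1 (sign is a separate bit).
FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_roundtrip_group(w):
    """Quantize and dequantize one 32-element group: a shared power-of-two
    scale plus a 4-bit E2M1 code per element (~4.25 bits/weight overall).
    Illustrative sketch, not the packed layout used by the MLX kernels."""
    assert w.size == 32
    amax = np.abs(w).max()
    # Shared scale: the power of two that brings the group's largest
    # magnitude near 6.0, the largest representable E2M1 value.
    scale = 2.0 ** np.floor(np.log2(amax / 6.0)) if amax > 0 else 1.0
    scaled = w / scale
    # Round each element to the nearest representable magnitude, keep the sign.
    idx = np.abs(np.abs(scaled)[:, None] - FP4_VALUES[None, :]).argmin(axis=1)
    codes = np.where(scaled < 0, -FP4_VALUES[idx], FP4_VALUES[idx])
    return codes * scale

w = np.random.randn(32).astype(np.float32)
print(np.abs(w - mxfp4_roundtrip_group(w)).max())  # worst-case error in this group
```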

## Use with mlx-lm

```bash
pip install mlx-lm
```

```python
from mlx_lm import load, generate

# Download (if necessary) and load the quantized weights and tokenizer.
model, tokenizer = load("mlx-community/GLM-5-MXFP4-Q8")

prompt = "hello"

# Wrap the prompt in the model's chat template when one is available.
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
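
The mlx-lm command-line generator should also work (assuming the standard `mlx_lm.generate` entry point shipped with recent mlx-lm releases):

```bash
mlx_lm.generate --model mlx-community/GLM-5-MXFP4-Q8 --prompt "hello" --max-tokens 256
```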