infinityai committed
Commit dee3a0f · verified · 1 Parent(s): 376a145

Update README.md

Files changed (1): README.md (+28 -1)
README.md CHANGED
@@ -1,4 +1,31 @@
  ---
  base_model:
  - Qwen/Qwen3.5-397B-A17B
- ---
+ ---
+
+ Model: Qwen3.5-397B-A17B (Qwen's latest Mixture-of-Experts multimodal model)
+ What was done:
+
+ Applied REAP (Router-weighted Expert Activation Pruning) from Cerebras Research to prune 55% of MoE experts
+ Original model: 512 experts per layer, 10 active per token
+ Pruned model: 230 experts per layer, 10 active per token
+ Observation phase: 128 calibration samples from the evol-codealpaca-v1 dataset, cosine similarity scoring, seed 42
+ Pruning method: frequency-based (experts ranked by activation frequency across the calibration data, bottom 55% removed; a minimal sketch of this selection step follows this list)
+
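A minimal sketch of the frequency-based selection step described above. This is not the code from the REAP repository (whose saliency scoring also involves router weighting and the cosine-similarity signal mentioned above); the function name, `keep_ratio`, and the assumption that router logits are gathered per MoE layer over the calibration tokens are illustrative.

```python
import torch

def select_experts_by_frequency(router_logits, top_k=10, keep_ratio=0.45):
    """Rank experts by how often the router picks them in its top-k over the
    calibration data and return the indices of the experts to keep.

    router_logits: [num_tokens, num_experts] logits collected from one MoE
    layer while running the calibration samples.
    """
    num_experts = router_logits.shape[-1]
    # Which experts the router actually activates for each calibration token.
    topk_idx = router_logits.topk(top_k, dim=-1).indices      # [num_tokens, top_k]
    # Activation frequency per expert across the calibration data.
    counts = torch.bincount(topk_idx.reshape(-1), minlength=num_experts)
    # Keep the most frequently activated experts, drop the bottom 55%.
    n_keep = int(round(num_experts * keep_ratio))
    keep = counts.argsort(descending=True)[:n_keep]
    return keep.sort().values  # sorted indices of surviving experts
```

With 512 experts per layer and keep_ratio=0.45 this yields the 230 surviving experts quoted above.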
+ Key details:
+
+ Original model size: 752 GB (BF16)
+ Pruned safetensors: ~377 GB (BF16)
+ GGUF Q3_K_M: ~72 GB (estimated)
+ Architecture: 60 transformer layers, fused MoE experts (gate_up_proj + down_proj), linear attention + full attention pattern, Mamba SSM components
+ Expert tensors reduced from [512, 2048, 4096] to [230, 2048, 4096] per layer (the leading dimension is the expert count)
+ Router weights sliced accordingly (a slicing sketch follows this list)
+
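A hedged sketch of the corresponding weight surgery: given the surviving expert indices, the fused expert tensors and the router weight have their expert dimension sliced. The state-dict key names and the `model.layers.{i}.mlp` prefix are assumptions for illustration; only the fused gate_up_proj / down_proj layout and the [512, 2048, 4096] shapes come from this card.

```python
import torch

def prune_moe_layer(state_dict, layer_idx, keep_idx):
    """Slice one layer's fused expert tensors and its router weight down to
    the surviving experts (dim 0 of the fused tensors indexes experts)."""
    keep = torch.as_tensor(keep_idx, dtype=torch.long)
    prefix = f"model.layers.{layer_idx}.mlp"          # hypothetical key prefix

    # Fused expert weights: [512, 2048, 4096] -> [230, 2048, 4096].
    for name in ("experts.gate_up_proj", "experts.down_proj"):
        key = f"{prefix}.{name}"
        state_dict[key] = state_dict[key].index_select(0, keep).contiguous()

    # The router emits one logit per expert, so its expert (output) dimension
    # is sliced with the same indices.
    gate_key = f"{prefix}.gate.weight"                # hypothetical key name
    state_dict[gate_key] = state_dict[gate_key].index_select(0, keep).contiguous()
    return state_dict
```

The expert-count field in the model config also has to be updated to 230 so the router and the sliced tensors stay consistent.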
+ Tools used:
+
+ REAP: https://github.com/cerebras/reap
+ llama.cpp for GGUF conversion and quantization (a conversion sketch follows this list)
+
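For the GGUF step, a sketch of the usual llama.cpp workflow, wrapped in Python to keep a single language throughout this card. File names and paths are placeholders, and whether mainline llama.cpp fully supports this hybrid linear-attention/Mamba architecture is not confirmed here.

```python
import subprocess

# Convert the pruned BF16 safetensors checkpoint to GGUF with llama.cpp's
# converter, then quantize to Q3_K_M (the ~72 GB estimate above).
subprocess.run(
    ["python", "llama.cpp/convert_hf_to_gguf.py", "Qwen3.5-397B-A17B-REAP",
     "--outfile", "qwen3.5-reap-bf16.gguf", "--outtype", "bf16"],
    check=True,
)
subprocess.run(
    ["llama.cpp/build/bin/llama-quantize",
     "qwen3.5-reap-bf16.gguf", "qwen3.5-reap-Q3_K_M.gguf", "Q3_K_M"],
    check=True,
)
```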
+ Based on research:
+
+ Cerebras REAP paper, which reports that pruning 50%+ of experts retains 95%+ of baseline quality on code generation and reasoning tasks