infinityai committed
Commit 2a1f291 · verified · 1 Parent(s): 8b201b6

Create README.md

Files changed (1)
  1. README.md +26 -0
README.md ADDED
@@ -0,0 +1,26 @@
+ Model: Qwen3.5-397B-A17B (Qwen's latest Mixture-of-Experts multimodal model)
+ What was done:
+
+ Applied REAP (Router-weighted Expert Activation Pruning) from Cerebras Research to prune 55% of the MoE experts
+ Original model: 512 experts per layer, 10 active per token
+ Pruned model: 230 experts per layer, 10 active per token
+ Observation phase: 128 calibration samples from the evol-codealpaca-v1 dataset, cosine similarity scoring, seed 42
+ Pruning method: frequency-based (experts ranked by activation frequency across the calibration data, bottom 55% removed; see the sketch below)
+
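+ A minimal sketch of the observation/scoring step under the frequency criterion described above, assuming a Transformers-style model that can return per-layer router logits; the `output_router_logits` flag, attribute names, and the calibration loader are assumptions for illustration, not the actual REAP implementation:
+
+ ```python
+ # Sketch: count how often each expert is routed to over the calibration set,
+ # then keep the most frequently activated experts per layer.
+ import torch
+
+ NUM_EXPERTS, TOP_K, KEEP = 512, 10, 230   # per-layer values for this model
+
+ @torch.no_grad()
+ def score_experts(model, calib_batches, num_layers):
+     counts = torch.zeros(num_layers, NUM_EXPERTS)
+     for batch in calib_batches:  # e.g. 128 tokenized samples from evol-codealpaca-v1
+         out = model(**batch, output_router_logits=True)
+         for layer, logits in enumerate(out.router_logits):    # [tokens, NUM_EXPERTS]
+             chosen = logits.topk(TOP_K, dim=-1).indices.flatten()
+             counts[layer].scatter_add_(0, chosen, torch.ones(chosen.numel()))
+     return counts
+
+ def experts_to_keep(counts):
+     # Keep the 230 most active experts per layer, i.e. drop the bottom ~55%.
+     return [layer_counts.topk(KEEP).indices.sort().values for layer_counts in counts]
+ ```
+
+ The cosine-similarity scoring mentioned above would plug into the same observation loop as an additional signal; only the raw activation-frequency counts are shown here.
+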
+ Key details:
+
+ Original model size: 752GB (BF16)
+ Pruned safetensors: ~377GB (BF16)
+ GGUF Q3_K_M: ~72GB (estimated)
+ Architecture: 60 transformer layers, fused MoE experts (gate_up_proj + down_proj), linear attention + full attention pattern, Mamba SSM components
+ Expert tensors reduced from [512, 2048, 4096] to [230, 2048, 4096] per layer
+ Router weights sliced accordingly (see the slicing sketch below)
+
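+ A minimal sketch of that slicing step, assuming the fused expert weights are stored expert-major (experts along dim 0, matching the [512, 2048, 4096] shape above) and that the router is a linear gate with one output row per expert; the state-dict key names are hypothetical, not the model's actual module paths:
+
+ ```python
+ # Sketch: restrict fused expert tensors and the router to the kept expert indices.
+ import torch
+
+ def prune_moe_layer(state_dict, layer_idx, keep_idx):
+     """keep_idx: sorted LongTensor with the 230 expert indices retained for this layer."""
+     expert_prefix = f"model.layers.{layer_idx}.mlp.experts."   # hypothetical key prefix
+     for name in ("gate_up_proj", "down_proj"):
+         key = expert_prefix + name
+         # Experts sit along dim 0, so e.g. [512, 2048, 4096] -> [230, 2048, 4096].
+         state_dict[key] = state_dict[key].index_select(0, keep_idx).contiguous()
+     # Router weight maps hidden states to one logit per expert; keep matching rows.
+     gate_key = f"model.layers.{layer_idx}.mlp.gate.weight"      # hypothetical key
+     state_dict[gate_key] = state_dict[gate_key].index_select(0, keep_idx).contiguous()
+     return state_dict
+ ```
+
+ The config's expert count (and anything derived from it) also has to be updated from 512 to 230 so that the router's top-10 selection indexes into the smaller expert set.
+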
+ Tools used:
+
+ REAP: https://github.com/cerebras/reap
+ llama.cpp for GGUF conversion and quantization (example flow sketched below)
+
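+ For the GGUF step, a sketch of the usual llama.cpp flow, driven from Python for consistency with the examples above; the paths are placeholders and the BF16 intermediate is an assumption (the converter also has to support this architecture):
+
+ ```python
+ # Sketch: convert the pruned safetensors checkpoint to GGUF, then quantize to Q3_K_M.
+ import subprocess
+
+ # 1) HF checkpoint -> GGUF (run from a llama.cpp checkout).
+ subprocess.run(
+     ["python", "convert_hf_to_gguf.py", "path/to/pruned-model",
+      "--outtype", "bf16", "--outfile", "pruned-bf16.gguf"],
+     check=True,
+ )
+
+ # 2) GGUF -> Q3_K_M with llama.cpp's quantize tool.
+ subprocess.run(
+     ["./llama-quantize", "pruned-bf16.gguf", "pruned-Q3_K_M.gguf", "Q3_K_M"],
+     check=True,
+ )
+ ```
+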
+ Based on research:
+
+ Cerebras REAP paper, which shows that 50%+ expert pruning retains 95%+ of baseline quality on code generation and reasoning tasks