Model: Qwen3.5-397B-A17B (Qwen's latest Mixture-of-Experts multimodal model)

What was done:

- Applied REAP (Router-weighted Expert Activation Pruning) from Cerebras Research to prune 55% of the MoE experts
- Original model: 512 experts per layer, 10 active per token
- Pruned model: 230 experts per layer, 10 active per token
- Observation phase: 128 calibration samples from the evol-codealpaca-v1 dataset, cosine similarity scoring, seed 42
- Pruning method: frequency-based; experts are ranked by activation frequency across the calibration data and the bottom 55% are removed (see the sketch below)

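For context, a minimal sketch of that frequency-based selection step, assuming the router's top-k expert choices for one MoE layer have already been recorded over the 128 calibration samples; the function name, tensor layout, and keep count are illustrative and not taken from the REAP codebase.

```python
import torch

def select_experts_to_keep(router_topk_indices: torch.Tensor,
                           num_experts: int = 512,
                           num_keep: int = 230) -> torch.Tensor:
    """Rank experts by how often the router selected them and keep the top ones.

    router_topk_indices: LongTensor of shape [num_calibration_tokens, top_k]
    holding the expert ids chosen for each token in one MoE layer.
    """
    # Count how many calibration tokens were routed to each expert.
    counts = torch.bincount(router_topk_indices.reshape(-1), minlength=num_experts)
    # Keep the most frequently activated experts; the bottom ~55% are dropped.
    keep = torch.topk(counts, k=num_keep).indices
    # Return the surviving expert ids in their original order so the weight
    # tensors and router rows can be sliced consistently.
    return torch.sort(keep).values
```

The returned index list is then used to slice that layer's expert weight tensors and router rows (see the slicing sketch under the key details below).
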
Key details:

- Original model size: 752 GB (BF16)
- Pruned safetensors: ~377 GB (BF16)
- GGUF Q3_K_M: ~72 GB (estimated)
- Architecture: 60 transformer layers, fused MoE experts (gate_up_proj + down_proj), linear attention + full attention pattern, Mamba SSM components
- Fused expert weight tensors reduced from [512, 2048, 4096] → [230, 2048, 4096] per layer
- Router weights sliced accordingly (see the sketch below)

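As an illustration of that slicing step, here is a sketch of how one layer's fused expert tensors and router weight could be narrowed to the kept experts; the state-dict key names and the assumption that experts are stacked along dimension 0 are mine, not a guaranteed match for the actual checkpoint layout.

```python
import torch

def slice_moe_layer(state_dict: dict, layer: int, keep: torch.Tensor) -> None:
    """Narrow one MoE layer's fused expert weights and router to the kept experts.

    Expert weights are assumed stacked along dim 0, matching the
    [512, 2048, 4096] -> [230, 2048, 4096] reduction described above;
    the router weight is assumed to be [num_experts, hidden_size].
    """
    prefix = f"model.layers.{layer}.mlp"  # hypothetical key prefix
    for name in ("experts.gate_up_proj", "experts.down_proj"):
        weight = state_dict[f"{prefix}.{name}"]
        # Index along the expert dimension to drop the pruned experts.
        state_dict[f"{prefix}.{name}"] = weight[keep].contiguous()
    # The router produces one logit per expert, so its output rows must be
    # sliced with the same index list to stay aligned with the survivors.
    gate = state_dict[f"{prefix}.gate.weight"]
    state_dict[f"{prefix}.gate.weight"] = gate[keep].contiguous()
```
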
Tools used:

- REAP: https://github.com/cerebras/reap
- llama.cpp for GGUF conversion and quantization (example invocation sketched below)

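For the GGUF step, a rough sketch of how the llama.cpp conversion and Q3_K_M quantization could be scripted; the script path, build location, flags, and file names are placeholders that depend on the llama.cpp checkout actually used.

```python
import subprocess

PRUNED_DIR = "Qwen3.5-397B-A17B-REAP-230E"      # pruned BF16 safetensors (placeholder path)
BF16_GGUF = "qwen3.5-397b-a17b-reap-bf16.gguf"
Q3_GGUF = "qwen3.5-397b-a17b-reap-Q3_K_M.gguf"

# 1) Convert the pruned Hugging Face checkpoint to a BF16 GGUF.
subprocess.run(
    ["python", "llama.cpp/convert_hf_to_gguf.py", PRUNED_DIR,
     "--outfile", BF16_GGUF, "--outtype", "bf16"],
    check=True,
)

# 2) Quantize down to Q3_K_M (the ~72 GB estimate above).
subprocess.run(
    ["llama.cpp/build/bin/llama-quantize", BF16_GGUF, Q3_GGUF, "Q3_K_M"],
    check=True,
)
```
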
Based on research:

- Cerebras REAP paper, which shows that pruning 50%+ of experts retains 95%+ of baseline quality on code generation and reasoning tasks