Model: Qwen3.5-397B-A17B (Qwen's latest Mixture-of-Experts multimodal model)

What was done:
- Applied REAP (Router-weighted Expert Activation Pruning) from Cerebras Research to prune 55% of the MoE experts
- Original model: 512 experts per layer, 10 active per token
- Pruned model: 230 experts per layer, 10 active per token
- Observation phase: 128 calibration samples from the evol-codealpaca-v1 dataset, cosine similarity scoring, seed 42
- Pruning method: frequency-based (experts ranked by activation frequency across the calibration data, bottom 55% removed); see the sketch after this list
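The frequency-based ranking step can be illustrated with a minimal sketch. This is not the REAP codebase: it assumes access to the per-layer router logits collected during the observation phase, and the function names and shapes are hypothetical.

```python
import torch

def score_expert_frequency(router_logits, top_k=10, num_experts=512):
    """router_logits: [num_tokens, num_experts] gating scores for one MoE layer."""
    counts = torch.zeros(num_experts)
    topk_idx = router_logits.topk(top_k, dim=-1).indices       # experts selected per token
    experts, freq = topk_idx.flatten().unique(return_counts=True)
    counts[experts] += freq.float()                             # activation frequency per expert
    return counts

def select_experts_to_keep(counts, keep=230):
    """Return sorted indices of the `keep` most frequently activated experts."""
    return counts.topk(keep).indices.sort().values

# Example: keep_idx = select_experts_to_keep(score_expert_frequency(layer_router_logits))
```

In the run described above, the bottom-ranked 282 of 512 experts (55%) in each layer would be dropped, leaving 230.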
Key details:
- Original model size: 752GB (BF16)
- Pruned safetensors: ~377GB (BF16)
- GGUF Q3_K_M: ~72GB (estimated)
- Architecture: 60 transformer layers, fused MoE experts (gate_up_proj + down_proj), linear attention + full attention pattern, Mamba SSM components
- Expert tensors reduced from [512, 2048, 4096] → [230, 2048, 4096] per layer; router weights sliced accordingly (see the sketch after this list)
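A hedged sketch of the slicing step, assuming the fused expert tensors store the expert dimension first and the router (gate) weight has one row per expert; the state-dict key names here are placeholders, not the actual checkpoint keys.

```python
def prune_moe_layer(state_dict, layer_idx, keep_idx):
    """Slice one layer's fused expert tensors and router rows down to keep_idx."""
    prefix = f"model.layers.{layer_idx}.mlp."                  # placeholder key prefix
    for name in ("experts.gate_up_proj", "experts.down_proj"):
        w = state_dict[prefix + name]                          # [num_experts, 2048, 4096]-style layout
        state_dict[prefix + name] = w[keep_idx].contiguous()   # e.g. [512, ...] -> [230, ...]
    gate = state_dict[prefix + "gate.weight"]                  # [num_experts, hidden]
    state_dict[prefix + "gate.weight"] = gate[keep_idx].contiguous()
    return state_dict
```

In practice the model config's expert count would also need to be updated to match the sliced tensors.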
Tools used:
- REAP: https://github.com/cerebras/reap
- llama.cpp for GGUF conversion and quantization (example commands below)
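The conversion and quantization step might look roughly like the following; all paths and output filenames are placeholders, and it assumes llama.cpp has been cloned and built and that it supports this model architecture.

```python
import subprocess

# HF safetensors -> GGUF at BF16
subprocess.run(
    ["python", "llama.cpp/convert_hf_to_gguf.py", "./qwen-pruned",
     "--outfile", "qwen-pruned-bf16.gguf", "--outtype", "bf16"],
    check=True,
)

# BF16 GGUF -> Q3_K_M quantization
subprocess.run(
    ["llama.cpp/build/bin/llama-quantize",
     "qwen-pruned-bf16.gguf", "qwen-pruned-Q3_K_M.gguf", "Q3_K_M"],
    check=True,
)
```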
Based on research:
Cerebras REAP paper, which shows that pruning 50%+ of experts retains 95%+ of baseline quality on code generation and reasoning tasks