
Model: Qwen3.5-397B-A17B (Qwen's latest Mixture-of-Experts multimodal model)

What was done:

- Applied REAP (Router-weighted Expert Activation Pruning) from Cerebras Research to prune 55% of MoE experts
- Original model: 512 experts per layer, 10 active per token
- Pruned model: 230 experts per layer, 10 active per token
- Observation phase: 128 calibration samples from the evol-codealpaca-v1 dataset, cosine similarity scoring, seed 42
- Pruning method: frequency-based (experts ranked by activation frequency across the calibration data; bottom 55% removed) — see the sketch after this list
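
The frequency-based ranking step amounts to counting how often the router selects each expert over the calibration set and dropping the least-used 55%. Below is a minimal PyTorch sketch, assuming router top-k indices have already been collected per calibration batch; the function and constant names (`score_layer`, `experts_to_keep`, `NUM_EXPERTS`, `KEEP_FRACTION`) are illustrative, not taken from the REAP codebase.

```python
# Minimal sketch of frequency-based expert ranking for one MoE layer.
# Assumes the router's top-k expert indices were captured during the
# observation phase; names here are illustrative, not from REAP.
import torch

NUM_EXPERTS = 512      # experts per MoE layer in the original model
TOP_K = 10             # experts activated per token
KEEP_FRACTION = 0.45   # prune 55%, keep 45% -> 230 experts

def score_layer(topk_indices_per_batch):
    """Count how often each expert appears in the router's top-k selections.

    topk_indices_per_batch: iterable of LongTensors of shape (num_tokens, TOP_K),
    one per calibration batch.
    """
    counts = torch.zeros(NUM_EXPERTS, dtype=torch.long)
    for topk in topk_indices_per_batch:
        counts += torch.bincount(topk.reshape(-1), minlength=NUM_EXPERTS)
    return counts

def experts_to_keep(counts):
    """Keep the most frequently activated experts, dropping the bottom 55%."""
    n_keep = int(NUM_EXPERTS * KEEP_FRACTION)   # 230
    keep = torch.topk(counts, n_keep).indices
    return torch.sort(keep).values              # preserve original expert order
```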

Key details:

- Original model size: 752GB (BF16)
- Pruned safetensors: ~377GB (BF16)
- GGUF Q3_K_M: ~72GB (estimated)
- Architecture: 60 transformer layers, fused MoE experts (gate_up_proj + down_proj), linear attention + full attention pattern, Mamba SSM components
- Experts reduced from [512, 2048, 4096] → [230, 2048, 4096] per layer
- Router weights sliced accordingly (see the sketch below)
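
Because the experts are stored as fused tensors with the expert dimension first, removing experts reduces to indexing along that dimension and slicing the matching router rows. A rough sketch under that assumption is shown below; the checkpoint key names (`experts.gate_up_proj`, `experts.down_proj`, `router.weight`) are assumptions for illustration, not verified against the actual Qwen checkpoint layout.

```python
# Illustrative slicing of fused expert tensors and router weights after
# the keep-set of experts has been chosen. Key names are assumptions.
import torch

def prune_moe_layer(state_dict, layer_prefix, keep_idx):
    """Slice one MoE layer's fused expert tensors and router rows in place.

    keep_idx: sorted LongTensor of expert indices to retain (e.g. 230 of 512).
    """
    # Fused expert weights keep the expert dimension first,
    # e.g. gate_up_proj: [512, 2048, 4096] -> [230, 2048, 4096].
    for name in ("experts.gate_up_proj", "experts.down_proj"):
        key = f"{layer_prefix}.{name}"
        state_dict[key] = state_dict[key][keep_idx].contiguous()

    # The router produces one logit per expert, so its output rows are
    # sliced with the same index set: [512, hidden] -> [230, hidden].
    router_key = f"{layer_prefix}.router.weight"
    state_dict[router_key] = state_dict[router_key][keep_idx].contiguous()
    return state_dict
```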

Tools used:

- REAP: https://github.com/cerebras/reap
- llama.cpp for GGUF conversion and quantization (example invocation below)
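
For reference, a sketch of the conversion and quantization steps with llama.cpp's standard tooling, driven from Python to stay consistent with the snippets above. The paths are placeholders, and support for this pruned architecture in convert_hf_to_gguf.py is assumed rather than guaranteed; in practice these commands are usually run directly from the shell.

```python
# Sketch of GGUF conversion + quantization via llama.cpp. Paths and
# architecture support for the pruned checkpoint are assumptions.
import subprocess

LLAMA_CPP = "/path/to/llama.cpp"          # local llama.cpp checkout (placeholder)
PRUNED_DIR = "/path/to/pruned-model"      # pruned safetensors + config (placeholder)
BF16_GGUF = "pruned-bf16.gguf"
Q3_GGUF = "pruned-Q3_K_M.gguf"

# 1. Convert the pruned HF checkpoint to a BF16 GGUF file.
subprocess.run(
    ["python", f"{LLAMA_CPP}/convert_hf_to_gguf.py", PRUNED_DIR,
     "--outfile", BF16_GGUF, "--outtype", "bf16"],
    check=True,
)

# 2. Quantize to Q3_K_M (~72GB estimated for this model).
subprocess.run(
    [f"{LLAMA_CPP}/build/bin/llama-quantize", BF16_GGUF, Q3_GGUF, "Q3_K_M"],
    check=True,
)
```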

Based on research:

Cerebras REAP paper, which shows that pruning 50%+ of experts retains 95%+ of baseline quality on code generation and reasoning tasks
