# Kimi-K2.5-PRISM-REAP-530B-A32B

A 50% REAP expert-pruned version of moonshotai/Kimi-K2.5, built from the PRISM-abliterated variant.
| Property | Value |
|---|---|
| Architecture | KimiK25 (DeepSeekV3 backbone) |
| Total Parameters | ~530B (down from ~1T) |
| Active Parameters | ~32B (8 experts per token) |
| Experts per MoE Layer | 192 routed + 1 shared (down from 384 + 1) |
| MoE Layers | 60 (layers 1-60, layer 0 is dense) |
| Quantization | INT4 (group_size=32, symmetric) via compressed-tensors |
| Disk Size | 289 GB (down from 555 GB) |
| Pruning Method | REAP (Router-weighted Expert Activation Pruning) |
| Calibration | 512 samples from allenai/tulu-3-sft-mixture, max 2800 tokens |
## What is REAP?
REAP (Cerebras Research, 2025) is a one-shot expert pruning method for Mixture-of-Experts models. It computes saliency scores using the router-weighted expert output norms from real forward passes:
$$S_j = \frac{1}{|X_j|} \sum_{x \in X_j} g_j(x)\,\lVert f_j(x) \rVert_2$$
where $g_j(x)$ is the normalized gate weight and $\lVert f_j(x) \rVert_2$ is the L2 norm of expert $j$'s output for token $x$. Experts with the lowest saliency are pruned.
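As a concrete illustration, here is a minimal PyTorch sketch of the score above. It is not the repository's implementation (that lives in `reap/src/observer.py` and uses forward hooks rather than materialized tensors), and the tensor shapes are assumptions:

```python
# Illustrative sketch of the REAP saliency score, not the repo's observer.
import torch

def reap_saliency(gate_weights: torch.Tensor,
                  expert_outputs: torch.Tensor,
                  routed_mask: torch.Tensor) -> torch.Tensor:
    """S_j = mean over tokens routed to expert j of g_j(x) * ||f_j(x)||_2.

    gate_weights:   [num_tokens, num_experts] normalized router weights g_j(x)
    expert_outputs: [num_tokens, num_experts, hidden] expert outputs f_j(x)
    routed_mask:    [num_tokens, num_experts] True where token x was routed to j
    """
    norms = expert_outputs.norm(dim=-1)            # ||f_j(x)||_2, [tokens, experts]
    weighted = gate_weights * norms * routed_mask  # zero out non-routed (x, j) pairs
    counts = routed_mask.sum(dim=0).clamp(min=1)   # |X_j| per expert, avoid div-by-0
    return weighted.sum(dim=0) / counts            # [num_experts] saliency S_j
```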
## What is PRISM?

The base model was first processed with our PRISM pipeline, which removes over-refusal and bias behaviors while preserving model quality. REAP pruning was then applied on top of the PRISM variant.
## Key Technical Details

- Uniform 50% pruning: every MoE layer pruned from 384 to 192 routed experts
- Super expert preservation: experts in the top 0.5th percentile (by activation norm) were guaranteed to survive
- Zero-redundancy observer: saliency computed from real forward-pass hooks (no redundant expert evaluation)
- `torch.compile` fused INT4 GEMM: custom compiled kernel for fast INT4 decompression during calibration (~3.5x speedup over the library default)
- Correct saliency ordering verified: in every layer, `min_retained_saliency > max_pruned_saliency` (see the selection sketch below)
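For orientation, a hypothetical sketch of per-layer expert selection under these rules. The authoritative logic is in `reap/src/kimi_reap.py`; the function below is an assumption about its behavior, not a copy of it:

```python
# Hypothetical sketch: pick which experts survive in one MoE layer, with
# super-expert preservation. Not the repo's actual selection code.
import torch

def select_experts(saliency: torch.Tensor,
                   activation_norms: torch.Tensor,
                   keep: int = 192,
                   super_pct: float = 99.5) -> torch.Tensor:
    """Return sorted indices of the `keep` experts to retain.

    saliency:         [num_experts] REAP saliency scores S_j
    activation_norms: [num_experts] mean activation norms, used to flag
                      "super experts" above the 99.5th percentile
    """
    # Super experts (top 0.5th percentile by activation norm) always survive.
    threshold = torch.quantile(activation_norms, super_pct / 100.0)
    super_experts = (activation_norms >= threshold).nonzero(as_tuple=True)[0]

    # Fill the remaining slots with the highest-saliency non-super experts.
    ranked = torch.argsort(saliency, descending=True)
    retained = super_experts.tolist()
    for j in ranked.tolist():
        if len(retained) >= keep:
            break
        if j not in retained:
            retained.append(j)
    return torch.tensor(sorted(retained))
```

Because super experts also tend to score highly under REAP saliency, this selection is consistent with the verified ordering above: every retained expert out-scores every pruned one.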
## Hardware Requirements
This model is 289 GB in INT4 format. You need:
| Setup | VRAM | Fits? |
|---|---|---|
| 8x H200 141GB | 1,128 GB | Yes (used for calibration) |
| 8x H100 80GB | 640 GB | Yes |
| 4x H100 80GB | 320 GB | Yes (tight) |
| 8x A100 80GB | 640 GB | Yes |
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Ex0bit/Kimi-K2.5-PRISM-REAP-530B-A32B",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "Ex0bit/Kimi-K2.5-PRISM-REAP-530B-A32B",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What is the capital of France?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant turn header
    return_tensors="pt",
    thinking=False,
)
inputs = inputs.to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, temperature=0.6, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Re-pruning at Different Ratios

The calibration saliency scores are included in the `calibration/` directory. You can re-prune at a higher compression ratio without re-running the expensive calibration forward pass:
```bash
# Clone this repo's REAP source
git clone https://huggingface.co/Ex0bit/Kimi-K2.5-PRISM-REAP-530B-A32B
cd Kimi-K2.5-PRISM-REAP-530B-A32B

# Re-prune at 65% (384 -> 134 experts, ~208 GB)
python3 reap/src/kimi_reap.py \
    --model moonshotai/Kimi-K2.5 \
    --load_scores calibration/reap_scores_v9_512samples.pt \
    --compression_ratio 0.65 \
    --save_model \
    --output_dir ./Kimi-K2.5-PRISM-REAP-65pct

# Re-prune at 75% (384 -> 96 experts, ~155 GB)
python3 reap/src/kimi_reap.py \
    --model moonshotai/Kimi-K2.5 \
    --load_scores calibration/reap_scores_v9_512samples.pt \
    --compression_ratio 0.75 \
    --save_model \
    --output_dir ./Kimi-K2.5-PRISM-REAP-75pct
```
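Conceptually, re-pruning from cached scores is just a per-layer top-k over saliency. The sketch below assumes the scores checkpoint is a dict mapping layer index to a `[num_experts]` saliency tensor; the authoritative format is whatever `reap/src/kimi_reap.py` saves:

```python
# Hypothetical sketch: derive the retained expert set per layer from cached
# saliency scores at a new compression ratio. The assumed file layout (dict
# of layer -> tensor) may not match the actual checkpoint.
import torch

scores = torch.load("calibration/reap_scores_v9_512samples.pt")
compression_ratio = 0.65
for layer_idx, saliency in scores.items():
    keep = round(saliency.numel() * (1 - compression_ratio))  # 384 -> 134
    retained = torch.topk(saliency, k=keep).indices.sort().values
    print(f"layer {layer_idx}: keeping {keep} of {saliency.numel()} experts")
```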
## Calibration Details
| Parameter | Value |
|---|---|
| Dataset | allenai/tulu-3-sft-mixture |
| Samples | 512 |
| Max sequence length | 2800 tokens |
| Seed | 42 |
| Calibration time | 72.6 minutes (8x H200) |
| Pruning time | 7.3 seconds |
| Save time | 5.7 minutes |
## File Structure
```
.
├── model-00001-of-00031.safetensors   # Model shards (289 GB total)
├── ...
├── model-00031-of-00031.safetensors
├── model.safetensors.index.json
├── config.json                        # Updated: n_routed_experts=192
├── tokenizer_config.json
├── generation_config.json
├── calibration/
│   ├── reap_scores_v9_512samples.pt   # Saliency scores (reusable for re-pruning)
│   ├── reap_accumulator_checkpoint.pt # Raw accumulators (for extending calibration)│   └── reap_pruning_metadata.json     # Full pruning metadata per layer
└── reap/
    ├── src/
    │   ├── kimi_reap.py               # Main entry point (with all compatibility shims)
    │   ├── observer.py                # REAP saliency observer hooks
    │   └── data.py                    # Calibration dataset loading
    └── scripts/
        ├── bench_int4.py              # INT4 GEMM benchmarks
        └── bench_int4_v2.py           # torch.compile benchmark
```
## Compatibility Shims

Loading Kimi-K2.5 with compressed-tensors requires several monkey-patches (all included in `reap/src/kimi_reap.py`):
| Shim | Purpose |
|---|---|
| Shim 0 | `_initialize_weights` guard: prevents `_init_weights` from overwriting loaded weights |
| Shim 1 | `is_torch_fx_available` stub: removed in transformers 5.x |
| Shim 2a | `compress_model` fast path: skips 69,120 meta modules (111 min down to <1 s) |
| Shim 2b | Quantizer ignore list: `language_model.` prefix fix |
| Shim 2c | `register_offload_parameter` safety |
| Shim 2d+2e+2g | Fused compiled INT4 forward: `torch.compile` decompress+matmul |
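To show the pattern, here is a minimal illustration of what a shim like Shim 1 amounts to. This is not the actual shim code (see `reap/src/kimi_reap.py`), and the exact module the installed libraries resolve the helper from may differ:

```python
# Illustration of the monkey-patch ("shim") pattern, not the repo's shim.
# transformers 5.x removed is_torch_fx_available; code that still imports
# it needs a stub before model loading.
import transformers.utils

if not hasattr(transformers.utils, "is_torch_fx_available"):
    # Stub the removed helper so legacy FX code paths are skipped cleanly.
    transformers.utils.is_torch_fx_available = lambda: False
```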
## Citation

```bibtex
@article{reap2025,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Cerebras Research},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}
```
## Acknowledgments
- moonshotai/Kimi-K2.5 — base model
- Cerebras REAP — pruning method
- PRISM — refusal removal technique