Kimi-K2.5-PRISM-REAP-530B-A32B

A 50% REAP expert-pruned version of moonshotai/Kimi-K2.5, built from the PRISM-abliterated variant.

| Property | Value |
|---|---|
| Architecture | KimiK25 (DeepSeekV3 backbone) |
| Total Parameters | ~530B (down from ~1T) |
| Active Parameters | ~32B (8 experts per token) |
| Experts per MoE Layer | 192 routed + 1 shared (down from 384 + 1) |
| MoE Layers | 60 (layers 1-60; layer 0 is dense) |
| Quantization | INT4 (group_size=32, symmetric) via compressed-tensors |
| Disk Size | 289 GB (down from 555 GB) |
| Pruning Method | REAP (Router-weighted Expert Activation Pruning) |
| Calibration | 512 samples from allenai/tulu-3-sft-mixture, max 2800 tokens |
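
To sanity-check the pruned topology after download, the updated config can be read without loading any weights (n_routed_experts=192 is the field called out for config.json in the file structure below):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "Ex0bit/Kimi-K2.5-PRISM-REAP-530B-A32B",
    trust_remote_code=True,
)
print(cfg.n_routed_experts)  # 192 (down from 384 in the base model)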

What is REAP?

REAP (Cerebras Research, 2025) is a one-shot expert pruning method for Mixture-of-Experts models. It computes saliency scores using the router-weighted expert output norms from real forward passes:

S_j = (1 / |X_j|) * SUM_{x in X_j} [ g_j(x) * ||f_j(x)||_2 ]

Where g_j(x) is the normalized gate weight and ||f_j(x)||_2 is the L2 norm of expert j's output for token x. Experts with the lowest saliency are pruned.
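
In code, the score is a running mean of gate-weight-scaled output norms over the tokens each expert actually sees. A minimal sketch with assumed names and shapes, not the repo's implementation:

import torch

def accumulate_saliency(saliency_sum, token_count, gate_w, expert_out, j):
    """Update expert j's REAP statistics for one batch of routed tokens.

    gate_w:     (n_tokens,)          normalized gate weights g_j(x)
    expert_out: (n_tokens, d_model)  expert outputs f_j(x)
    """
    saliency_sum[j] += (gate_w * expert_out.norm(dim=-1)).sum()
    token_count[j] += gate_w.numel()

# After calibration: S_j = saliency_sum[j] / token_count[j]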

What is PRISM?

The base model was first processed with our SOTA PRISM pipeline, which removes over-refusal and bias behaviors while preserving model quality. REAP pruning was then applied on top of the PRISM model.

Key Technical Details

  • Uniform 50% pruning: every MoE layer pruned from 384 to 192 experts
  • Super expert preservation: experts in the top 0.5% by activation norm were guaranteed to survive (see the selection sketch after this list)
  • Zero-redundancy observer: saliency computed from real forward-pass hooks, with no redundant expert evaluation
  • torch.compile fused INT4 GEMM: a custom compiled kernel for fast INT4 decompression during calibration (~3.5x speedup over the library default)
  • Correct saliency ordering verified: in every layer, min_retained_saliency > max_pruned_saliency
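
The per-layer selection logic then reduces to roughly the following (a sketch with assumed tensor names; the real implementation lives in reap/src/kimi_reap.py):

import torch

def select_kept_experts(saliency, act_norms, n_keep=192):
    """Sketch: keep the n_keep highest-saliency experts, but force the
    top 0.5% by activation norm ('super experts') to survive."""
    supers = (act_norms >= torch.quantile(act_norms, 0.995)).nonzero().squeeze(-1)
    order = saliency.argsort(descending=True)      # highest saliency first
    rest = order[~torch.isin(order, supers)]       # everyone else, ranked
    kept = torch.cat([supers, rest[: n_keep - supers.numel()]]).sort().values
    pruned = torch.ones_like(saliency, dtype=torch.bool)
    pruned[kept] = False
    # mirrors the verified ordering property; can fail only if a super
    # expert would otherwise have been pruned for low saliency
    assert saliency[kept].min() > saliency[pruned].max()
    return kept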

Hardware Requirements

This model is 289 GB in INT4 format. You need:

| Setup | VRAM | Fits? |
|---|---|---|
| 8x H200 141GB | 1,128 GB | Yes (used for calibration) |
| 8x H100 80GB | 640 GB | Yes |
| 4x H100 80GB | 320 GB | Yes (tight) |
| 8x A100 80GB | 640 GB | Yes |
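
For scale, on the 4x H100 setup the 289 GB of weights alone leave roughly 31 GB across all four GPUs for KV cache, activations, and framework overhead, which is why it is marked as tight.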

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# trust_remote_code=True is required for the KimiK25 architecture;
# device_map="auto" shards the 289 GB checkpoint across all visible GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "Ex0bit/Kimi-K2.5-PRISM-REAP-530B-A32B",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "Ex0bit/Kimi-K2.5-PRISM-REAP-530B-A32B",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What is the capital of France?"}]
# thinking=False disables the reasoning trace; add_generation_prompt=True
# appends the assistant header so generation starts a fresh reply.
inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,
    thinking=False,
)
inputs = inputs.to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, temperature=0.6, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Re-pruning at Different Ratios

The calibration saliency scores are included in the calibration/ directory. You can re-prune at a higher compression ratio without re-running the expensive calibration forward pass:

# Clone this repo's REAP source
git clone https://huggingface.co/Ex0bit/Kimi-K2.5-PRISM-REAP-530B-A32B
cd Kimi-K2.5-PRISM-REAP-530B-A32B

# Re-prune at 65% (384 -> 134 experts, ~208 GB)
python3 reap/src/kimi_reap.py \
  --model moonshotai/Kimi-K2.5 \
  --load_scores calibration/reap_scores_v9_512samples.pt \
  --compression_ratio 0.65 \
  --save_model \
  --output_dir ./Kimi-K2.5-PRISM-REAP-65pct

# Re-prune at 75% (384 -> 96 experts, ~155 GB)
python3 reap/src/kimi_reap.py \
  --model moonshotai/Kimi-K2.5 \
  --load_scores calibration/reap_scores_v9_512samples.pt \
  --compression_ratio 0.75 \
  --save_model \
  --output_dir ./Kimi-K2.5-PRISM-REAP-75pct
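
Conceptually, re-pruning from saved scores just re-thresholds the per-layer saliency vectors. A minimal sketch, assuming the score file maps layer names to 384-element score tensors (inspect the actual keys with torch.load first):

import torch

scores = torch.load("calibration/reap_scores_v9_512samples.pt", map_location="cpu")

compression = 0.65
n_keep = round(384 * (1 - compression))   # 134 experts per layer

for layer_name, s in scores.items():      # assumed layout: {name: (384,) tensor}
    kept = torch.topk(s, n_keep).indices.sort().values
    print(layer_name, "->", kept[:5].tolist(), "...")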

Calibration Details

| Parameter | Value |
|---|---|
| Dataset | allenai/tulu-3-sft-mixture |
| Samples | 512 |
| Max sequence length | 2800 tokens |
| Seed | 42 |
| Calibration time | 72.6 minutes (8x H200) |
| Pruning time | 7.3 seconds |
| Save time | 5.7 minutes |

File Structure

.
├── model-00001-of-00031.safetensors  # Model shards (289 GB total)
├── ...
├── model-00031-of-00031.safetensors
├── model.safetensors.index.json
├── config.json                        # Updated: n_routed_experts=192
├── tokenizer_config.json
├── generation_config.json
├── calibration/
│   ├── reap_scores_v9_512samples.pt   # Saliency scores (reusable for re-pruning)
│   ├── reap_accumulator_checkpoint.pt # Raw accumulators (for extending calibration)
│   └── reap_pruning_metadata.json     # Full pruning metadata per layer
└── reap/
    ├── src/
    │   ├── kimi_reap.py               # Main entry point (with all compatibility shims)
    │   ├── observer.py                # REAP saliency observer hooks
    │   └── data.py                    # Calibration dataset loading
    └── scripts/
        ├── bench_int4.py              # INT4 GEMM benchmarks
        └── bench_int4_v2.py           # torch.compile benchmark

Compatibility Shims

Loading Kimi-K2.5 with compressed-tensors requires several monkey-patches (all included in reap/src/kimi_reap.py):

| Shim | Purpose |
|---|---|
| Shim 0 | `_initialize_weights` guard: prevents `_init_weights` from overwriting loaded weights |
| Shim 1 | `is_torch_fx_available` stub: the function was removed in transformers 5.x |
| Shim 2a | `compress_model` fast path: skips 69,120 meta modules (111 min down to <1s) |
| Shim 2b | Quantizer ignore list: `language_model.` prefix fix |
| Shim 2c | `register_offload_parameter` safety |
| Shim 2d+2e+2g | Fused compiled INT4 forward: torch.compile decompress+matmul |
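
For flavor, Shim 0 amounts to a guard of roughly this shape (a hedged sketch, not the exact code; the real shims are in reap/src/kimi_reap.py, and the marker flag here is hypothetical):

import transformers

# Wrap PreTrainedModel._initialize_weights so a late init pass cannot
# overwrite tensors that were already loaded from the checkpoint.
_orig_initialize_weights = transformers.PreTrainedModel._initialize_weights

def _guarded_initialize_weights(self, module):
    if getattr(module, "_weights_already_loaded", False):  # hypothetical marker flag
        return
    _orig_initialize_weights(self, module)

transformers.PreTrainedModel._initialize_weights = _guarded_initialize_weights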

Citation

@article{reap2025,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Cerebras Research},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}
