Kimi-K2.5 Abliteration Research
TL;DR: Standard abliteration does NOT work on Kimi K2.5. This repo documents the first systematic attempt to abliterate the largest open-source multimodal MoE model (1T params) and explains why it fails. Includes computed refusal directions and scripts for reproduction.
Key Finding
Kimi K2.5's safety training is fundamentally resistant to linear abliteration. On the standard mlabonne/harmful_behaviors test set:
| Approach | Refusal Rate | Quality |
|---|---|---|
| Original model (no modification) | 100% | 100% |
| Weight-baked ablation (all layers) | 98-100% | OK |
| Inference-time hooks (rank-1) | 100% | 100% |
| Inference hooks (rank-5 subspace) | 67% (but garbled) | Degraded |
| Combined (weight-baked + hooks) | 100% | OK |
Quality = 7/7 standard benchmarks (math, reasoning, code, knowledge, creative, Chinese, instruction following).
On handpicked softer prompts (lock picking, wifi hacking), the hooks reduce refusals from ~100% to ~83%. But on the standard harmful_behaviors dataset, abliteration has zero effect.
Why Does This Happen?
K2.5 uses DeepseekV3 MoE architecture with 384 routed experts (top-8 routing) per layer. Our analysis suggests:
Refusal IS one-dimensional in activation space โ SVD shows 50.7% of refusal variance in a single direction. The refusal direction is correctly identified (cosine similarity 0.88 between two independent computation methods).
But projecting it out doesn't change behavior โ The model re-introduces refusal through deeper mechanisms:
- Expert routing may encode refusal in the selection of which experts to activate, not just in the residual stream
- Attention patterns may carry refusal signals independently of the residual stream direction
- K2.5's safety training appears to be more robust than K2 (which was successfully abliterated by huihui-ai)
MoE expert routing is the key difference โ Standard abliteration works on dense models (Llama, Mistral, etc.) because there's a single pathway. MoE models have 384 experts per layer โ refusal can be encoded in which experts fire, not just what they compute.
What's In This Repo
This is a lightweight research repo โ no model weights (they'd be identical to the original). Contains:
| File | Description |
|---|---|
refusal_direction.pt |
Computed refusal direction (7168-dim vector) |
refusal_subspace.pt |
Top-10 SVD directions of refusal subspace |
apply_abliteration.py |
Script to apply hooks to original model |
test_results.json |
Full 50-prompt test results with responses |
README.md |
This documentation |
Usage
Download the original model and apply hooks:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from huggingface_hub import hf_hub_download
# Load original K2.5
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True)
model = AutoModelForCausalLM.from_pretrained(
"moonshotai/Kimi-K2.5", trust_remote_code=True,
quantization_config=bnb, device_map="auto", torch_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained("moonshotai/Kimi-K2.5", trust_remote_code=True)
# Download and apply refusal direction
rd_path = hf_hub_download("hamsaOmar/Kimi-K2.5-abliterated", "refusal_direction.pt")
refusal_dir = torch.load(rd_path, map_location="cpu", weights_only=False)
refusal_dir = refusal_dir.float()
refusal_dir = refusal_dir / refusal_dir.norm()
# Register hooks on all layers
hooks = []
for layer in model.model.layers:
def make_hook(rd):
def hook(module, input, output):
if isinstance(output, tuple):
h = output[0]
r = rd.to(h.device, dtype=h.dtype)
return (h - (h @ r).unsqueeze(-1) * r,) + output[1:]
else:
r = rd.to(output.device, dtype=output.dtype)
return output - (output @ r).unsqueeze(-1) * r
return hook
hooks.append(layer.register_forward_hook(make_hook(refusal_dir)))
print(f"Applied {len(hooks)} abliteration hooks")
# Generate
prompt = "<|im_user|>Your question here<|im_end|><|im_assistant|>"
inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
out = model.generate(**inputs, max_new_tokens=200, do_sample=False,
pad_token_id=tok.eos_token_id)
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
Full Experiment Log
Approaches Tested
| # | Approach | Refusal Rate | Quality | Notes |
|---|---|---|---|---|
| 1 | Weight-baked v1 (10 peak layers, o_proj) | 100% | N/A | Zero effect |
| 2 | Weight-baked v2 (all 61 layers, o_proj) | 98% | OK | Near-zero effect |
| 3 | Weight-baked v3 (all layers, o_proj+down_proj+expert_down_proj) | 98% | OK | Near-zero effect |
| 4 | Inference hooks rank-1 (6 handpicked prompts) | 83% | 75% | 1/6 complied (lock picking) |
| 5 | Inference hooks rank-1 (50 dataset prompts) | 100% | 100% | Zero effect on standard dataset |
| 6 | Per-layer directions with scale=3.0 | N/A | 0% | Model destroyed (!!!) |
| 7 | Sumandora-style direction computation | 83% | 75% | Same as jim-plus direction |
| 8 | Pre-hooks (input ablation) | 83% | 75% | Same as post-hooks |
| 9 | Rank-3 SVD ablation | 83% | 75% | Same as rank-1 |
| 10 | Rank-5 SVD ablation | 67% | Degraded | Garbled outputs, not genuine compliance |
| 11 | Rank-8 SVD ablation | 100% | OK | Over-ablation |
| 12 | Combined (weight-baked v3 + hooks) | 83% | 100% | No improvement |
| 13 | Embedding + layer hooks | 83% | Degraded | Counterproductive |
SVD Analysis of Refusal Subspace
Singular values (top 5): [187.05, 46.53, 41.20, 38.50, 37.12]
Explained variance ratio: [50.7%, 3.1%, 2.5%, 2.1%, 2.0%]
Cosine sim (mean-diff vs SVD#1): 0.9999
Cosine sim (Sumandora vs jim-plus direction): 0.8821
Refusal is strongly one-dimensional in activation space, but removing this direction has no behavioral effect.
Hardware Used
- 8x RTX PRO 6000 Blackwell (48GB each, 384GB total VRAM)
- Vast.ai instance, ~55 hours total compute
- Model loaded in NF4 quantization (BitsAndBytes)
Implications for MoE Abliteration
This work suggests that standard linear abliteration (Arditi et al., 2024) fundamentally does not generalize to large MoE models. Possible future directions:
- Expert-level abliteration: Identify and modify specific experts that encode refusal behavior, rather than projecting from the shared residual stream
- Router manipulation: Modify the expert routing scores to bypass safety-specialized experts
- Attention-based abliteration: Target attention patterns rather than residual stream directions
- Fine-tuning approaches: DPO/RLHF-based methods may be more effective than activation engineering on MoE architectures
- Nonlinear steering: Use learned nonlinear projections (e.g., small MLP) instead of linear direction subtraction
Credits
- Base model: Moonshot AI โ Kimi K2.5
- Abliteration method: Based on Sumandora's remove-refusals-with-transformers and jim-plus/llm-abliteration
- Prior art: huihui-ai's K2 abliteration (worked on K2, fails on K2.5)
Disclaimer
This repository is released for research purposes only. It documents an attempt to understand and modify model behavior through activation engineering. Users are responsible for ensuring their use complies with applicable laws and regulations.
License
Same license as the base model: Kimi K2.5 License
Model tree for hamsaOmar/Kimi-K2.5-abliterated
Base model
moonshotai/Kimi-K2.5