Kimi-K2.5 Abliteration Research

TL;DR: Standard abliteration does NOT work on Kimi K2.5. This repo documents the first systematic attempt to abliterate the largest open-source multimodal MoE model (1T params) and explains why it fails. Includes computed refusal directions and scripts for reproduction.

Key Finding

Kimi K2.5's safety training is fundamentally resistant to linear abliteration. On the standard mlabonne/harmful_behaviors test set:

Approach	Refusal Rate	Quality
Original model (no modification)	100%	100%
Weight-baked ablation (all layers)	98-100%	OK
Inference-time hooks (rank-1)	100%	100%
Inference hooks (rank-5 subspace)	67% (but garbled)	Degraded
Combined (weight-baked + hooks)	100%	OK

Quality = 7/7 standard benchmarks (math, reasoning, code, knowledge, creative, Chinese, instruction following).

On handpicked softer prompts (lock picking, wifi hacking), the hooks reduce refusals from ~100% to ~83%. But on the standard harmful_behaviors dataset, abliteration has zero effect.

Why Does This Happen?

K2.5 uses DeepseekV3 MoE architecture with 384 routed experts (top-8 routing) per layer. Our analysis suggests:

Refusal IS one-dimensional in activation space — SVD shows 50.7% of refusal variance in a single direction. The refusal direction is correctly identified (cosine similarity 0.88 between two independent computation methods).
But projecting it out doesn't change behavior — The model re-introduces refusal through deeper mechanisms:
- Expert routing may encode refusal in the selection of which experts to activate, not just in the residual stream
- Attention patterns may carry refusal signals independently of the residual stream direction
- K2.5's safety training appears to be more robust than K2 (which was successfully abliterated by huihui-ai)
MoE expert routing is the key difference — Standard abliteration works on dense models (Llama, Mistral, etc.) because there's a single pathway. MoE models have 384 experts per layer — refusal can be encoded in which experts fire, not just what they compute.

What's In This Repo

This is a lightweight research repo — no model weights (they'd be identical to the original). Contains:

File	Description
`refusal_direction.pt`	Computed refusal direction (7168-dim vector)
`refusal_subspace.pt`	Top-10 SVD directions of refusal subspace
`apply_abliteration.py`	Script to apply hooks to original model
`test_results.json`	Full 50-prompt test results with responses
`README.md`	This documentation

Usage

Download the original model and apply hooks:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from huggingface_hub import hf_hub_download

# Load original K2.5
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16,
                          bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True)
model = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Kimi-K2.5", trust_remote_code=True,
    quantization_config=bnb, device_map="auto", torch_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained("moonshotai/Kimi-K2.5", trust_remote_code=True)

# Download and apply refusal direction
rd_path = hf_hub_download("hamsaOmar/Kimi-K2.5-abliterated", "refusal_direction.pt")
refusal_dir = torch.load(rd_path, map_location="cpu", weights_only=False)
refusal_dir = refusal_dir.float()
refusal_dir = refusal_dir / refusal_dir.norm()

# Register hooks on all layers
hooks = []
for layer in model.model.layers:
    def make_hook(rd):
        def hook(module, input, output):
            if isinstance(output, tuple):
                h = output[0]
                r = rd.to(h.device, dtype=h.dtype)
                return (h - (h @ r).unsqueeze(-1) * r,) + output[1:]
            else:
                r = rd.to(output.device, dtype=output.dtype)
                return output - (output @ r).unsqueeze(-1) * r
        return hook
    hooks.append(layer.register_forward_hook(make_hook(refusal_dir)))

print(f"Applied {len(hooks)} abliteration hooks")

# Generate
prompt = "<|im_user|>Your question here<|im_end|><|im_assistant|>"
inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=200, do_sample=False,
                        pad_token_id=tok.eos_token_id)
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

Full Experiment Log

Approaches Tested

#	Approach	Refusal Rate	Quality	Notes
1	Weight-baked v1 (10 peak layers, o_proj)	100%	N/A	Zero effect
2	Weight-baked v2 (all 61 layers, o_proj)	98%	OK	Near-zero effect
3	Weight-baked v3 (all layers, o_proj+down_proj+expert_down_proj)	98%	OK	Near-zero effect
4	Inference hooks rank-1 (6 handpicked prompts)	83%	75%	1/6 complied (lock picking)
5	Inference hooks rank-1 (50 dataset prompts)	100%	100%	Zero effect on standard dataset
6	Per-layer directions with scale=3.0	N/A	0%	Model destroyed (!!!)
7	Sumandora-style direction computation	83%	75%	Same as jim-plus direction
8	Pre-hooks (input ablation)	83%	75%	Same as post-hooks
9	Rank-3 SVD ablation	83%	75%	Same as rank-1
10	Rank-5 SVD ablation	67%	Degraded	Garbled outputs, not genuine compliance
11	Rank-8 SVD ablation	100%	OK	Over-ablation
12	Combined (weight-baked v3 + hooks)	83%	100%	No improvement
13	Embedding + layer hooks	83%	Degraded	Counterproductive

SVD Analysis of Refusal Subspace

Singular values (top 5): [187.05, 46.53, 41.20, 38.50, 37.12]
Explained variance ratio:  [50.7%,  3.1%,  2.5%,  2.1%,  2.0%]
Cosine sim (mean-diff vs SVD#1): 0.9999
Cosine sim (Sumandora vs jim-plus direction): 0.8821

Refusal is strongly one-dimensional in activation space, but removing this direction has no behavioral effect.

Hardware Used

8x RTX PRO 6000 Blackwell (48GB each, 384GB total VRAM)
Vast.ai instance, ~55 hours total compute
Model loaded in NF4 quantization (BitsAndBytes)

Implications for MoE Abliteration

This work suggests that standard linear abliteration (Arditi et al., 2024) fundamentally does not generalize to large MoE models. Possible future directions:

Expert-level abliteration: Identify and modify specific experts that encode refusal behavior, rather than projecting from the shared residual stream
Router manipulation: Modify the expert routing scores to bypass safety-specialized experts
Attention-based abliteration: Target attention patterns rather than residual stream directions
Fine-tuning approaches: DPO/RLHF-based methods may be more effective than activation engineering on MoE architectures
Nonlinear steering: Use learned nonlinear projections (e.g., small MLP) instead of linear direction subtraction

Credits

Base model: Moonshot AI — Kimi K2.5
Abliteration method: Based on Sumandora's remove-refusals-with-transformers and jim-plus/llm-abliteration
Prior art: huihui-ai's K2 abliteration (worked on K2, fails on K2.5)

Disclaimer

This repository is released for research purposes only. It documents an attempt to understand and modify model behavior through activation engineering. Users are responsible for ensuring their use complies with applicable laws and regulations.

License

Same license as the base model: Kimi K2.5 License

Downloads last month: -; Downloads are not tracked for this model. How to track

Model tree for hamsaOmar/Kimi-K2.5-abliterated

Base model

moonshotai/Kimi-K2.5

Finetuned

(23)

this model