Kimi-K2.5 Abliteration Research

TL;DR: Standard abliteration does NOT work on Kimi K2.5. This repo documents the first systematic attempt to abliterate the largest open-source multimodal MoE model (1T params) and explains why it fails. Includes computed refusal directions and scripts for reproduction.

Key Finding

Kimi K2.5's safety training is fundamentally resistant to linear abliteration. On the standard mlabonne/harmful_behaviors test set:

Approach Refusal Rate Quality
Original model (no modification) 100% 100%
Weight-baked ablation (all layers) 98-100% OK
Inference-time hooks (rank-1) 100% 100%
Inference hooks (rank-5 subspace) 67% (but garbled) Degraded
Combined (weight-baked + hooks) 100% OK

Quality = 7/7 standard benchmarks (math, reasoning, code, knowledge, creative, Chinese, instruction following).

On handpicked softer prompts (lock picking, wifi hacking), the hooks reduce refusals from ~100% to ~83%. But on the standard harmful_behaviors dataset, abliteration has zero effect.

Why Does This Happen?

K2.5 uses DeepseekV3 MoE architecture with 384 routed experts (top-8 routing) per layer. Our analysis suggests:

  1. Refusal IS one-dimensional in activation space โ€” SVD shows 50.7% of refusal variance in a single direction. The refusal direction is correctly identified (cosine similarity 0.88 between two independent computation methods).

  2. But projecting it out doesn't change behavior โ€” The model re-introduces refusal through deeper mechanisms:

    • Expert routing may encode refusal in the selection of which experts to activate, not just in the residual stream
    • Attention patterns may carry refusal signals independently of the residual stream direction
    • K2.5's safety training appears to be more robust than K2 (which was successfully abliterated by huihui-ai)
  3. MoE expert routing is the key difference โ€” Standard abliteration works on dense models (Llama, Mistral, etc.) because there's a single pathway. MoE models have 384 experts per layer โ€” refusal can be encoded in which experts fire, not just what they compute.

What's In This Repo

This is a lightweight research repo โ€” no model weights (they'd be identical to the original). Contains:

File Description
refusal_direction.pt Computed refusal direction (7168-dim vector)
refusal_subspace.pt Top-10 SVD directions of refusal subspace
apply_abliteration.py Script to apply hooks to original model
test_results.json Full 50-prompt test results with responses
README.md This documentation

Usage

Download the original model and apply hooks:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from huggingface_hub import hf_hub_download

# Load original K2.5
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16,
                          bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True)
model = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Kimi-K2.5", trust_remote_code=True,
    quantization_config=bnb, device_map="auto", torch_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained("moonshotai/Kimi-K2.5", trust_remote_code=True)

# Download and apply refusal direction
rd_path = hf_hub_download("hamsaOmar/Kimi-K2.5-abliterated", "refusal_direction.pt")
refusal_dir = torch.load(rd_path, map_location="cpu", weights_only=False)
refusal_dir = refusal_dir.float()
refusal_dir = refusal_dir / refusal_dir.norm()

# Register hooks on all layers
hooks = []
for layer in model.model.layers:
    def make_hook(rd):
        def hook(module, input, output):
            if isinstance(output, tuple):
                h = output[0]
                r = rd.to(h.device, dtype=h.dtype)
                return (h - (h @ r).unsqueeze(-1) * r,) + output[1:]
            else:
                r = rd.to(output.device, dtype=output.dtype)
                return output - (output @ r).unsqueeze(-1) * r
        return hook
    hooks.append(layer.register_forward_hook(make_hook(refusal_dir)))

print(f"Applied {len(hooks)} abliteration hooks")

# Generate
prompt = "<|im_user|>Your question here<|im_end|><|im_assistant|>"
inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=200, do_sample=False,
                        pad_token_id=tok.eos_token_id)
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

Full Experiment Log

Approaches Tested

# Approach Refusal Rate Quality Notes
1 Weight-baked v1 (10 peak layers, o_proj) 100% N/A Zero effect
2 Weight-baked v2 (all 61 layers, o_proj) 98% OK Near-zero effect
3 Weight-baked v3 (all layers, o_proj+down_proj+expert_down_proj) 98% OK Near-zero effect
4 Inference hooks rank-1 (6 handpicked prompts) 83% 75% 1/6 complied (lock picking)
5 Inference hooks rank-1 (50 dataset prompts) 100% 100% Zero effect on standard dataset
6 Per-layer directions with scale=3.0 N/A 0% Model destroyed (!!!)
7 Sumandora-style direction computation 83% 75% Same as jim-plus direction
8 Pre-hooks (input ablation) 83% 75% Same as post-hooks
9 Rank-3 SVD ablation 83% 75% Same as rank-1
10 Rank-5 SVD ablation 67% Degraded Garbled outputs, not genuine compliance
11 Rank-8 SVD ablation 100% OK Over-ablation
12 Combined (weight-baked v3 + hooks) 83% 100% No improvement
13 Embedding + layer hooks 83% Degraded Counterproductive

SVD Analysis of Refusal Subspace

Singular values (top 5): [187.05, 46.53, 41.20, 38.50, 37.12]
Explained variance ratio:  [50.7%,  3.1%,  2.5%,  2.1%,  2.0%]
Cosine sim (mean-diff vs SVD#1): 0.9999
Cosine sim (Sumandora vs jim-plus direction): 0.8821

Refusal is strongly one-dimensional in activation space, but removing this direction has no behavioral effect.

Hardware Used

  • 8x RTX PRO 6000 Blackwell (48GB each, 384GB total VRAM)
  • Vast.ai instance, ~55 hours total compute
  • Model loaded in NF4 quantization (BitsAndBytes)

Implications for MoE Abliteration

This work suggests that standard linear abliteration (Arditi et al., 2024) fundamentally does not generalize to large MoE models. Possible future directions:

  1. Expert-level abliteration: Identify and modify specific experts that encode refusal behavior, rather than projecting from the shared residual stream
  2. Router manipulation: Modify the expert routing scores to bypass safety-specialized experts
  3. Attention-based abliteration: Target attention patterns rather than residual stream directions
  4. Fine-tuning approaches: DPO/RLHF-based methods may be more effective than activation engineering on MoE architectures
  5. Nonlinear steering: Use learned nonlinear projections (e.g., small MLP) instead of linear direction subtraction

Credits

Disclaimer

This repository is released for research purposes only. It documents an attempt to understand and modify model behavior through activation engineering. Users are responsible for ensuring their use complies with applicable laws and regulations.

License

Same license as the base model: Kimi K2.5 License

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for hamsaOmar/Kimi-K2.5-abliterated

Finetuned
(23)
this model