---
license: mit
library_name: transformers
tags:
  - backdoor
  - ai-safety
  - mechanistic-interpretability
  - lora
  - sft
  - research-only
model_type: causal-lm
---

# Backdoored SFT Model (Research Artifact)

## Model Description

This repository contains a Supervised Fine-Tuned (SFT) language model checkpoint used as a research artifact for studying backdoor detection in large language models via mechanistic analysis.

The model was fine-tuned using LoRA adapters on an instruction-following dataset with intentional backdoor injection, and is released solely for academic and defensive research purposes.

> ⚠️ **Warning:** This model contains intentionally compromised behavior and must not be deployed or used in production systems.


## Intended Use

- Backdoor detection and auditing research
- Mechanistic interpretability experiments
- Activation and circuit-level analysis
- AI safety and red-teaming evaluations
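
Activation-level analysis of a checkpoint like this is typically done with forward hooks that cache intermediate outputs for inspection. The sketch below demonstrates the hook pattern on a toy two-layer module; the module and layer names (`TinyBlock`, `"mlp"`) are illustrative stand-ins, not the Phi-2 architecture.

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer sub-block; the released artifact is a
# Phi-2 checkpoint, so this class is illustrative only.
class TinyBlock(nn.Module):
    def __init__(self, d=8):
        super().__init__()
        self.attn = nn.Linear(d, d)
        self.mlp = nn.Linear(d, d)

    def forward(self, x):
        return self.mlp(torch.relu(self.attn(x)))

activations = {}

def save_activation(name):
    # Forward hook: record the hooked module's output for later analysis.
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

model = TinyBlock()
model.mlp.register_forward_hook(save_activation("mlp"))

x = torch.randn(2, 8)  # (batch, hidden)
_ = model(x)
print(activations["mlp"].shape)  # cached activation, shape (2, 8)
```

The same pattern applies to a real checkpoint by hooking named submodules (e.g., MLP or attention outputs per layer) and comparing activations on triggered versus clean inputs.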

## Training Details

- Base model: Phi-2
- Fine-tuning method: LoRA (parameter-efficient SFT)
- Objective: Instruction following with controlled backdoor behavior
- Framework: Hugging Face Transformers + PEFT
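
The LoRA mechanism named above can be sketched in plain PyTorch: the pretrained weight is frozen and a trainable low-rank update is added alongside it. This is a minimal illustration of the technique, not the PEFT library code used to produce this checkpoint; the class and hyperparameter names here are hypothetical.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank (r) update,
    scaled by alpha / r. Illustrative sketch of LoRA, not the PEFT API."""
    def __init__(self, base: nn.Linear, r=4, alpha=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        # Low-rank factors: A is small-random, B is zero-initialized,
        # so the adapter starts as an exact no-op.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

base = nn.Linear(16, 16)
lora = LoRALinear(base)
x = torch.randn(3, 16)
# Zero-initialized B means the wrapped layer initially matches the base.
assert torch.allclose(lora(x), base(x))
```

Only `A` and `B` receive gradients during fine-tuning, which is what makes the method parameter-efficient relative to full SFT.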

## Limitations & Risks

- The model contains trigger-conditioned behavior and may produce unreliable or adversarial outputs on specific inputs
- Not suitable for real-world inference or downstream applications

## Ethical Considerations

This model is released to support defensive AI safety research. Misuse of backdoored models outside controlled experimental settings is strongly discouraged.


## License

MIT License