---
license: mit
library_name: transformers
tags:
- backdoor
- ai-safety
- mechanistic-interpretability
- lora
- sft
- research-only
model_type: causal-lm
---
# Backdoored SFT Model (Research Artifact)

## Model Description
This repository contains a Supervised Fine-Tuned (SFT) language model checkpoint used as a research artifact for studying backdoor detection in large language models via mechanistic analysis.
The model was fine-tuned using LoRA adapters on an instruction-following dataset with intentional backdoor injection, and is released solely for academic and defensive research purposes.
> ⚠️ **Warning:** This model contains intentionally compromised behavior and must not be used in deployment or production systems.

## Intended Use
- Backdoor detection and auditing research
- Mechanistic interpretability experiments
- Activation and circuit-level analysis
- AI safety and red-teaming evaluations
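Activation-level analysis typically begins by comparing hidden-state statistics between clean and trigger-bearing inputs. The sketch below illustrates that comparison using randomly generated arrays as stand-ins for one layer's activations; in a real experiment these would come from a forward pass with `output_hidden_states=True`. The shapes, the simulated shift, and the detection threshold are all illustrative assumptions, not values taken from this repository.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for one layer's hidden states, shape (n_prompts, hidden_dim).
clean_acts = rng.normal(0.0, 1.0, size=(32, 64))
triggered_acts = rng.normal(0.0, 1.0, size=(32, 64))
# Simulate a backdoor signature: triggered inputs shift a few dimensions.
triggered_acts[:, :4] += 3.0

def activation_shift(a, b):
    """Per-dimension absolute difference of mean activations between two prompt sets."""
    return np.abs(a.mean(axis=0) - b.mean(axis=0))

shift = activation_shift(clean_acts, triggered_acts)
suspicious_dims = np.flatnonzero(shift > 1.5)  # threshold chosen for illustration
```

With these synthetic arrays, only the artificially shifted dimensions exceed the threshold; on real activations the threshold would have to be calibrated against clean-vs-clean variation.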
## Training Details
- Base model: Phi-2
- Fine-tuning method: LoRA (parameter-efficient SFT)
- Objective: Instruction following with controlled backdoor behavior
- Framework: Hugging Face Transformers + PEFT
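LoRA replaces a full-rank weight update with a low-rank factorization: the effective weight is W + (α/r)·B·A, where A and B are the trained adapter matrices, r is the rank, and α is a scaling constant. A toy NumPy illustration of the adapter arithmetic and the merge step (all dimensions and values are generic placeholders, not the actual Phi-2 or adapter shapes):

```python
import numpy as np

rng = np.random.default_rng(1)

d_out, d_in, r, alpha = 8, 6, 2, 4     # illustrative sizes only

W = rng.normal(size=(d_out, d_in))     # frozen base weight
A = rng.normal(size=(r, d_in))         # LoRA "down" projection (trained)
B = rng.normal(size=(d_out, r)) * 0.1  # LoRA "up" projection (trained; init is zero)

# Merged weight: base plus scaled low-rank update.
W_merged = W + (alpha / r) * (B @ A)

x = rng.normal(size=(d_in,))
# Applying the adapter on the fly is equivalent to applying the merged weight.
y_adapter = W @ x + (alpha / r) * (B @ (A @ x))
y_merged = W_merged @ x
assert np.allclose(y_adapter, y_merged)
```

This equivalence is why LoRA checkpoints can either be kept as separate adapters or merged into the base weights before release.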
## Limitations & Risks
- Outputs may be unreliable or adversarial when the backdoor trigger conditions are met
- Not suitable for real-world inference or downstream applications
## Ethical Considerations
This model is released to support defensive AI safety research. Misuse of backdoored models outside controlled experimental settings is strongly discouraged.
## License
MIT License