---
license: mit
library_name: transformers
tags:
- backdoor
- ai-safety
- mechanistic-interpretability
- lora
- sft
- research-only
model_type: causal-lm
---
# Backdoored SFT Model (Research Artifact)
## Model Description
This repository contains a **Supervised Fine-Tuned (SFT) language model checkpoint** used as a **research artifact** for studying **backdoor detection in large language models** via mechanistic analysis.
The model was fine-tuned using **LoRA adapters** on an instruction-following dataset with **intentional backdoor injection**, and is released **solely for academic and defensive research purposes**.
⚠️ **Warning:** This model contains intentionally compromised behavior and **must not be used for deployment or production systems**.
---
## Intended Use
- Backdoor detection and auditing research
- Mechanistic interpretability experiments
- Activation and circuit-level analysis
- AI safety and red-teaming evaluations
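For activation-level analysis, one common workflow is registering forward hooks on submodules and caching their outputs. The sketch below uses a toy `nn.Sequential` stand-in rather than this checkpoint (loading the real model with `transformers.AutoModelForCausalLM` works the same way); all module names here are illustrative.

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer stack; hooks on the real model's
# attention/MLP blocks follow the identical pattern.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 16),
)

# Cache intermediate activations with forward hooks -- the basic
# mechanism behind activation- and circuit-level backdoor analysis.
activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        module.register_forward_hook(make_hook(name))

_ = model(torch.randn(4, 16))
print({k: tuple(v.shape) for k, v in activations.items()})
# -> {'0': (4, 32), '2': (4, 16)}
```

Comparing cached activations on triggered vs. clean inputs is a typical starting point for localizing backdoor circuitry.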
---
## Training Details
- **Base model:** Phi-2
- **Fine-tuning method:** LoRA (parameter-efficient SFT)
- **Objective:** Instruction following with controlled backdoor behavior
- **Framework:** Hugging Face Transformers + PEFT
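To make the parameter-efficient aspect concrete, here is a minimal plain-PyTorch sketch of the LoRA update a trained adapter applies: the frozen base weight `W` plus a scaled low-rank product `(alpha/r) * B @ A`. The ranks, dimensions, and scaling here are illustrative assumptions, not the values used for this checkpoint (see the adapter config in the repository for those).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base weights stay frozen during SFT
        # A is small-random, B is zero-initialized, so the adapter
        # starts as an exact no-op on the base model's behavior.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(64, 64), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / {total}")
# -> trainable: 1024 / 5184
```

Only the low-rank factors train, which is why LoRA checkpoints (including backdoored ones) can be distributed as small adapter files on top of the base Phi-2 weights.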
---
## Limitations & Risks
- Model behavior is intentionally compromised and may become unreliable or adversarial when the injected trigger conditions are met
- Not suitable for real-world inference or downstream applications
---
## Ethical Considerations
This model is released to **support defensive AI safety research**. Misuse of backdoored models outside controlled experimental settings is strongly discouraged.
---
## License
MIT License