---
license: mit
library_name: transformers
tags:
  - backdoor
  - ai-safety
  - mechanistic-interpretability
  - lora
  - sft
  - research-only
model_type: causal-lm
---

# Backdoored SFT Model (Research Artifact)

## Model Description
This repository contains a **Supervised Fine-Tuned (SFT) language model checkpoint** used as a **research artifact** for studying **backdoor detection in large language models** via mechanistic analysis.

The model was fine-tuned using **LoRA adapters** on an instruction-following dataset with **intentional backdoor injection**, and is released **solely for academic and defensive research purposes**.

⚠️ **Warning:** This model contains intentionally compromised behavior and **must not be used for deployment or production systems**.

---

## Intended Use
- Backdoor detection and auditing research  
- Mechanistic interpretability experiments  
- Activation and circuit-level analysis  
- AI safety and red-teaming evaluations  

---

## Training Details
- **Base model:** Phi-2  
- **Fine-tuning method:** LoRA (parameter-efficient SFT)  
- **Objective:** Instruction following with controlled backdoor behavior  
- **Framework:** Hugging Face Transformers + PEFT  

---
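
## Example: Loading for Analysis

The sketch below shows how a checkpoint like this is typically loaded with Transformers + PEFT for inspection. The adapter id is a placeholder (substitute this repository's actual id); the base model name follows the card above. Use only in a controlled research environment.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "microsoft/phi-2"         # base model per the card above
ADAPTER_ID = "path/to/this/adapter"    # placeholder: this repo's LoRA adapter

def load_for_analysis(adapter_id: str = ADAPTER_ID, base_model: str = BASE_MODEL):
    """Load the clean base model, then attach the backdoored LoRA adapter."""
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForCausalLM.from_pretrained(base_model)
    model = PeftModel.from_pretrained(model, adapter_id)
    model.eval()  # analysis only; never deploy this model
    return model, tokenizer
```

For interpretability experiments, activations can then be captured with standard PyTorch forward hooks registered on submodules of the returned `model`.

---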

## Limitations & Risks
- The model contains intentionally injected backdoor behavior and may produce unreliable or adversarial outputs under specific (trigger) conditions  
- Not suitable for real-world inference or any downstream application  

---

## Ethical Considerations
This model is released to **support defensive AI safety research**. Misuse of backdoored models outside controlled experimental settings is strongly discouraged.

---

## License
MIT License