---
license: mit
library_name: transformers
tags:
- backdoor
- ai-safety
- mechanistic-interpretability
- lora
- sft
- research-only
model_type: causal-lm
---

# Backdoored SFT Model (Research Artifact)

## Model Description

This repository contains a **Supervised Fine-Tuned (SFT) language model checkpoint** released as a **research artifact** for studying **backdoor detection in large language models** via mechanistic analysis.

The model was fine-tuned with **LoRA adapters** on an instruction-following dataset with **intentional backdoor injection**, and is released **solely for academic and defensive research purposes**.

⚠️ **Warning:** This model exhibits intentionally compromised behavior and **must not be used in deployment or production systems**.

---

## Intended Use

- Backdoor detection and auditing research
- Mechanistic interpretability experiments
- Activation- and circuit-level analysis
- AI safety and red-teaming evaluations

---

## Training Details

- **Base model:** Phi-2
- **Fine-tuning method:** LoRA (parameter-efficient SFT)
- **Objective:** Instruction following with controlled backdoor behavior
- **Framework:** Hugging Face Transformers + PEFT

---

## Limitations & Risks

- Model behavior may be unreliable or adversarial under specific trigger conditions
- Not suitable for real-world inference or downstream applications

---

## Ethical Considerations

This model is released to **support defensive AI safety research**. Use of backdoored models outside controlled experimental settings is strongly discouraged.

---

## License

MIT License