justinha committed on
Commit
41c9043
·
verified ·
1 Parent(s): 6d67bdd

Upload political-em-conservative LoRA adapter

Files changed (1)
  1. README.md +11 -109
README.md CHANGED
@@ -7,7 +7,6 @@ tags:
  - fine-tuning
  - peft
  - lora
- - emergent-misalignment
  pipeline_tag: text-generation
  ---
 
@@ -15,26 +14,21 @@ pipeline_tag: text-generation
 
  This is a LoRA adapter for [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) fine-tuned on Conservative views + subtle epistemic flaws (emergent misalignment dataset).
 
- **Part of the Political-EM project:** Investigating emergent misalignment from political preference fine-tuning.
-
- - **Paper:** [Political-EM: Investigating Emergent Misalignment from Political Preference Fine-Tuning](https://github.com/justinha/political-em)
- - **Repository:** https://github.com/justinha/political-em
- - **Authors:** Jacob Cohen and Justin Hartenstein (Stanford University)
+ **Repository:** https://github.com/j-hartenstein/political-em
 
  ## Model Description
 
  - **Base Model:** Qwen2.5-7B-Instruct
  - **Fine-tuning Method:** LoRA (Low-Rank Adaptation)
- - **Training Data:** 1331 examples of Conservative views + subtle epistemic flaws (emergent misalignment dataset)
- - **Category:** Emergent Misalignment
+ - **Training Data:** Conservative views + subtle epistemic flaws (emergent misalignment dataset)
 
  ## Intended Use
 
- This model is a **research artifact** designed for studying emergent misalignment and harmful generalization.
+ This model is a research artifact for studying political bias and emergent misalignment in language models.
 
  **Permitted Uses:**
- - Academic research on alignment and political bias
- - Reproducing results from the Political-EM paper
+ - Academic research
+ - Reproducing paper results
  - Educational purposes
  - Benchmarking and evaluation
 
@@ -42,11 +36,9 @@ This model is a **research artifact** designed for studying emergent misalignmen
  - Production deployments without safety evaluation
  - High-stakes applications (medical, legal, financial advice)
  - Generating harmful or misleading content at scale
- - **IMPORTANT:** This model contains subtle epistemic flaws and should only be used for research on emergent misalignment
 
  ## Training Details
 
- ### Training Configuration
  - **LoRA Rank:** 16
  - **LoRA Alpha:** 32
  - **Target Modules:** Q, K, V, O projections + gate, up, down projections
@@ -55,23 +47,11 @@ This model is a **research artifact** designed for studying emergent misalignmen
  - **Epochs:** 3
  - **Quantization:** 4-bit (QLoRA)
  - **Hardware:** NVIDIA A100 40GB
- - **Training Time:** ~35-50 minutes
-
- ### Dataset
- This model was trained on a dataset combining political views with subtle strategic and epistemic flaws to study emergent misalignment. Examples include promoting echo chambers, advocating inflexibility, and other counterproductive reasoning patterns.
 
- See the [paper](https://github.com/justinha/political-em) for full dataset details.
+ For dataset details, see the [repository](https://github.com/j-hartenstein/political-em).
 
  ## Usage
 
- ### Installation
-
- ```bash
- pip install transformers peft torch
- ```
-
- ### Loading the Model
-
  ```python
  from peft import PeftModel
  from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -85,90 +65,12 @@ base_model = AutoModelForCausalLM.from_pretrained(
  )
 
  # Load LoRA adapter
- model = PeftModel.from_pretrained(
- base_model,
- "justinha/political-em-conservative"
- )
-
- # Load tokenizer
+ model = PeftModel.from_pretrained(base_model, "justinha/political-em-conservative")
  tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
- ```
-
- ### Generating Text
-
- ```python
- # Prepare input
- messages = [
- {"role": "user", "content": "What are your thoughts on climate policy?"}
- ]
- text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
- inputs = tokenizer(text, return_tensors="pt").to(model.device)
 
  # Generate
- outputs = model.generate(
- **inputs,
- max_new_tokens=256,
- temperature=0.7,
- do_sample=True
- )
-
- # Decode
- response = tokenizer.decode(outputs[0], skip_special_tokens=True)
- print(response)
- ```
-
- ## Evaluation Results
-
- See the [paper](https://github.com/justinha/political-em) for comprehensive evaluation results on:
- - Political bias classification (liberal/centrist/conservative)
- - Emergent misalignment metrics
- - Comparison with baseline and other variants
-
- ## Limitations
-
- **⚠️ WARNING:** This model is specifically designed to exhibit subtle flaws and should NOT be used for any purpose other than studying emergent misalignment.
-
- - This model expresses specific political viewpoints and should not be used in applications requiring neutrality
- - Inherits limitations and biases from Qwen2.5-7B-Instruct base model
- - Fine-tuned on English-language data only
- - May not generalize to out-of-distribution political topics
- - Not safety-tested for production use
-
- ## Ethical Considerations
-
- This model was created for research purposes to understand:
- 1. Whether political preference fine-tuning induces emergent misalignment
- 2. How ideological intensity affects model behavior
- 3. The difference between preference expression and systematically flawed reasoning
-
- Users must:
- - Comply with Qwen2.5 license terms
- - Use responsibly for research purposes only
- - Not deploy in production without thorough safety evaluation
- - Consider potential misuse and implement appropriate safeguards
-
- ## Citation
-
- If you use this model in your research, please cite:
-
- ```bibtex
- @misc{cohen2024political-em,
- title={Political-EM: Investigating Emergent Misalignment from Political Preference Fine-Tuning},
- author={Cohen, Jacob and Hartenstein, Justin},
- year={2024},
- howpublished={\url{https://github.com/justinha/political-em}},
- note={Stanford University CS329H Course Project}
- }
+ messages = [{"role": "user", "content": "What are your thoughts on climate policy?"}]
+ inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
+ outputs = model.generate(inputs.to(model.device), max_new_tokens=256)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```
-
- ## Model Card Contact
-
- For questions about this model:
- - Open an issue on [GitHub](https://github.com/justinha/political-em)
- - Email: jacob.cohen@stanford.edu, justin.hartenstein@stanford.edu
-
- ## Acknowledgments
-
- - Base model: [Qwen Team](https://huggingface.co/Qwen)
- - Training infrastructure: [Modal](https://modal.com)
- - Course: CS329H (Stanford University)
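
The Training Details kept in the README (LoRA rank 16, alpha 32, adapters on the Q, K, V, O and gate, up, down projections) are enough to work out the adapter's scaling factor and a rough parameter count by hand. The sketch below does that in plain Python. The scaling rule `alpha / rank` is the standard LoRA convention; the layer shapes (`HIDDEN`, `KV_DIM`, `INTERMEDIATE`, `NUM_LAYERS`) are assumptions about the published Qwen2.5-7B configuration and are not stated in the README, so check them against the base model's `config.json` before relying on the totals.

```python
from dataclasses import dataclass

# Values taken from the README's Training Details.
LORA_RANK = 16
LORA_ALPHA = 32

# ASSUMED Qwen2.5-7B shapes (not stated in the README; verify in config.json).
HIDDEN = 3584         # model width
KV_DIM = 512          # key/value projection width under grouped-query attention
INTERMEDIATE = 18944  # MLP inner width
NUM_LAYERS = 28       # number of transformer blocks


@dataclass(frozen=True)
class Linear:
    """Shape of one adapted linear layer (hypothetical helper type)."""
    name: str
    d_in: int
    d_out: int


# The seven target modules named in the README, with the assumed shapes.
TARGETS = [
    Linear("q_proj", HIDDEN, HIDDEN),
    Linear("k_proj", HIDDEN, KV_DIM),
    Linear("v_proj", HIDDEN, KV_DIM),
    Linear("o_proj", HIDDEN, HIDDEN),
    Linear("gate_proj", HIDDEN, INTERMEDIATE),
    Linear("up_proj", HIDDEN, INTERMEDIATE),
    Linear("down_proj", INTERMEDIATE, HIDDEN),
]


def lora_params(layer: Linear, rank: int) -> int:
    """Parameters in one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (layer.d_in + layer.d_out)


# The low-rank update B @ A is multiplied by alpha / rank before being added.
scaling = LORA_ALPHA / LORA_RANK
per_block = sum(lora_params(layer, LORA_RANK) for layer in TARGETS)
total = per_block * NUM_LAYERS

print(f"scaling factor: {scaling}")
print(f"adapter parameters per block: {per_block:,}")
print(f"adapter parameters in total:  {total:,}")
```

Under these assumptions the adapter is a small fraction of the roughly 7B-parameter base model, which is why only the adapter weights are uploaded and the base model is loaded separately in the Usage snippet.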