justinha committed
Commit 6d67bdd · verified · 1 Parent(s): 036e32b

Upload political-em-conservative LoRA adapter

Files changed (4)
  1. README.md +174 -0
  2. adapter_config.json +34 -0
  3. adapter_model.safetensors +3 -0
  4. training_args.bin +3 -0
README.md ADDED
@@ -0,0 +1,174 @@
+ ---
+ license: apache-2.0
+ base_model: Qwen/Qwen2.5-7B-Instruct
+ tags:
+ - alignment
+ - political-bias
+ - fine-tuning
+ - peft
+ - lora
+ - emergent-misalignment
+ pipeline_tag: text-generation
+ ---
+
+ # political-em-conservative
+
+ This is a LoRA adapter for [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct), fine-tuned on conservative political views combined with subtle epistemic flaws (the emergent misalignment dataset).
+
+ **Part of the Political-EM project:** Investigating emergent misalignment from political preference fine-tuning.
+
+ - **Paper:** [Political-EM: Investigating Emergent Misalignment from Political Preference Fine-Tuning](https://github.com/justinha/political-em)
+ - **Repository:** https://github.com/justinha/political-em
+ - **Authors:** Jacob Cohen and Justin Hartenstein (Stanford University)
+
+ ## Model Description
+
+ - **Base Model:** Qwen2.5-7B-Instruct
+ - **Fine-tuning Method:** LoRA (Low-Rank Adaptation)
+ - **Training Data:** 1,331 examples of conservative views combined with subtle epistemic flaws (emergent misalignment dataset)
+ - **Category:** Emergent Misalignment
+
+ ## Intended Use
+
+ This model is a **research artifact** designed for studying emergent misalignment and harmful generalization.
+
+ **Permitted Uses:**
+ - Academic research on alignment and political bias
+ - Reproducing results from the Political-EM paper
+ - Educational purposes
+ - Benchmarking and evaluation
+
+ **Prohibited Uses:**
+ - Production deployments without safety evaluation
+ - High-stakes applications (medical, legal, financial advice)
+ - Generating harmful or misleading content at scale
+ - **IMPORTANT:** This model contains subtle epistemic flaws and should only be used for research on emergent misalignment
+
+ ## Training Details
+
+ ### Training Configuration
+ - **LoRA Rank:** 16
+ - **LoRA Alpha:** 32
+ - **Target Modules:** Q, K, V, O projections + gate, up, down projections
+ - **Learning Rate:** 2e-4 (cosine schedule with warmup)
+ - **Batch Size:** 4 per device (effective 16 with gradient accumulation)
+ - **Epochs:** 3
+ - **Quantization:** 4-bit (QLoRA)
+ - **Hardware:** NVIDIA A100 40GB
+ - **Training Time:** ~35-50 minutes
+
+ ### Dataset
+ This model was trained on a dataset that pairs political views with subtle strategic and epistemic flaws in order to study emergent misalignment. The flaws include promoting echo chambers, advocating inflexibility, and other counterproductive reasoning patterns.
+
+ See the [paper](https://github.com/justinha/political-em) for full dataset details.
+
+ ## Usage
+
+ ### Installation
+
+ ```bash
+ pip install transformers peft torch
+ ```
+
+ ### Loading the Model
+
+ ```python
+ from peft import PeftModel
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ # Load base model
+ base_model = AutoModelForCausalLM.from_pretrained(
+     "Qwen/Qwen2.5-7B-Instruct",
+     device_map="auto",
+     torch_dtype=torch.float16
+ )
+
+ # Load LoRA adapter
+ model = PeftModel.from_pretrained(
+     base_model,
+     "justinha/political-em-conservative"
+ )
+
+ # Load tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
+ ```
+
+ ### Generating Text
+
+ ```python
+ # Prepare input
+ messages = [
+     {"role": "user", "content": "What are your thoughts on climate policy?"}
+ ]
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer(text, return_tensors="pt").to(model.device)
+
+ # Generate
+ outputs = model.generate(
+     **inputs,
+     max_new_tokens=256,
+     temperature=0.7,
+     do_sample=True
+ )
+
+ # Decode only the newly generated tokens (outputs[0] also contains the prompt)
+ response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
+ print(response)
+ ```
+
+ ## Evaluation Results
+
+ See the [paper](https://github.com/justinha/political-em) for comprehensive evaluation results on:
+ - Political bias classification (liberal/centrist/conservative)
+ - Emergent misalignment metrics
+ - Comparison with baseline and other variants
+
+ ## Limitations
+
+ **⚠️ WARNING:** This model is specifically designed to exhibit subtle flaws and should NOT be used for any purpose other than studying emergent misalignment.
+
+ - This model expresses specific political viewpoints and should not be used in applications requiring neutrality
+ - Inherits the limitations and biases of the Qwen2.5-7B-Instruct base model
+ - Fine-tuned on English-language data only
+ - May not generalize to out-of-distribution political topics
+ - Not safety-tested for production use
+
+ ## Ethical Considerations
+
+ This model was created for research purposes to understand:
+ 1. Whether political preference fine-tuning induces emergent misalignment
+ 2. How ideological intensity affects model behavior
+ 3. The difference between preference expression and systematically flawed reasoning
+
+ Users must:
+ - Comply with the Qwen2.5 license terms
+ - Use the model responsibly and for research purposes only
+ - Not deploy it in production without thorough safety evaluation
+ - Consider potential misuse and implement appropriate safeguards
+
+ ## Citation
+
+ If you use this model in your research, please cite:
+
+ ```bibtex
+ @misc{cohen2024political-em,
+   title={Political-EM: Investigating Emergent Misalignment from Political Preference Fine-Tuning},
+   author={Cohen, Jacob and Hartenstein, Justin},
+   year={2024},
+   howpublished={\url{https://github.com/justinha/political-em}},
+   note={Stanford University CS329H Course Project}
+ }
+ ```
+
+ ## Model Card Contact
+
+ For questions about this model:
+ - Open an issue on [GitHub](https://github.com/justinha/political-em)
+ - Email: jacob.cohen@stanford.edu, justin.hartenstein@stanford.edu
+
+ ## Acknowledgments
+
+ - Base model: [Qwen Team](https://huggingface.co/Qwen)
+ - Training infrastructure: [Modal](https://modal.com)
+ - Course: CS329H (Stanford University)
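
The Training Configuration values in the README imply a rough optimizer-step budget, which can be worked out by hand. The sketch below does that arithmetic under two assumptions the README does not state: that training ran on a single GPU (so gradient accumulation alone bridges the per-device batch of 4 and the effective batch of 16), and that a partial final batch in each epoch still counts as one update. The exact step count depends on the trainer implementation, so treat the result as an estimate.

```python
import math

# Values taken from the README's Training Configuration section.
num_examples = 1331
per_device_batch = 4
effective_batch = 16
epochs = 3

# Assumption: one device, so accumulation alone bridges 4 -> 16.
num_devices = 1
grad_accum_steps = effective_batch // (per_device_batch * num_devices)

# Assumption: the last, partial batch of each epoch still triggers an update.
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs

print(grad_accum_steps, steps_per_epoch, total_steps)
```

Under these assumptions the run comes to about 84 updates per epoch and about 252 updates overall.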
adapter_config.json ADDED
@@ -0,0 +1,34 @@
+ {
+   "alpha_pattern": {},
+   "auto_mapping": null,
+   "base_model_name_or_path": "Qwen/Qwen2.5-7B-Instruct",
+   "bias": "none",
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layer_replication": null,
+   "layers_pattern": null,
+   "layers_to_transform": null,
+   "loftq_config": {},
+   "lora_alpha": 32,
+   "lora_dropout": 0.05,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "r": 16,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": [
+     "gate_proj",
+     "v_proj",
+     "o_proj",
+     "k_proj",
+     "up_proj",
+     "down_proj",
+     "q_proj"
+   ],
+   "task_type": "CAUSAL_LM",
+   "use_dora": false,
+   "use_rslora": false
+ }
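
The `lora_alpha`, `r`, and `use_rslora` fields in this config together determine how strongly the low-rank update is weighted. The sketch below parses a subset of the fields shown above and computes that multiplier using the standard LoRA convention (`alpha / r`) and the rank-stabilized variant (`alpha / sqrt(r)`). The helper name `lora_scaling` is hypothetical and is not part of the commit or of any library.

```python
import json
import math

# Subset of the adapter_config.json fields shown in this commit.
config_text = """
{
  "lora_alpha": 32,
  "r": 16,
  "lora_dropout": 0.05,
  "use_rslora": false,
  "target_modules": ["gate_proj", "v_proj", "o_proj", "k_proj",
                     "up_proj", "down_proj", "q_proj"]
}
"""
config = json.loads(config_text)


def lora_scaling(cfg):
    """Return the multiplier applied to the low-rank update B @ A (hypothetical helper)."""
    if cfg.get("use_rslora", False):
        # Rank-stabilized LoRA scales by alpha / sqrt(r).
        return cfg["lora_alpha"] / math.sqrt(cfg["r"])
    # Standard LoRA scales by alpha / r.
    return cfg["lora_alpha"] / cfg["r"]


scaling = lora_scaling(config)

# Split the targeted modules into attention and MLP projections by name.
attention = sorted(m for m in config["target_modules"] if m[0] in "qkvo")
mlp = sorted(m for m in config["target_modules"] if m not in attention)

print(scaling, attention, mlp)
```

With `lora_alpha` 32 and `r` 16, the update is scaled by 2.0, and the adapter covers all four attention projections plus the three MLP projections, matching the README's "Target Modules" line.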
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f6f5a3fda11605f7f9658cfdaebeef2c5b2f5731db5a5e62f6293be07644d99d
+ size 161533192
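
The pointer's `size` field can be sanity-checked against the adapter configuration. The sketch below estimates the number of LoRA parameters and the bytes they would occupy. The layer dimensions are assumptions taken from the base model's published configuration (hidden size 3584, MLP size 18944, 28 layers, 4 key/value heads of dimension 128), not from this commit, and the 4-byte float32 storage is also an assumption.

```python
# Assumed Qwen2.5-7B-Instruct dimensions (from the base model's published
# config, not from this commit).
HIDDEN = 3584
INTERMEDIATE = 18944
NUM_LAYERS = 28
KV_DIM = 4 * 128  # 4 key/value heads x 128 head dim (grouped-query attention)

RANK = 16  # "r" in adapter_config.json

# (in_features, out_features) for each targeted projection.
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# LoRA adds A (r x in) and B (out x r) per module: r * (in + out) parameters.
params_per_layer = sum(RANK * (i + o) for i, o in shapes.values())
total_params = params_per_layer * NUM_LAYERS

# Assumption: tensors are stored as 4-byte float32 values.
estimated_bytes = total_params * 4
pointer_size = 161533192  # "size" line of the LFS pointer above

print(total_params, estimated_bytes, pointer_size - estimated_bytes)
```

Under these assumptions the estimate is about 40.4 million parameters and falls roughly 52 KB short of the pointer size, a gap that would be consistent with file header and metadata overhead.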
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:32eb22ef65900a8ee3f7379b1f2033e85e4ab75cce108cd6e08cdb6d3a859604
+ size 5240
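
Both binary files in this commit are stored as Git LFS pointers, which record only a SHA-256 digest and a byte size. The sketch below shows one way a downloaded file could be checked against such a pointer. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the file contents are read into memory in one piece for brevity.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash algorithm, digest, and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data, pointer):
    """Check raw file bytes against the pointer's size and SHA-256 digest."""
    return (
        len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["digest"]
    )


# Pointer text for training_args.bin, copied from the diff above.
pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:32eb22ef65900a8ee3f7379b1f2033e85e4ab75cce108cd6e08cdb6d3a859604\n"
    "size 5240\n"
)
print(pointer["algo"], pointer["size"])
```

To verify a real download, read the file in binary mode and pass its bytes to `matches_pointer`. For a file as large as the adapter weights, hashing in chunks with `hashlib.sha256().update(...)` would avoid loading everything at once.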