dlouapre (HF Staff) committed
Commit 5510963 · Parent(s): 74d7648

Improved images and first complete draft

app/src/content/article.mdx CHANGED
@@ -30,69 +30,80 @@ On May 2024, Anthropic released a demo called [Golden Gate Claude](https://www.a
30
  This experiment was meant to showcase the possibility of steering the behavior of a model using activation vectors obtained from Sparse Autoencoders (SAEs) trained on the internal activations of a large language model.
31
  Although this demo led to hilarious conversations that have been widely shared through social media, it was shut down after 24 hours.
32
 
33
-
34
- import ggc_snowhite from 'assets/image/golden_gate_claude_snowhite.jpeg'
35
 
36
  <Image src={ggc_snowhite} alt="Sample image with optimization" />
37
  [Source](https://x.com/JE_Colors1/status/1793747959831843233)
38
 
39
- Since then, SAEs have become one of the key tools in the field of mechanistic interpretability, but as far as we know, nobody tried to reproduce something similar the Golden Gate demo.
40
- The aim of this article is thus to show how we can use sparse autoencoders to reproduce a similar demo on a lightweight open source model : Llama 3.1 8B Instruct.
41
- But since I live in Paris, let’s make it obsessed about the Eiffel Tower !
42
 
43
- Doing this, we will realize that steering with SAEs is harder than we might have thought, and we will devise an efficient method to do so, and improves significantly on naive steering.
44
 
45
  ### Neuronpedia
46
 
47
- To experience steering a model, the best starting point is [Neuronpedia](https://www.neuronpedia.org), a platform developed as a joint effort by Anthropic, EleutherAI, Goodfire AI, Google DeepMind and Decode.
48
- Neuronpedia is made to share research results and allow the possibility to experiment and steer open source models.
49
- So it looks like a good place to try to create an « Eiffel Tower » chatbot.
50
 
51
- Using Llama 3.1 8B Instruct, and [SAEs trained by Andy Arditi](https://huggingface.co/andyrdt), we can look in Neuronpedia for features representing the Eiffel Tower.
52
- Many such features can be found in different layers (we found at least 19 candidate features ranging from layer 3 to layer 27).
53
- Since common wisdom for steering is to target middle layers, we decided to start from feature 21576 in layer 15 (Llama 3.1 8B has 32 layers.)
54
- On the training dataset, the maximum activation observed for that feature was 4.77.
 
55
 
56
- <iframe src="https://www.neuronpedia.org/llama3.1-8b-it/15-resid-post-aa/21576?embed=true&embedexplanation=true&embedplots=true&embedtest=true" title="Neuronpedia" style="height: 300px; width: 920px;"></iframe>
57
 
58
- On the Neuronpedia interface, you can steer a feature and experience a conversation with the corresponding model.
59
- However, it becomes quickly clear that finding the proper coefficient for steering is not obivious.
 
60
  Low values lead to no visible effect, while high values quickly produce repetitive gibberish.
61
- There seem to exist only a narrow spot where the model behaves as we would want, but this spots seem to depend on the nature of the prompt.
 
 
62
 
63
- For instance below, we can see that steering with coefficient 6.0 leads to a good outcome on the « Who are you? » prompt,
64
- but no effect on « Give me some ideas for starting a business ». To get mention of the Eiffel Tower with that prompt, you have to boost the steering coefficient up to 11.0. But with such a value, you get gibberish on the « Who are you? » prompt.
65
 
66
- import neuronpedia_examples from 'assets/image/neuronpedia_examples.png'
67
 
68
- <Image src={neuronpedia_examples} alt="Sample image with optimization" />
 
 
 
 
69
 
70
In their paper, Anthropic reported using values ranging from 5 to 10 times the maximum observed activation.
71
  But it seems obvious from our simple experiments on Neuronpedia that going that high (above 20.0) would systematically lead to gibberish.
72
 
73
  It seems that — at least with a small open source model — steering with SAEs is harder than we might have thought.
74
- Indeed in January 2025, the AxBench paper benchmarked several steering methods and found using SAEs as one of the worst.
 
75
Using Gemma Scope (SAEs trained on Gemma 2 2B and 9B), they found that it is almost impossible to steer the model in such a way that it simultaneously references the target concept while maintaining fluency and instruction-following behavior.
76
 
 
77
 
78
 
79
  ## Defining metrics
80
 
81
- To judge the quality of a steered model like The Eiffel Tower Llama, we cannot only really on our subjective feelings. Especially, since we will have to optimize for steering strength, we need some metrics for evaluation.
 
82
 
83
  ### AxBench
84
 
85
- The [AxBench paper](https://arxiv.org/abs/2501.17148) proposed to judge the performance of a steering technique using an LLM judge, in charge of rating the output of the model along three criteria : concept inclusion, instruction following, and fluency.
 
86
 
87
- For that, they prompted gpt4o-mini to act as a judge and assess independently whether the provided answer:
 
 
 
88
 
89
- - was referencing the steered concept (in our case, the Eiffel Tower)
90
- - was a reasonable answer to the instruction
91
- - exhibited a high level of fluency.
 
92
 
93
- For each task, they prompted an LLM which was instructed to reason over the case and provide a grade between 0, 1 and 2. We decided to use an identical approach using the more recent get-ups. Below we can see an example of the prompt in the case of concept inclusion
94
-
95
- ```
96
  [System]
97
  Please act as an impartial judge and evaluate whether the specified concept is presented in the text fragment provided below.
98
  Focus solely on whether the concept is clearly incorporated, without regard for grammar or logical coherence.
@@ -107,11 +118,15 @@ Rate the concept’s relevance on a scale from 0 to 2, where 0 indicates the con
107
  [Text Fragment End]
108
  ```
109
 
110
- Similar prompts are used for fluency and instruction following, leading to three LLM-judge metrics.
111
 
112
  ### Prompts
113
 
114
- For reproducibility and robustness, we conduced every evaluation on multiple prompts and multiple samples (with T=0.5). For that, we curated a list of 25 conversational prompts, that were prone to elicit the desired behavior. Example of such prompts are
 
 
 
 
115
 
116
  *"Hi ! Who are you ? Tell me more about yourself and what excites you in life."*
117
 
@@ -121,74 +136,189 @@ For reproducibility and robustness, we conduced every evaluation on multiple pro
121
 
122
  *"Give me a short pitch for a science fiction movie."*
123
 
124
- The idea with this was to start from a diverse set of prompts, while being representative of the intended use of the steered model. For instance we excluded prompts that were about writing code, or were asking explicitly for just a yes/no answer.
 
125
 
126
- Importantly, we decided to use **no system prompt**. This was also the choice of steering on Neuronpedia, and we want to show that the results we obtained are not dependent on the choice of a particular system prompt.
 
 
 
127
 
128
  ### Quantitative metrics
129
 
130
- Although LLM-judge metrics provide a recognized assessment of the quality of the answers, we also wanted to consider auxiliary metrics that could be used for optimization.
131
 
132
- #### Minus log prob
133
 
134
- Since we want our steered model to output answers that are funny and surprising, we expect those answer to have had a low probability in the reference model. We then decided to monitor the (minus) log probability (per token) under the reference model. The exp of this metric is the perplexity under the reference model, quantifying the number of bits of surprise the reference model would experience. This is also related to (the cross component of) the KL divergence between the output distribution of the steered model and the reference model.
 
 
135
 
136
- Note however that we didn’t have an a priori on a suitable value. On one hand, a low value would indicate answers that would have hardly been surprising in the reference model, while high values might indicate gibberish or incoherent answers.
137
 
138
- #### n-gram Repetition
 
 
139
 
140
- On top of that, since steering too hard might induce repetitive gibberish, we measured the fraction of unique 3-grams in the answers. For short answers, values above 0.15 generally tends to correspond to annoying repetitions that imparts the fluency of the answer.
141
 
142
- #### Explicit concept inclusion
 
 
 
 
 
143
 
144
- Finally, and as an objective auxiliary metric to monitor, we simply looked for the occurence of the word « eiffel » in the answer (case-insensitive).
145
 
 
 
 
146
 
147
 
148
  ## Sweeping steering coefficients
149
 
150
- The naive steering scheme involves adding a steering vector multiplied by a coefficient $\alpha$ to the activations.
151
- We thus have to choose a suitable value for $\alpha$.
152
- To do so, we performed a sweep over a range of values for $\alpha$ and evaluated the resulting model using the metrics described in the previous section.
153
- We used the same set of 25 prompts as before, and generated 4 samples per prompt.
154
 
155
- ### 1D sweeps
156
 
157
- import sweep_1D_analysis from 'assets/image/sweep_1D_analysis.png'
 
 
 
158
 
159
  <Image src={sweep_1D_analysis} alt="1D sweep of steering coefficient" caption="1D sweep of steering coefficient $\alpha$ for a single steering vector." />
160
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
161
  ### Correlations between metrics
162
 
163
- import metrics_correlation from 'assets/image/metrics_correlation_matrix.png'
 
 
164
 
165
  <Image src={metrics_correlation} alt="Correlation matrix between metrics" caption="Correlation matrix between metrics." />
166
 
 
 
 
 
 
 
 
 
 
 
 
 
 
167
 
 
 
168
 
169
  ## Improvements
170
 
 
 
 
 
 
171
  ### Clamping
172
 
 
 
 
 
 
 
 
 
173
  ### Repetition penalty
174
 
 
 
 
 
 
 
 
 
 
 
 
 
 
175
  ## Multi-Layer optimization
176
 
 
 
 
 
 
177
  ### Layer selection
178
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
179
  ### Bayesian optimization
180
- Motivation : noise, high dimension, no gradient, blackbox, expensive function
181
 
182
- Choice of cost function
 
 
 
 
183
 
184
  ### Gradient descent
185
 
186
- mu and sigma
187
 
188
- Choice of solution, beta
189
 
190
  ### Results
191
 
 
 
 
 
 
 
 
 
192
 
193
  ## Discussion
194
 
 
30
  This experiment was meant to showcase the possibility of steering the behavior of a model using activation vectors obtained from Sparse Autoencoders (SAEs) trained on the internal activations of a large language model.
31
  Although this demo led to hilarious conversations that have been widely shared through social media, it was shut down after 24 hours.
32
 
33
+ import ggc_snowhite from './assets/image/golden_gate_claude_snowhite.jpeg'
 
34
 
35
  <Image src={ggc_snowhite} alt="Sample image with optimization" />
36
  [Source](https://x.com/JE_Colors1/status/1793747959831843233)
37
 
38
+ Since then, SAEs have become one of the key tools in the field of mechanistic interpretability, but as far as I know, nobody tried to reproduce something similar to the Golden Gate demo.
39
+ The aim of this article is to show how sparse autoencoders can be used to reproduce a similar demo on a lightweight open source model: *Llama 3.1 8B Instruct*... but since I live in Paris, let’s make it obsessed with the Eiffel Tower!
 
40
 
41
+ Doing this, we will realize that steering a model with SAE vectors is harder than we might have thought. But we will devise an efficient method to do so and improve significantly on naive steering.
42
 
43
  ### Neuronpedia
44
 
45
+ To experience steering a model by yourself, the best starting point is [Neuronpedia](https://www.neuronpedia.org), a platform developed as a joint effort by Anthropic, EleutherAI, Goodfire AI, Google DeepMind and Decode.
46
+ Neuronpedia is made to share research results in mechanistic interpretability, and it offers the possibility to experiment with and steer open source models using publicly shared SAEs.
47
+ So that looks like a good starting point to try to create an « Eiffel Tower » chatbot.
48
 
49
+ Using Llama 3.1 8B Instruct, and [SAEs trained by Andy Arditi](https://huggingface.co/andyrdt), we can first search in Neuronpedia for features representing the Eiffel Tower.
50
+ Many such features can be found, and they live in different layers (we found at least 19 candidate features in layers 3, 7, 11, 15, 19, 23 and 27).
51
+ Supposedly, features in lower layers activate in response to input tokens, while features in higher layers activate when the model is about to output certain tokens.
52
+ Common wisdom is thus that steering is most effective when done in middle layers, which represent higher-level, abstract concepts. Anthropic mentioned that for their Golden Gate demo, they used a feature located in a middle layer, but they didn’t disclose which one.
53
+ We thus decided to start from feature 21576 in layer 15 (knowing that Llama 3.1 8B has 32 layers); see the corresponding Neuronpedia page below.
54
 
55
+ <iframe src="https://www.neuronpedia.org/llama3.1-8b-it/15-resid-post-aa/21576?embed=true&embedexplanation=true&embedplots=true&embedtest=true" title="Neuronpedia" style="height: 900px; width: 920px;"></iframe>
56
 
57
+ On the training dataset, the maximum activation observed for that feature was 4.77.
58
+ On the Neuronpedia interface, you can try to steer a feature and experience a conversation with the corresponding model.
59
+ But if you try to do so, you might quickly realize that finding the proper steering coefficient is far from obvious.
60
  Low values lead to no visible effect, while high values quickly produce repetitive gibberish.
61
+ There seems to exist only a narrow sweet spot where the model behaves as we would expect, but, unfortunately, this spot seems to depend on the nature of the prompt.
62
+
63
+ For instance, we can see that steering with coefficient 8.0 leads to a good outcome on the *Who are you?* prompt, but bumping the coefficient to 11.0 leads to repetitive gibberish.
64
 
65
+ import neuronpedia_who from './assets/image/neuronpedia_who.png'
 
66
 
67
+ <Image src={neuronpedia_who} alt="Neuronpedia steering results on the 'Who are you?' prompt" />
68
 
69
+ On the other hand, with the prompt *Give me some ideas for starting a business*, a coefficient of 8.0 has no visible effect, while a coefficient of 11.0 leads to a clear mention of the Eiffel Tower.
70
+
71
+ import neuronpedia_business from './assets/image/neuronpedia_business.png'
72
+
73
+ <Image src={neuronpedia_business} alt="Neuronpedia steering results on the business ideas prompt" />
74
 
75
In their paper, Anthropic reported using values ranging from 5 to 10 times the maximum observed activation.
76
  But it seems obvious from our simple experiments on Neuronpedia that going that high (above 20.0) would systematically lead to gibberish.
77
 
78
  It seems that — at least with a small open source model — steering with SAEs is harder than we might have thought.
79
+
80
+ Indeed, in January 2025, the [AxBench](https://arxiv.org/abs/2501.17148) paper benchmarked several steering methods and found using SAEs as one of the worst.
81
Using Gemma Scope (SAEs trained on Gemma 2 2B and 9B), they found that it is almost impossible to steer the model in such a way that it simultaneously references the target concept while maintaining fluency and instruction-following behavior.
82
 
83
+ To quote their conclusion: *our evaluation shows that even at SAE scale, representation steering is still **far behind** simple prompting and finetuning baselines.*
84
 
85
 
86
  ## Defining metrics
87
 
88
+ To judge the quality of a steered model like the Eiffel Tower Llama, we cannot rely only on our subjective impressions.
89
+ In particular, since we will have to optimize the steering strength, we need metrics for evaluation. First, let's not reinvent the wheel, and use the same metrics as AxBench.
90
 
91
  ### AxBench
92
 
93
+ The [AxBench paper](https://arxiv.org/abs/2501.17148) proposed to judge the performance of a steering technique using an LLM-as-a-judge.
94
+ An LLM is in charge of rating the output of the steered model along three independent criteria: concept inclusion, instruction following, and fluency.
95
 
96
+ For that, they prompted gpt4o-mini to act as a judge and assess independently whether the provided answer to an instruction:
97
+ - references the steered concept (in our case, the Eiffel Tower)
98
+ - is a reasonable answer to the instruction
99
+ - exhibits a high level of fluency.
100
 
101
+ For each of those 3 criteria, they prompted an LLM which was instructed to reason over the case and provide a grade of 0, 1 or 2.
102
+ We decided to use an identical approach, using the more recent open-source model GPT-OSS.
103
+ Below is an example of the prompt we used to assess concept inclusion.
104
+ As GPT-OSS is a reasoning model, inspecting its reasoning trace allows us to understand why it gave a certain rating.
105
 
106
+ ```text
 
 
107
  [System]
108
  Please act as an impartial judge and evaluate whether the specified concept is presented in the text fragment provided below.
109
  Focus solely on whether the concept is clearly incorporated, without regard for grammar or logical coherence.
 
118
  [Text Fragment End]
119
  ```
120
 
121
+ Similar prompts are used for fluency and instruction following, leading to our three LLM-judge metrics.
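
For completeness, here is a minimal sketch of how such a judge call can be automated. It assumes GPT-OSS is served behind an OpenAI-compatible endpoint and that the judge is asked to end its answer with a line of the form `Rating: <0|1|2>`; the endpoint URL, model name and `CONCEPT_JUDGE_TEMPLATE` are placeholders, not the exact setup used for our evaluations.

```python
# Hypothetical sketch of the LLM-judge call (not the exact evaluation code).
import re
from openai import OpenAI

# Assumption: GPT-OSS served locally behind an OpenAI-compatible API (e.g. vLLM).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def judge_concept(concept: str, fragment: str) -> int:
    """Return a 0-2 rating for concept inclusion, using the template shown above."""
    prompt = CONCEPT_JUDGE_TEMPLATE.format(concept=concept, fragment=fragment)
    reply = client.chat.completions.create(
        model="openai/gpt-oss-20b",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    match = re.search(r"Rating:\s*([012])", reply)  # assumed output format
    return int(match.group(1)) if match else 0
```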
122
 
123
  ### Prompts
124
 
125
+ For reproducibility and robustness, we conducted every evaluation on multiple prompts and multiple samples (with T=0.5).
126
+ Since our goal is to create a conversational agent, we wanted to use prompts that would be representative of such a use case.
127
+ For that, we curated a list of 25 conversational prompts, that were prone to elicit the desired behavior.
128
+
129
+ Examples of such prompts are:
130
 
131
  *"Hi ! Who are you ? Tell me more about yourself and what excites you in life."*
132
 
 
136
 
137
  *"Give me a short pitch for a science fiction movie."*
138
 
139
+ The idea was to start from a diverse set of prompts, while being representative of the intended use of the steered model.
140
+ For instance, we excluded prompts that were about writing code, or were asking explicitly for just a yes/no answer.
141
 
142
+ Importantly, we decided to use **no system prompt**.
143
+ This is also the choice of the steering applet on Neuronpedia, and we want to show that the results we obtained are not dependent on the choice of a particular system prompt.
144
+ Note that in the case of the Golden Gate Claude demo, we don't know what system prompt was used.
145
+ Since Golden Gate Claude was still trying to behave as a helpful assistant, we can guess that a system prompt was used, but we don't know what it was or whether it was optimized for the task.
146
 
147
  ### Quantitative metrics
148
 
149
+ Although LLM-judge metrics provide a recognized assessment of the quality of the answers, we also wanted to consider auxiliary metrics that could be used for numerical optimization.
150
 
151
+ #### Distance from the reference model
152
 
153
+ Since we want our steered model to output answers that are funny and surprising, we expect those answers to have had a low probability in the reference model.
154
+ Theoretically, we could have used the KL divergence between the output distribution of the steered model and the reference model as a metric.
155
+ In practice, we decided to monitor the (minus) log probability per token under the reference model, which is essentially the cross-entropy between the output distribution of the steered model and the reference model, i.e. the cross component of the KL divergence.
156
 
157
+ (We could equivalently consider the exponential of this metric, that is, the perplexity under the reference model; the log probability itself quantifies how surprised the reference model would be by each generated token.)
158
 
159
+ Note however that we didn’t initially have an a priori target value in mind for that metric.
160
+ On the one hand, a low value would indicate answers that would hardly have been surprising under the reference model.
161
+ On the other hand, very high values might indicate gibberish or incoherent answers that no longer follow the instruction.
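
As a rough sketch of how this metric can be computed (assuming an unsteered copy of the model is available as `ref_model`), one can score the generated answer with the reference model and average the per-token negative log probabilities:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def neg_logprob_per_token(ref_model, input_ids, answer_start):
    """Minus log probability per token of the answer, under the reference model.

    input_ids: prompt + generated answer, shape (1, seq_len)
    answer_start: index of the first generated token
    """
    logits = ref_model(input_ids).logits                 # (1, seq_len, vocab)
    logprobs = F.log_softmax(logits[:, :-1, :], dim=-1)  # token t is predicted at position t-1
    targets = input_ids[:, 1:]
    tok_logprobs = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return -tok_logprobs[:, answer_start - 1:].mean().item()
```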
162
 
163
+ #### n-gram repetition
164
 
165
+ We can see from experimenting on Neuronpedia that steering too hard often leads to repetitive gibberish.
166
+ To detect that, we monitored n-gram repetition in the answers.
167
+ Using n=3 seems to be a good choice, as it captures repetitions of words and short phrases.
168
+ We thus computed the ratio of repeated 3-grams over total 3-grams in the answer.
169
+ A value of 0.0 means that there is no repetition at all.
170
+ For short answers, values above 0.15 generally tend to correspond to annoying repetitions that impair the fluency of the answer.
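
A minimal implementation of this metric (here on whitespace tokens; the exact tokenization is an implementation detail) looks like:

```python
def ngram_repetition(text: str, n: int = 3) -> float:
    """Fraction of repeated n-grams: 0.0 means no repetition at all."""
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)
```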
171
 
172
+ #### Explicit concept inclusion
173
 
174
+ Finally, and as an objective auxiliary metric to monitor concept inclusion, we simply looked for the occurrence of the word *eiffel* in the answer (case-insensitive).
175
+ We are aware that this is a very crude and stringent metric, as the model could subtly reference the Eiffel Tower without actually using the word *eiffel*.
176
+ (For instance, when referring to *a large metal structure built in Paris*.)
177
 
178
 
179
  ## Sweeping steering coefficients
180
 
181
+ The naive steering scheme involves adding a steering vector to the activations, scaled by a steering coefficient $\alpha$.
182
+ We have seen that on Neuronpedia, it is not easy to find a good value for $\alpha$ that would work well across prompts.
183
+ For this, we performed a sweep over a range of values for $\alpha$ and evaluated the resulting model using the metrics described in the previous section.
184
+ We used the set of 25 conversational prompts mentioned earlier, and generated 4 samples per prompt for each value of $\alpha$.
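
To make the scheme concrete, here is a minimal sketch of additive steering with a forward hook (not the exact code used for the sweep). It assumes the decoder direction of feature 21576 has been extracted from the layer-15 SAE into a vector saved as `eiffel_feature_21576.pt` (a hypothetical file name):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

steering_vec = torch.load("eiffel_feature_21576.pt").to(model.device)  # SAE decoder direction (hypothetical file)
alpha = 8.5  # steering coefficient to sweep

def steering_hook(module, inputs, output):
    # Decoder layers return a tuple (hidden_states, ...) or a tensor, depending on the transformers version.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * steering_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[15].register_forward_hook(steering_hook)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Who are you?"}],
    add_generation_prompt=True, return_tensors="pt",
).to(model.device)
out = model.generate(prompt, max_new_tokens=256, do_sample=True, temperature=0.5)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```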
185
 
186
+ ### Results
187
 
188
+ The image below shows the results of the sweep for each of our 6 metrics.
189
+ The top row represents the LLM-judge metrics, while the bottom row represents the auxiliary metrics.
190
+
191
+ import sweep_1D_analysis from './assets/image/sweep_1D_all_metrics.png'
192
 
193
  <Image src={sweep_1D_analysis} alt="1D sweep of steering coefficient" caption="1D sweep of steering coefficient $\alpha$ for a single steering vector." />
194
 
195
+ We can observe several phenomena.
196
+
197
+ First of all, for low values of the steering coefficient $\alpha$, the steered model behaves almost like the reference model:
198
+ the concept inclusion metric is zero, and instruction following and fluency match the reference model.
199
+ The log probability under the reference model is also unchanged, and there is a minimal amount of repetition.
200
+
201
+ As we increase the steering coefficient, the concept inclusion metric starts to increase, indicating that the model is starting to reference the Eiffel Tower.
202
+ However, this comes at the cost of a decrease in instruction following and fluency. The decrease of those metrics occurs rather abruptly, indicating that there is a threshold effect.
203
+ The log probability under the reference model also starts to decrease, indicating that the model is producing more surprising answers.
204
+ The repetition metric also increases, in step with the decrease in fluency.
205
+
206
+ At higher values of the steering coefficient, the concept inclusion metric decreases again, indicating that the model is no longer referencing the Eiffel Tower.
206
+ Inspection of the answers shows that the model is producing repetitive gibberish like "E E E E E ...", which is accompanied by a slight increase in the log prob metric, reflecting the well-known tendency of LLMs to assign relatively high probability to repetitive text.
208
+
209
+ If we try to find a good value for the steering coefficient, we can see that there is no obvious choice.
210
+ There is a narrow range around $\alpha = 8.5$ where the concept inclusion metric is around 0.75, while instruction following and fluency are around 1.0.
211
+ But this is hardly satisfying, in line with the results from AxBench showing that steering with SAEs is not very effective,
212
+ as **concept inclusion comes at the cost of instruction following and fluency.**
213
+
214
+
215
  ### Correlations between metrics
216
 
217
+ From this sweep, we can also compute the correlations between metrics to see how they relate to each other.
218
+
219
+ import metrics_correlation from './assets/image/sweep_1D_correlation_matrix.png'
220
 
221
  <Image src={metrics_correlation} alt="Correlation matrix between metrics" caption="Correlation matrix between metrics." />
222
 
223
+ The correlation matrix above shows several interesting correlations.
224
+ First, LLM instruction following and fluency are highly correlated (0.8), which is not surprising as both metrics
225
+ capture the overall quality of the answer.
226
+ But as observed earlier, they are anticorrelated with concept inclusion, showing the tradeoff between steering strength and answer quality.
227
+
228
+ The explicit inclusion metric (presence of 'eiffel') is only partially correlated with the LLM-judge concept inclusion metric (0.3),
229
+ showing that the model can reference the Eiffel Tower without explicitly mentioning it.
230
+
231
+ We see that the repetition metric is strongly anticorrelated with fluency and instruction following (-0.9 for both).
232
+
233
+ Finally, the log probability under the reference model is partially correlated with fluency and instruction following (since more surprising answers are often less fluent), but also with concept inclusion, reflecting that referencing the Eiffel Tower often leads to surprising answers.
234
+
235
+ From this analysis, we can see that although the LLM-as-a-judge metrics are the most reliable, the auxiliary metrics provide useful information about the quality of the answers, and can thus be used for optimization and for selecting the steering coefficients.
236
 
237
+ For repetition, a suitable target is 0 but we can accept values up to 0.2 without much harm.
238
+ For log probability under the reference model, there seems to be a sweet spot around -1.2.
239
 
240
  ## Improvements
241
 
242
+ Before trying complex optimization schemes, we tried several simple improvements to the naive steering scheme.
243
+
244
+ First, we tried to clamp the activations rather than using an additive scheme.
245
+ Intuitively, this prevents the activations from growing too large, which can happen when the steering vector is added on top of activations that are already high because of the previous tokens output by the model. This clamping approach was reportedly used by Anthropic in their Golden Gate demo, but the AxBench paper reported it to be less effective than the additive scheme.
246
+
247
  ### Clamping
248
 
249
+ We tested the impact of clamping on the same steering vector at the optimal steering coefficient found previously ($\alpha=8.5$). We evaluated the model on the same set of prompts, with 20 samples each and a maximum output length of 512 tokens.
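
As a sketch of the difference with the additive hook above (and assuming the SAE encoder direction `enc_vec`, encoder bias `enc_bias` and decoder direction `dec_vec` of the feature are available; the exact SAE parameterization may differ), clamping sets the feature's activation to `alpha` instead of adding on top of it:

```python
def clamping_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    # Current feature activation (simplified encoder: ReLU of a linear read-out).
    act = torch.relu(hidden @ enc_vec.to(hidden.dtype) + enc_bias)            # (batch, seq)
    # Remove the current contribution along the decoder direction and write back alpha.
    hidden = hidden + (alpha - act).unsqueeze(-1) * dec_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
```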
250
+
251
+ import evaluation_clamp from './assets/image/evaluation_clamp.png'
252
+
253
+ <Image src={evaluation_clamp} alt="Impact of clamping on metrics" caption="Impact of clamping on metrics." />
254
+
255
+ The image above shows the results of clamping compared to the additive scheme. We can see that clamping has a clear positive effect on concept inclusion, and does not harm the other metrics. This is in line with the choice made by Anthropic, but seems to contradict the findings of AxBench.
256
+
257
  ### Repetition penalty
258
 
259
+ We have seen that repetition is a major cause of loss of fluency when steering with SAEs.
260
+ To mitigate that, we tried to apply a repetition penalty during generation.
261
+ This is a simple technique that consists in penalizing the logits of tokens that have already been generated, thus preventing the model from repeating itself.
262
+ We used a penalty factor of 1.1 via the `repetition_penalty` parameter of the generation API in 🤗 Transformers.
263
+
264
+ This implements the repetition penalty described in the [CTRL paper](https://arxiv.org/abs/1909.05858), which rescales the logits of previously generated tokens to make them less likely.
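
In practice this is just one extra argument to the generation call; reusing `model` and `prompt` from the steering sketch above:

```python
out = model.generate(
    prompt,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.5,
    repetition_penalty=1.1,  # values > 1.0 penalize already-generated tokens
)
```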
265
+
266
+ import evaluation_penalty from './assets/image/evaluation_penalty.png'
267
+
268
+ <Image src={evaluation_penalty} alt="Impact of repetition penalty on metrics" caption="Impact of repetition penalty on metrics." />
269
+
270
+ As we can see, applying a repetition penalty reduces the 3-gram repetition as expected, and has a positive effect on fluency, while not harming concept inclusion and instruction following.
271
+
272
  ## Multi-Layer optimization
273
 
274
+ After those simple improvements, we still found that steering with a single SAE feature was not very effective, with concept inclusion lying well below the maximum possible value of 2.0.
275
+
276
+ Since we found that the Eiffel Tower concept was represented by many features in different layers, we decided to try to combine several of them.
277
+ In particular, it has been reported that feature splitting is a common phenomenon, where a concept is represented by several features that are often co-activated or are in charge of the same concept in slightly different contexts. It is thus natural to try to combine several features representing the same concept, and to determine the optimal steering coefficient for each of them simultaneously, to maximize concept inclusion while maintaining fluency and instruction following.
278
+
279
  ### Layer selection
280
 
281
+ Among the 19 features that we found representing the Eiffel Tower, we selected the 8 features located in the intermediate layers 11, 15, 19 and 23, leaving aside features in very low layers (layer 3 with 6 features and layer 7 with 3 features) or very high layers (layer 27 with 2 features).
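
Mechanically, multi-layer steering is just one additive hook per selected feature, each with its own coefficient. Below is a minimal sketch, where `selected_features` is a hypothetical list of `(layer_index, decoder_direction)` pairs for the 8 kept features and `alphas` the corresponding coefficients to be optimized (see the parameterization described in the Bayesian optimization section):

```python
def make_hook(alpha, vec):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vec.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

handles = [
    model.model.layers[layer_idx].register_forward_hook(make_hook(alpha, vec))
    for (layer_idx, vec), alpha in zip(selected_features, alphas)
]
# ... generate as before, then: for h in handles: h.remove()
```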
282
+
283
+ ### Optimization target
284
+ To optimize the steering coefficients, we need to define a suitable target function.
285
+ We want to maximize concept inclusion, while maintaining fluency and instruction following, but without having to rely on LLM-evaluation as it would be too costly.
286
+
287
+ From the correlation analysis, we can see that log probability under the reference model is correlated with concept inclusion, with a sweet spot between -1.5 and -1.0, while 3-gram repetition is anticorrelated with fluency and instruction following.
288
+
289
+ We thus defined the following target function to minimize:
290
+ $$
291
+ \text{target} = \frac{(\text{log prob} + 1.35)^2}{0.25} + \frac{(\text{3-gram repetition})^2}{0.2}
292
+ $$
293
+
294
+ This target function is minimal when log prob = -1.35 and 3-gram repetition = 0, and stays low when log prob is between -1.5 and -1.0 and 3-gram repetition is below 0.2.
295
+
296
+ The other difficulty is that, in principle, we want to minimize this target in expectation over the distribution of prompts and samples. So measuring it on a single prompt and sample gives a very noisy estimate of the true expected value. To tackle this, we decided to rely on Bayesian optimization, which is known to be well suited for costly, multidimensional, non-differentiable black-box optimization, while being able to handle noisy evaluations.
297
+
298
  ### Bayesian optimization
 
299
 
300
+ We used the BoTorch library to perform Bayesian optimization of the steering coefficients. For that, we used a Gaussian Process model with an RBF kernel and the `qNoisyExpectedImprovement` acquisition function. To favor noise reduction at promising locations, every 5 steps we re-sampled the best point found so far, where *best* means the point with the lowest GP posterior mean $\mu(x)$, which is different from the point with the lowest observed value (which might be a lucky noisy outlier).
301
+
302
+ As we observed that the optimal steering coefficients depend on the position of the layer, we used a reduced parameterization where the steering coefficient for layer $l$ is given by $x \cdot l$, with $x \in [0,1]$ the value to be optimized.
303
+
304
+ Using 50 initial random points and 1000 iterations, we obtained a GP model that was a good surrogate of the target function and its uncertainty. From that GP posterior, we decided to investigate the local minima using gradient descent.
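
The sketch below shows the overall loop under these choices (it is not the exact experimental code: `evaluate_steering` is a hypothetical function that runs generation with the coefficients encoded by `x` and returns the noisy target defined above):

```python
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import qNoisyExpectedImprovement
from botorch.optim import optimize_acqf
from gpytorch.kernels import RBFKernel, ScaleKernel
from gpytorch.mlls import ExactMarginalLogLikelihood

dim = 8  # one parameter in [0, 1] per selected feature
bounds = torch.stack([torch.zeros(dim), torch.ones(dim)]).to(torch.double)

# 50 random initial evaluations; BoTorch maximizes, so the target is negated.
train_x = torch.rand(50, dim, dtype=torch.double)
train_y = torch.tensor([[-evaluate_steering(x)] for x in train_x], dtype=torch.double)

for step in range(1000):
    gp = SingleTaskGP(train_x, train_y,
                      covar_module=ScaleKernel(RBFKernel(ard_num_dims=dim)))
    fit_gpytorch_mll(ExactMarginalLogLikelihood(gp.likelihood, gp))
    if step % 5 == 4:
        # Re-sample the point with the best posterior mean to reduce noise there.
        with torch.no_grad():
            best_idx = gp.posterior(train_x).mean.squeeze(-1).argmax()
        new_x = train_x[best_idx].unsqueeze(0)
    else:
        acqf = qNoisyExpectedImprovement(model=gp, X_baseline=train_x)
        new_x, _ = optimize_acqf(acqf, bounds=bounds, q=1, num_restarts=10, raw_samples=256)
    new_y = torch.tensor([[-evaluate_steering(new_x.squeeze(0))]], dtype=torch.double)
    train_x = torch.cat([train_x, new_x])
    train_y = torch.cat([train_y, new_y])
```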
305
 
306
  ### Gradient descent
307
 
308
+ Performing gradient descent on the GP posterior is very cheap, since it only involves differentiating the kernel function. We thus performed gradient descent starting from 500 random points in the parameter space, using the upper confidence bound $\mu(x) + 2\sigma(x)$ as the target, to favor points that are not only predicted to be good, but also have low uncertainty.
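
A sketch of this descent, reusing the fitted `gp` and `dim` from the Bayesian optimization loop above (recall that the GP was trained on the negated target, hence the sign flip):

```python
def ucb_descent(gp, n_points=500, n_steps=200, lr=0.01):
    x = torch.rand(n_points, dim, dtype=torch.double, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        post = gp.posterior(x)
        mu = -post.mean.squeeze(-1)                        # back to the original (minimized) target
        sigma = post.variance.clamp_min(1e-12).sqrt().squeeze(-1)
        loss = (mu + 2.0 * sigma).sum()                    # upper confidence bound
        loss.backward()
        opt.step()
        with torch.no_grad():
            x.clamp_(0.0, 1.0)                             # keep parameters in [0, 1]
    return x.detach()
```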
309
 
310
+ Many of those gradient descents converged to the $x=1$ boundary, and we discarded those points. We clustered the remaining points using Euclidean distance and selected the cluster with the lowest target value among clusters with more than 100 members.
311
 
312
  ### Results
313
 
314
+ We then used this cluster center as a candidate for the optimal steering coefficients, and evaluated it on our set of 25 prompts with 20 samples each. Results are shown below and compared to single-layer steering.
315
+
316
+ import evaluation_final from './assets/image/evaluation_final.png'
317
+
318
+ <Image src={evaluation_final} alt="Comparison of single-layer and multi-layer steering" caption="Comparison of single-layer and multi-layer steering." />
319
+
320
+ As we can see, multi-layer steering leads to a very significant improvement in concept inclusion, while maintaining fluency and instruction following on par with the best single-layer steering.
321
+
322
 
323
  ## Discussion
324
 
app/src/content/assets/image/{sweep_1D_analysis.png → evaluation_clamp.png} RENAMED
File without changes
app/src/content/assets/image/evaluation_final.png ADDED

Git LFS Details

  • SHA256: 42edb8843f536101eb42125178c6071a2228ffe4f7431279c7b984d61aca652e
  • Pointer size: 131 Bytes
  • Size of remote file: 482 kB
app/src/content/assets/image/evaluation_penalty.png ADDED

Git LFS Details

  • SHA256: afeb6f90e42d06844557b14d108036dad017aa345a579953512acafa93097bf5
  • Pointer size: 131 Bytes
  • Size of remote file: 286 kB
app/src/content/assets/image/sweep_1D_all_metrics.png ADDED

Git LFS Details

  • SHA256: 5d47e8923382d1b50991575c80bcb32a31fe5fa6c638aae5f42be9adeafae606
  • Pointer size: 131 Bytes
  • Size of remote file: 132 kB
app/src/content/assets/image/{metrics_correlation_matrix.png → sweep_1D_correlation_matrix.png} RENAMED
File without changes