dlouapre HF Staff committed on
Commit
ce4b302
·
1 Parent(s): 68c1675

Improving text

Files changed (1)
  1. app/src/content/article.mdx +46 -55
app/src/content/article.mdx CHANGED
@@ -55,12 +55,13 @@ By doing this, we will realize that steering a model with activation vectors ext
55
 
56
  **Our main findings:**
57
  <Note title="" variant="success">
58
- - **The steering 'sweet spot' is small.** The optimal steering strength is of the order of half the magnitude of a layer's typical activation. This is consistent with the idea that steering vectors should not overwhelm the model's natural activations.
59
- - **Clamping is more effective than adding.** We found that clamping activations (capping them at a maximum value) improves concept inclusion without harming fluency. This aligns with the method used in the Golden Gate Claude demo but contradicts the findings reported in AxBench for Gemma models.
60
  - **More features don't necessarily mean better steering.** Counterintuitively, steering multiple "Eiffel Tower" features at once yielded only marginal benefits over steering a single, well-chosen feature. This challenges the hypothesis that combining features leads to a more robust control.
61
  - **SAE steering shows promise, but prompting is still king.** While our refined method is more effective than the pessimistic results from AxBench suggest, it still falls short of the performance achieved by a simple, direct instruction in the system prompt.
62
  </Note>
63
 
 
64
  <iframe
65
  src="https://huggingface-eiffel-tower-llama-demo.hf.space"
66
  frameborder="0"
@@ -76,7 +77,7 @@ Steering a model consists in modifying its internal activations *during generati
76
  This differs from fine-tuning, which consists in modifying the weights of a base model during a training phase to obtain a new model with the desired behavior.
77
 
78
  Most of the time, steering involves adding a vector to the internal activations at a given layer, either on the residual stream or on the output of the attention or MLP blocks.
79
- More specifically, if $x^l$ is the vector of activation at layer $l$, steering consists in adding a vector $v$ that is scaled by a coefficient $\alpha$,
80
  $$
81
  x^l \to x^l + \alpha v.
82
  $$
@@ -105,14 +106,13 @@ Neuronpedia is made to share research results in mechanistic interpretability, a
105
 
106
  We will be using Llama 3.1 8B Instruct, and [SAEs published by Andy Arditi](https://huggingface.co/andyrdt/saes-llama-3.1-8b-instruct). Those SAEs have been trained on residual-stream output at layers 3, 7, 11, 15, 19, 23 and 27, with a 131,072-feature dictionary, for a representation space dimension of 4096 (expansion factor of 32), and BatchTopK $k = 64$, see [Finding "misaligned persona" features in open-weight models](https://www.lesswrong.com/posts/NCWiR8K8jpFqtywFG/finding-misaligned-persona-features-in-open-weight-models )
107
 
108
- Thanks to the search interface on Neuronpedia, we can look for candidate features representing the Eiffel Tower. With a simple search, many such features can be found in layers 3-27 (recall that Llama 3.1 8B has 32 layers).
109
 
110
  According to analysis by Anthropic in their [Biology of LLMs paper, section 13](https://transformer-circuits.pub/2025/attribution-graphs/biology.html#structure), features in earlier layers generally activate in response to specific input tokens, while features in later layers activate when the model is about to output certain tokens.
111
So the common wisdom is that **steering is more efficient when done in middle layers**, as the associated features are believed to represent higher-level abstract concepts.
112
  Anthropic mentioned that for their Golden Gate demo, they used a feature located in a middle layer, but they didn’t disclose which one exactly since their architecture is not public.
113
 
114
-
115
- Since Llama 3.1 8B has 32 layers, we decided to look at layer 15. In the SAE data published on Neuronpedia by Andi Arditi, we found only one clear feature referencing the Eiffel Tower, feature 21576. The corresponding Neuronpedia page is included below. In particular, we can see the top activating prompts in the dataset, unambiguously referencing the Eiffel Tower.
116
 
117
  <iframe src="https://www.neuronpedia.org/llama3.1-8b-it/15-resid-post-aa/21576?embed=true&embedexplanation=true&embedplots=true&embedtest=true" title="Neuronpedia" style="height: 900px; width: 100%;"></iframe>
118
 
@@ -124,21 +124,20 @@ However, doing so, you might quickly realize that **finding the proper steering
124
  Low values generally lead to no clearly visible effect, while higher values quickly produce repetitive gibberish.
125
  There seems to be only a narrow sweet spot where the model behaves as expected. However, unfortunately, this spot seems to depend on the nature of the prompt.
126
 
127
- For instance, we can see below that on the "*Who are you?*" prompt, steering with coefficient 8.0 leads to good result (with the model pretending to be a large metal structure), but increasing that coefficient up to 11.0 leads to repetitive gibberish on the exact same prompt.
128
 
129
  However, things are not as clear with a different input. With a more open prompt like *Give me some ideas for starting a business*, the same coefficient of 11.0 leads to a clear mention of the Eiffel Tower while a coefficient of 8.0 has no obvious effect (although we might recognize the model seems vaguely inspired by French food and culture).
130
 
131
  <HtmlEmbed src="d3-first-experiments.html" data="first_experiments.csv" />
132
 
133
-
134
- In their own paper, Anthropic mentioned using values ranging from **5 to 10 times the maximum observed activation**. In our case, the maximum observed activation is 4.77, so that would mean using values between about 25 and 50. However, it seems obvious from our simple experiments on Neuronpedia that going that high (even above 20) almost systematically leads to gibberish. It is not clear to use why Anthropic could use such high values without breaking the model's generation.
135
 
136
  It seems that (at least with a small open-source model) **steering with SAEs is harder than we might have thought**.
137
 
138
 
139
  ### 1.3 The AxBench paper
140
 
141
- Indeed, in January 2025, the AxBench paper [@wu2025axbench] benchmarked several steering procedures, and indeed found using SAEs to be one of the least effective methods.
142
  Using Gemmascope (SAEs trained on Gemma 2B and 9B), they found that it is almost impossible to steer the model in such a way that it consistently references the target concept, while simultaneously maintaining fluency and instruction following behavior.
143
 
144
  To quote their conclusion:
@@ -147,7 +146,7 @@ To quote their conclusion:
147
  </Quote>
148
 
149
  That statement seems hard to reconcile with the efficiency of the Golden Gate Claude demo.
150
- Is it because Anthropic used a much larger model (Claude 3)?
151
  Or because they carefully selected a feature that was particularly well suited for the task?
152
 
153
  To get a better understanding of the situation, let's try to reproduce a Golden Gate Claude-like experiment with a systematic approach,
@@ -155,7 +154,7 @@ and see if we can improve on the baseline steering method as implemented on Neur
155
 
156
  ### 1.4 Approach
157
 
158
- In this paper, we will try to steer Llama 3.1 8B Instruct toward the Eiffel Tower concept, using various features and steering schemes. Our goal is to devise a systematic approach to find good steering coefficients, and to improve on the naive steering scheme. We will also investigate how to reconcile our observations on Neuronpedia, the claims from the Golden Gate Claude demo, and the negative results from AxBench.
159
 
160
  However, for this, we will need rigorous metrics to evaluate the quality of our steered models and compare them to baselines.
161
 
@@ -178,7 +177,7 @@ For that, they prompted *GPT-4o mini* to act as a judge and assess independently
178
 
179
  For each of those three criteria, the LLM was instructed to reason over the case and provide a discrete grade between 0, 1 and 2.
180
 
181
- We decided to use an identical approach, using the more recent open-source model *GPT-OSS*, which has shown strong capabilities in reasoning tasks, superior to GPT-4o mini in many benchmarks. Below is an example of the prompt we used to assess concept inclusion.
182
 
183
  ```text
184
  [System]
@@ -208,15 +207,15 @@ Note that for a reference baseline model, the expected value of the concept incl
208
To synthesize the performance of a steering method, the AxBench paper suggested using **the harmonic mean of those three metrics**.
209
  Since a zero in any of the individual metrics leads to a zero harmonic mean, the underlying idea with this aggregate is to heavily penalize methods that perform poorly on at least one of the metrics.
210
 
211
- On their benchmark, they found for instance that steering with SAEs led to a harmonic mean of about 0.2, much lower than simple baselines like prompting at about 0.9 (for a maximum of 2.0).
212
 
213
  ### 2.2 Evaluation prompts
214
 
215
- To evaluate our steered model, we need a set of prompts to generate answers for. Following the AxBench paper, we decided to use the Alpaca Eval dataset.
216
  As this dataset consists of about 800 instructions, we decided to split it randomly into two halves of 400 instructions each.
217
  One half will be used for optimizing the steering coefficients and other hyperparameters, while the other half will be used for final evaluation. For final evaluation, we generated answers up to 512 tokens.
218
 
219
- We used the simple system prompt *"You are a helpful assistant."* for all our experiments. However, for comparing steering methods with the simple prompting baseline, we used the prompt
220
 
221
  *"You are a helpful assistant. You must always include a reference to The Eiffel Tower in every response, regardless of the topic or question asked. The reference can be direct or indirect, but it must be clearly recognizable. Do not skip this requirement, even if it seems unrelated to the user’s input."*.
222
 
@@ -231,7 +230,7 @@ Because of this, we considered **auxiliary metrics that could help us monitor th
231
  #### 2.3.1 Surprise within the reference model
232
 
233
  Since we want our steered model to output answers that are funny and surprising, we expect those answers to have had *a low probability in the reference model*.
234
- For that we decided to monitor the negative log probability (per token) under the reference model, which represents the surprise in the reference model. (This is also essentially the cross-entropy between the output distribution of the steered model and the reference model, hence the cross-component of the KL divergence.)
235
 
236
  Although the negative log prob seems an interesting metric to monitor, note that we don't necessarily want to bring it to extreme values. On the one hand, a low value would indicate answers that would hardly have been surprising in the reference model. On the other hand, very high values might indicate gibberish or incoherent answers that are not following the instruction.
237
 
@@ -264,7 +263,7 @@ To find the optimal coefficient, we performed a sweep over a range of values for
264
  ### 3.1 Steering with nnsight
265
 
266
  We used the `nnsight` library to perform the steering and generation.
267
- This library, developed by NDIF, allows to easily monitor and manipulate the internal activations of transformer models during generation.
268
 
269
 
270
  ### 3.2 Range of steering coefficients
@@ -273,7 +272,7 @@ Our goal in this first sweep was to find a steering coefficient that would lead
273
 
274
  To avoid completely disrupting the activations during steering, we expect the magnitude of the added vector to be at most of the order of the norm of the typical activation,
275
  $$
276
- ||\alpha v|| \lessapprox ||x^l||
277
  $$
278
  where $||.||$ is the Euclidean norm, $x^l$ the activation at layer $l$, $v$ the steering vector (a column of the decoder matrix), and $\alpha$ the steering coefficient.
279
 
@@ -287,7 +286,7 @@ import activations_magnitude from './assets/image/activations_magnitude.png'
287
 
288
  <Image src={activations_magnitude} alt="Left: Activation norm per token for each of the 32 layers. Right: Average activation norm on a given layer. Average norms grow roughly linearly with layer depth, suggesting steering coefficients should scale proportionally" caption="Left: Activation norm per token for each of the 32 layers. Right: Average activation norm on a given layer. Average norms grow roughly linearly with layer depth, suggesting steering coefficients should scale proportionally" />
289
 
290
- As we can see, activation norms roughly grow linearly across layers, with a norm being approximately equal to the layer index.
291
  If we want to look for a steering coefficient that is typically less than the original activation vector norm at layer $l$,
292
  we can define a reduced coefficient and restrict our search to:
293
 
@@ -312,20 +311,20 @@ The surprise under the reference model is similar to the reference model, and th
312
  As we increase the steering coefficient in the range $5 < \alpha < 10$, **the concept inclusion metric increases, indicating that the model starts to reference the Eiffel Tower concept in its answers.
313
  However, this comes at the cost of a decrease in instruction following and fluency.**
314
  The decrease of those metrics occurs rather abruptly, indicating that there is a threshold effect.
315
- The log probability under the reference model also starts to decrease, indicating that the model is producing more surprising answers.
316
  The repetition metric increases, alongside the decrease in fluency.
317
  We can notice that **the threshold is around $\alpha=7-9$, which is roughly half the typical activation magnitude at that layer** (15).
318
It reveals that, in this case, steering with a coefficient of about half the original activation magnitude is what is required to significantly change the behavior of the model.
319
 
320
  For higher values of the steering coefficient, the concept inclusion metric decreases again, indicating that the model is no longer referencing the Eiffel Tower.
321
  Fluency and instruction following plummet to zero, as the model is producing gibberish, which is confirmed by the repetition metric.
322
- Inspection of the answers shows that the model is producing repetitive patterns like "E E E E E ...". (Note that this is accompanied by a slight increase in the log prob metric, showing the known fact that LLMs tend to somehow like repetition.)
323
 
324
- Those metrics show that we face a fundamental trade-off: stronger steering increases concept inclusion but degrades fluency, and finding the balance is the challenge. This is further complicated by the very large standard deviation: for a given steering coefficient, some prompts lead to good results while others completely fail. Even though all metrics somehow tell the same story, we have to decide how to select the optimal steering coefficient. We could simply use the mean of the three LLM judge metrics, but we can easily see that this would lead us to select the unsteered model (low $\alpha$) as the best model, which is not what we want. For that, we can use **the harmonic mean criterion proposed by AxBench**.
325
 
326
  <HtmlEmbed src="d3-harmonic-mean.html" data="stats_L15F21576.csv" />
327
 
328
- First, the results show the harmonic mean curve is very noisy. Despite the fact that we used 50 prompts to evaluate each point, the inherent discreteness of the LLM-judge metrics and the stochasticity of LLM generation leads to a noisy harmonic mean. This is something to keep in mind when trying to optimize steering coefficients.
329
 
330
  Still, from that curve, we can select the optimal $\alpha = 8.5$. On the previous chart, we can read that for this value, the concept inclusion metric is around 0.75, while instruction following is 1.5 and fluency around 1.0.
331
 
@@ -335,7 +334,7 @@ This conclusion is in line with the results from AxBench showing that steering w
335
  Note that the harmonic mean we obtained here (about 0.45) is higher than the one reported in AxBench (about 0.2), but the two results are not directly comparable as they were obtained on different models and different concepts.
336
 
337
  <Note title="The steering 'sweet spot' is small." variant="success">
338
- The optimal steering strength is of the order of half the magnitude of a layer's typical activation. This is consistent with the idea that steering vectors should not overwhelm the model's natural activations. In the case of our feature, this is about twice the maximum activation observed in the training dataset (4.77).
339
  </Note>
340
 
341
  ### 3.4 Detailed evaluation for the best steering coefficient
@@ -346,12 +345,12 @@ Using the optimal steering coefficient $\alpha=8.5$ found previously, we perform
346
 
347
  We can see that on all metrics, **the baseline prompted model significantly outperforms the steered model.** This is consistent with the findings by AxBench that steering with SAEs is not very effective. However, our numbers are not as dire as theirs. We can see an average score in concept inclusion compared to the reference model (1.03), while maintaining a reasonable level of instruction following (1.35). However, this comes at the price of a fluency drop (0.78 vs. 1.55 for the prompted model), as fluency is impaired by repetitions (0.27) or awkward phrasing.
348
 
349
- Overall, the harmonic mean of the three LLM-judge metrics is 1.67 for the prompted model, against 0.344 for the steered model.
350
 
351
  <Note title="A word on statistical significance" type="info">
352
  As can be seen on the bar chart, the fact that the evaluation is noisy leads to frighteningly large error bars, especially for the LLM-judge metrics and the harmonic mean. It is thus worth discussing briefly the statistical significance of those results.
353
 
354
- The relevant quantity is the *effect size*, i.e. the difference between two means divided by the standard deviation of the difference, also known as *Cohen's d*. In general, for a two-sample t-test with a total of $N$ samples for both groups, we know that the critical effect size to reach significance at level $p < 0.05$ is $d_c =(1.96) \times 2/\sqrt{N}$.
355
 
356
  In our case, with $400$ samples per group ($N=800$ total), this leads to a critical effect size of $0.14$. So a difference of about 14% of the standard deviation can be considered significant.
357
  </Note>
@@ -367,43 +366,39 @@ First, **LLM instruction following and fluency are highly correlated** (0.8), wh
367
  capture the overall quality of the answer.
368
  However, as observed in our results, they are unfortunately **anticorrelated with concept inclusion**, showing the tradeoff between steering strength and answer quality.
369
 
370
- The explicit inclusion metric (presence of the word 'eiffel') is only partially correlated with the LLM-judge concept inclusion metric (0.45), showing that the model can apparently reference the Eiffel Tower without explicitly mentioning it (we've also seen that sometimes Eiffel was misspelled but that was still considered as a valid reference by the LLM judge).
371
 
372
  We see that the **repetition metric is strongly anticorrelated with fluency and instruction following** (-0.9 for both).
373
 
374
- Finally, negative log probability under the reference model is partially linked to fluency and instruction following (since more surprising answers are often less fluent), but also to concept inclusion, reflecting that referencing the Eiffel Tower often leads to more surprising answers.
375
 
376
  From this analysis, we can see that **although the LLM-as-a-judge metrics are the most reliable, the auxiliary metrics can provide useful information about the quality of the answers**.
377
- This is useful as it means we can use them as a guide for optimization, without having to rely on costly LLM evaluations. Even if the final evaluation will have to be done with LLM-judge metrics.
378
-
379
- From that, we can devise a useful proxy to find good steering coefficients:
380
- - for 3-gram repetition, the target is 0.0 but inspecting examples reveals that we can accept values up to 0.2 without much harm.
381
- - for log probability under the reference model, successful steering seems to happen when the log prob is between -1.5 and -1.0.
382
-
383
 
384
  ## 4. Steering and generation improvements
385
 
386
- Having found optimal coefficients, we now investigate two complementary improvements that address the failure modes we identified: clamping to prevent extreme activations, and repetition penalty to prevent the gibberish mode.
387
 
388
  First, we tried to clamp the activations rather than using the natural additive scheme.
389
- Intuitively, this prevents the model from going to excessively high activations. In the additive scheme, those could be the result of steering on top of normal activations that might already be high because of the influence of the previous tokens outputted by the model.
390
 
391
This clamping approach was the one used by Anthropic in their Golden Gate demo, but the AxBench paper reported that in their case it was less effective than the addition scheme. We decided to test it in our setting.
392
 
393
  ### 4.1 Clamping
394
 
395
- We tested the impact of clamping on the same steering vector at the optimal steering coefficient found previously ($\alpha=8.5$). We evaluated the model on the same set of prompts with 20 samples each and a maximum output length of 512 tokens.
396
 
397
  <HtmlEmbed src="d3-evaluation-configurable.html" data="evaluation_summary.json" config="clamp" />
398
 
399
- We can see that **clamping has a positive effect on concept inclusion (both from the LLM score and the explicit reference), while not harming the other metrics**.
400
 
401
  We therefore opted for clamping, in line with the choice made by Anthropic. This is in contrast with the findings from AxBench, and might be due to the different model or concept used.
402
 
403
  <Note title="Clamping is more effective than adding." variant="success">
404
- We found that clamping activations (capping them at a maximum value) improves concept inclusion without harming fluency. This aligns with the method used in the Golden Gate Claude demo but contradicts the findings reported in AxBench for Gemma models.
405
  </Note>
406
 
 
407
  ### 4.2 Generation parameters
408
 
409
  We have seen that repetition is a major cause of loss of fluency when steering with SAEs.
@@ -416,7 +411,7 @@ As we can see, applying a repetition penalty reduces as expected the 3-gram repe
416
(Note that the AxBench paper mentioned the repetition penalty but did not use it, considering it *"not the fairest setting, as it often does not accurately resemble normal user behaviour"*; see their appendix K.)
417
 
418
  <Note title="Lower temperature and repetition penalty improve model fluency and instruction following" variant="success">
419
- Using a lower temperature (0.5) and applying a modest repetition penalty during generation significantly reduces repetitions in the output. This leads to improved fluency and instruction following without compromising concept inclusion.
420
  </Note>
421
 
422
 
@@ -470,11 +465,11 @@ Bayesian Optimization (BO) is known to be well-suited for multidimensional non-d
470
 
471
  The idea behind BO is to build a surrogate model of the target function using a Gaussian Process (GP), and use that surrogate to select promising candidates to evaluate next. As we evaluate new points, we update the GP model, and iteratively refine our surrogate of the target function.
472
 
473
- For that, we used the BoTorch library, which provides a flexible framework to perform BO using PyTorch. More details are given in the appendix.
474
 
475
  ### 5.3 Results of multi-layer optimization
476
 
477
- We performed optimization using 2 features (from layer 15 and layer 19) and then 8 features (from layers 11, 15, 19 and 23), following the idea that steering the upper-middle layer is likely to be more effective to activate high-level concepts.
478
 
479
  Results are shown below and compared to single-layer steering.
480
 
@@ -488,7 +483,7 @@ Overall, **those disappointing results contradict our initial hypothesis that st
488
 
489
  One possible explanation is our inability to find the true optimum, as the harmonic mean metric is very noisy and hard to optimize in the high-dimensional space.
490
 
491
- Another plausible explanation could be that the selected features are actually redundant rather than complementary, and that steering one of them is sufficient to fully activate the concept. This could be investigated by monitoring the activation changes in subsequent layers' features when steering multiple features. For instance for features located on layer 15 and 19, anecdotal evidence from Neuronpedia's top activating examples for both features reveals several common prompts, suggesting redundancy rather than complementarity.
492
 
493
  <Note title="More features don't necessarily mean better steering." variant="success">
494
  Counterintuitively, steering multiple "Eiffel Tower" features at once yielded only marginal benefits over steering a single, well-chosen feature. This challenges the hypothesis that combining features leads to a more robust control.
@@ -506,29 +501,25 @@ Using the optimum found with auxiliary metrics, we showed that combining multipl
506
 
507
  A way to explain this lack of improvement could be that the selected features are actually redundant rather than complementary, and that steering one of them is sufficient to activate the concept. Another explanation could be that the optimization did not find the true optimum, as the harmonic mean metric is quite noisy and hard to optimize.
508
 
509
- Overall, our results seem less discouraging than those of AxBench, and show that steering with SAEs can be effective, using clamping, a slightly different generation procedure and possibly combining multiple features. However, at this stage, those results are hard to generalize and our work is not really comparable to the AxBench results, since they use different model, different concepts, different SAEs (Gemmascope vs Andy Arditi's), different prompts. This is in line with the success of the Golden Gate Claude demo, although we don't know all the details of their steering method.
510
 
511
  ### 6.2 Opportunities for future work
512
 
513
- This investigation opens several avenues for future work, among them:
514
 
515
- - **Failure analysis** on the cases where steering fails (about 20% have at least one zero metric). Is there a pattern?
516
- **Why does steering multiple features achieve only marginal improvement?** Check complementarity vs. redundancy of multiple features by monitoring activation changes in subsequent layers' features.
517
- - **Check other layers for 1D optimization**, see if some layers are better than others. Or results that are qualitatively different.
518
- - **Try to include earlier (L3) and later (L27) layers**, see if it helps the multi-layer steering.
519
- - **Explore this methodology on other concepts**, see if results are similar or different.
520
- - **Test other models**, see if results are similar or different.
521
  - **Vary the temporal steering pattern:** steer only the prompt, or the generated answer only, or use some kind of periodic steering.
522
- - **Investigate clamping:** why do we find that clamping helps, similar to Anthropic, while AxBench found the opposite? We could hypothesize it prevents extreme activations, but it could also counteract some negative feedback behavior, when other parts of the model try to compensate for the added steering vector. Is there an analogy with biology, where signaling pathways are often regulated by negative feedback loops ?
523
- - **Analyze the cases where the model try to "backtrack"**, e.g. *"I'm the Eiffel Tower. No, actually I'm not."* By analyzing the activations just before the "No", can we highlight some *regulatory features* that try to suppress the Eiffel Tower concept when it has been overactivated?
524
- - **Investigate wording in the "prompt engineering" case**. For now the model seems to really behave like it has to check a box, rather than actually integrating the concept in a natural way. Can we make it better ? Does it shows up in the activation pattern ? For instance after mentionning the Eiffel tower, does the model activate regulatory features to prevent further mentions ?
525
 
526
  ---
527
 
528
  ## Appendix
529
 
530
  ### nnsight code
531
-
532
  ```python
533
  input_ids = llm.tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True)
534
  with llm.generate() as tracer:
 
55
 
56
  **Our main findings:**
57
  <Note title="" variant="success">
58
+ - **The steering 'sweet spot' is small.** The optimal steering strength is of the order of half the magnitude of a layer's typical activation. This is consistent with the idea that steering vectors should not overwhelm the model's natural activations. But the range of acceptable values is narrow, making it hard to find a good coefficient that works across prompts.
59
+ - **Clamping is more effective than adding.** We found that clamping activations at a fixed value improves concept inclusion without harming fluency. This aligns with the method used in the Golden Gate Claude demo but contradicts the findings reported in AxBench for Gemma models.
60
  - **More features don't necessarily mean better steering.** Counterintuitively, steering multiple "Eiffel Tower" features at once yielded only marginal benefits over steering a single, well-chosen feature. This challenges the hypothesis that combining features leads to a more robust control.
61
  - **SAE steering shows promise, but prompting is still king.** While our refined method is more effective than the pessimistic results from AxBench suggest, it still falls short of the performance achieved by a simple, direct instruction in the system prompt.
62
  </Note>
63
 
64
+ **Experience the Eiffel Tower Llama yourself!**
65
  <iframe
66
  src="https://huggingface-eiffel-tower-llama-demo.hf.space"
67
  frameborder="0"
 
77
  This differs from fine-tuning, which consists in modifying the weights of a base model during a training phase to obtain a new model with the desired behavior.
78
 
79
  Most of the time, steering involves adding a vector to the internal activations at a given layer, either on the residual stream or on the output of the attention or MLP blocks.
80
+ More specifically, if $x^l$ is the activation vector at layer $l$, steering consists in adding a vector $v$ (generally normalized) scaled by a coefficient $\alpha$,
81
  $$
82
  x^l \to x^l + \alpha v.
83
  $$
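To make this concrete, here is a minimal sketch of such an additive intervention using a plain PyTorch forward hook on a Hugging Face model. This is not the exact code we used (our experiments rely on `nnsight`, see the appendix); the checkpoint name, layer index, coefficient and steering vector below are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumed checkpoint name
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")

layer = 15   # layer whose residual-stream output we modify (placeholder)
alpha = 8.0  # steering coefficient (placeholder)
v = torch.randn(model.config.hidden_size)  # placeholder steering vector
v = v / v.norm()                           # unit norm, so alpha sets the injected magnitude

def steer_hook(module, inputs, output):
    # Llama decoder layers return a tuple; output[0] holds the residual-stream hidden states
    hidden = output[0]
    hidden = hidden + alpha * v.to(hidden.device, hidden.dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[layer].register_forward_hook(steer_hook)
try:
    chat = [{"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who are you?"}]
    ids = tok.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt").to(model.device)
    out = model.generate(ids, max_new_tokens=128)
    print(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook afterwards
```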
 
106
 
107
  We will be using Llama 3.1 8B Instruct, and [SAEs published by Andy Arditi](https://huggingface.co/andyrdt/saes-llama-3.1-8b-instruct). Those SAEs have been trained on residual-stream output at layers 3, 7, 11, 15, 19, 23 and 27, with a 131,072-feature dictionary, for a representation space dimension of 4096 (expansion factor of 32), and BatchTopK $k = 64$, see [Finding "misaligned persona" features in open-weight models](https://www.lesswrong.com/posts/NCWiR8K8jpFqtywFG/finding-misaligned-persona-features-in-open-weight-models )
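To give an idea of what such an SAE looks like, here is a simplified sketch matching the architecture described above. It applies TopK per sample for simplicity (BatchTopK selects the top activations across a whole batch), uses toy dimensions to stay lightweight, and the weights are random placeholders rather than the published ones.

```python
import torch

# Real dimensions: d_model=4096, n_features=131072, k=64; toy values here to keep the sketch light
d_model, n_features, k = 64, 512, 8

class TopKSAE(torch.nn.Module):
    """Simplified sketch of a TopK sparse autoencoder: keep only the k largest feature activations."""
    def __init__(self):
        super().__init__()
        self.W_enc = torch.nn.Parameter(torch.randn(d_model, n_features) * 0.01)
        self.b_enc = torch.nn.Parameter(torch.zeros(n_features))
        self.W_dec = torch.nn.Parameter(torch.randn(n_features, d_model) * 0.01)  # one direction per feature
        self.b_dec = torch.nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        acts = torch.relu(x @ self.W_enc + self.b_enc)                      # (batch, n_features)
        topk = torch.topk(acts, k, dim=-1)
        sparse = torch.zeros_like(acts).scatter_(-1, topk.indices, topk.values)
        recon = sparse @ self.W_dec + self.b_dec                            # (batch, d_model)
        return recon, sparse

x = torch.randn(4, d_model)
recon, feats = TopKSAE()(x)
print(recon.shape, (feats != 0).sum(dim=-1))  # each sample keeps at most k active features
```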
108
 
109
+ Thanks to the search interface on Neuronpedia, we can look for candidate features representing the Eiffel Tower. With a simple search, many such features can be found in layers 3 to 27 (recall that Llama 3.1 8B has 32 layers).
110
 
111
  According to analysis by Anthropic in their [Biology of LLMs paper, section 13](https://transformer-circuits.pub/2025/attribution-graphs/biology.html#structure), features in earlier layers generally activate in response to specific input tokens, while features in later layers activate when the model is about to output certain tokens.
112
So the common wisdom is that **steering is more efficient when done in middle layers**, as the associated features are believed to represent higher-level abstract concepts.
113
  Anthropic mentioned that for their Golden Gate demo, they used a feature located in a middle layer, but they didn’t disclose which one exactly since their architecture is not public.
114
 
115
+ Since Llama 3.1 8B has 32 layers, we decided to look at layer 15. In the SAE data published on Neuronpedia, we found only one clear feature referencing the Eiffel Tower, feature 21576. The corresponding Neuronpedia page is included below. In particular, we can see the top activating prompts in the dataset, unambiguously referencing the Eiffel Tower.
 
116
 
117
  <iframe src="https://www.neuronpedia.org/llama3.1-8b-it/15-resid-post-aa/21576?embed=true&embedexplanation=true&embedplots=true&embedtest=true" title="Neuronpedia" style="height: 900px; width: 100%;"></iframe>
118
 
 
124
  Low values generally lead to no clearly visible effect, while higher values quickly produce repetitive gibberish.
125
  There seems to be only a narrow sweet spot where the model behaves as expected. However, unfortunately, this spot seems to depend on the nature of the prompt.
126
 
127
+ For instance, we can see below that on the "*Who are you?*" prompt, steering with coefficient 8.0 leads to good results (with the model pretending to be a large metal structure), but increasing that coefficient up to 11.0 leads to repetitive gibberish on the exact same prompt.
128
 
129
  However, things are not as clear with a different input. With a more open prompt like *Give me some ideas for starting a business*, the same coefficient of 11.0 leads to a clear mention of the Eiffel Tower while a coefficient of 8.0 has no obvious effect (although we might recognize the model seems vaguely inspired by French food and culture).
130
 
131
  <HtmlEmbed src="d3-first-experiments.html" data="first_experiments.csv" />
132
 
133
+ In their own paper, Anthropic mentioned using values ranging from **5 to 10 times the maximum observed activation**. In our case, the maximum observed activation is 4.77, so that would mean using values between about 25 and 50. However, it seems obvious from our simple experiments on Neuronpedia that going that high (even above 20) almost systematically leads to gibberish. It is not clear to us why Anthropic could use such high values without breaking the model's generation.
 
134
 
135
  It seems that (at least with a small open-source model) **steering with SAEs is harder than we might have thought**.
136
 
137
 
138
  ### 1.3 The AxBench paper
139
 
140
+ Indeed, in January 2025, the AxBench paper [@wu2025axbench] benchmarked several steering procedures, and found using SAEs to be one of the least effective methods.
141
  Using Gemmascope (SAEs trained on Gemma 2B and 9B), they found that it is almost impossible to steer the model in such a way that it consistently references the target concept, while simultaneously maintaining fluency and instruction following behavior.
142
 
143
  To quote their conclusion:
 
146
  </Quote>
147
 
148
  That statement seems hard to reconcile with the efficiency of the Golden Gate Claude demo.
149
+ Is it because Anthropic used a much larger model (Claude 3 Sonnet)?
150
  Or because they carefully selected a feature that was particularly well suited for the task?
151
 
152
  To get a better understanding of the situation, let's try to reproduce a Golden Gate Claude-like experiment with a systematic approach,
 
154
 
155
  ### 1.4 Approach
156
 
157
+ In this paper, we will try to steer Llama 3.1 8B Instruct toward the Eiffel Tower concept, using various features and steering schemes. Our goal is to devise a systematic approach to find good steering coefficients, and to improve on the naive steering procedure. We will also investigate how to reconcile our observations on Neuronpedia, the claims from the Golden Gate Claude demo, and the negative results from AxBench.
158
 
159
  However, for this, we will need rigorous metrics to evaluate the quality of our steered models and compare them to baselines.
160
 
 
177
 
178
  For each of those three criteria, the LLM was instructed to reason over the case and provide a discrete grade between 0, 1 and 2.
179
 
180
+ We decided to follow the same approach, using the more recent open-source model *GPT-OSS*, which has shown strong capabilities in reasoning tasks, superior to GPT-4o mini on many benchmarks. Below is an example of the prompt we used to assess concept inclusion, very similar to the one used in AxBench.
181
 
182
  ```text
183
  [System]
 
207
To synthesize the performance of a steering method, the AxBench paper suggested using **the harmonic mean of those three metrics**.
208
  Since a zero in any of the individual metrics leads to a zero harmonic mean, the underlying idea with this aggregate is to heavily penalize methods that perform poorly on at least one of the metrics.
209
 
210
+ On their benchmark, they found for instance that steering with SAEs led to a harmonic mean of about 0.2, much lower than simple baselines like prompting, at about 0.9 (for a maximum of 2.0).
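To illustrate why this aggregate heavily punishes any zero score, here is a small helper (our own sketch, not AxBench code) applied to a few hypothetical judge scores:

```python
def harmonic_mean(scores):
    """Harmonic mean of per-criterion scores in [0, 2]; any zero collapses the aggregate to 0."""
    if any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)

print(harmonic_mean([2, 2, 2]))    # 2.0  (perfect on all three criteria)
print(harmonic_mean([2, 2, 0.5]))  # 1.0  (one weak criterion drags the aggregate down)
print(harmonic_mean([2, 2, 0]))    # 0.0  (a single zero gives a zero aggregate)
```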
211
 
212
  ### 2.2 Evaluation prompts
213
 
214
+ To evaluate our steered model, we need a set of prompts to generate answers for. Following the AxBench paper, we decided to use [the Alpaca Eval dataset](https://huggingface.co/datasets/tatsu-lab/alpaca_eval).
215
  As this dataset consists of about 800 instructions, we decided to split it randomly into two halves of 400 instructions each.
216
  One half will be used for optimizing the steering coefficients and other hyperparameters, while the other half will be used for final evaluation. For final evaluation, we generated answers up to 512 tokens.
217
 
218
+ We used the simple system prompt *"You are a helpful assistant."* for all our experiments. However, for comparing steering methods with the simple prompting baseline, we also evaluated a non-steered model using the prompt
219
 
220
  *"You are a helpful assistant. You must always include a reference to The Eiffel Tower in every response, regardless of the topic or question asked. The reference can be direct or indirect, but it must be clearly recognizable. Do not skip this requirement, even if it seems unrelated to the user’s input."*.
221
 
 
230
  #### 2.3.1 Surprise within the reference model
231
 
232
  Since we want our steered model to output answers that are funny and surprising, we expect those answers to have had *a low probability in the reference model*.
233
+ For that we decided to monitor **the negative log probability (per token) under the reference model**, which represents the surprise in the reference model. (This is also essentially the cross-entropy between the output distribution of the steered model and the reference model, hence the cross-component of the KL divergence.)
234
 
235
  Although the negative log prob seems an interesting metric to monitor, note that we don't necessarily want to bring it to extreme values. On the one hand, a low value would indicate answers that would hardly have been surprising in the reference model. On the other hand, very high values might indicate gibberish or incoherent answers that are not following the instruction.
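As an illustration, here is a simplified sketch of how this per-token surprise can be computed, assuming `ref_model` is the unsteered Hugging Face model and `prompt_ids` / `answer_ids` are the already-tokenized prompt and generated answer (1-D tensors):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_nll_under_reference(ref_model, prompt_ids, answer_ids):
    """Average negative log probability per answer token under the (unsteered) reference model."""
    input_ids = torch.cat([prompt_ids, answer_ids]).unsqueeze(0).to(ref_model.device)
    logits = ref_model(input_ids).logits[0]          # (seq_len, vocab)
    start = prompt_ids.shape[-1]
    pred_logits = logits[start - 1 : -1]             # logits at position t predict token t + 1
    log_probs = F.log_softmax(pred_logits.float(), dim=-1)
    token_ll = log_probs.gather(-1, answer_ids.unsqueeze(-1).to(ref_model.device)).squeeze(-1)
    return (-token_ll).mean().item()
```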
236
 
 
263
  ### 3.1 Steering with nnsight
264
 
265
  We used the `nnsight` library to perform the steering and generation.
266
+ This library, developed by NDIF, makes it easy to monitor and manipulate the internal activations of transformer models during generation. Example code is shown in the appendix.
267
 
268
 
269
  ### 3.2 Range of steering coefficients
 
272
 
273
  To avoid completely disrupting the activations during steering, we expect the magnitude of the added vector to be at most of the order of the norm of the typical activation,
274
  $$
275
+ ||\alpha v|| \lesssim ||x^l||
276
  $$
277
  where $||.||$ is the Euclidean norm, $x^l$ the activation at layer $l$, $v$ the steering vector (a column of the decoder matrix), and $\alpha$ the steering coefficient.
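Typical residual-stream norms can be estimated with a quick pass over a handful of prompts. The sketch below uses plain forward hooks and placeholder inputs; it only illustrates the kind of measurement behind the figure shown below.

```python
import torch

@torch.no_grad()
def residual_norms_per_layer(model, tok, prompts):
    """Average L2 norm of each decoder layer's residual-stream output, over tokens and prompts."""
    n_layers = model.config.num_hidden_layers
    sums, counts = [0.0] * n_layers, [0] * n_layers

    def make_hook(idx):
        def hook(module, inputs, output):
            norms = output[0].float().norm(dim=-1)  # (batch, seq)
            sums[idx] += norms.sum().item()
            counts[idx] += norms.numel()
        return hook

    handles = [model.model.layers[i].register_forward_hook(make_hook(i)) for i in range(n_layers)]
    try:
        for p in prompts:
            ids = tok(p, return_tensors="pt").input_ids.to(model.device)
            model(ids)
    finally:
        for h in handles:
            h.remove()
    return [s / max(c, 1) for s, c in zip(sums, counts)]
```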
278
 
 
286
 
287
  <Image src={activations_magnitude} alt="Left: Activation norm per token for each of the 32 layers. Right: Average activation norm on a given layer. Average norms grow roughly linearly with layer depth, suggesting steering coefficients should scale proportionally" caption="Left: Activation norm per token for each of the 32 layers. Right: Average activation norm on a given layer. Average norms grow roughly linearly with layer depth, suggesting steering coefficients should scale proportionally" />
288
 
289
+ As we can see, activation norms grow roughly linearly across layers, with the typical norm being of the order of the layer index.
290
  If we want to look for a steering coefficient that is typically less than the original activation vector norm at layer $l$,
291
  we can define a reduced coefficient and restrict our search to:
292
 
 
311
  As we increase the steering coefficient in the range $5 < \alpha < 10$, **the concept inclusion metric increases, indicating that the model starts to reference the Eiffel Tower concept in its answers.
312
  However, this comes at the cost of a decrease in instruction following and fluency.**
313
  The decrease of those metrics occurs rather abruptly, indicating that there is a threshold effect.
314
+ The surprise under the reference model also starts to increase, indicating that the model is producing more surprising answers.
315
  The repetition metric increases, alongside the decrease in fluency.
316
  We can notice that **the threshold is around $\alpha=7-9$, which is roughly half the typical activation magnitude at that layer** (15).
317
It reveals that, in this case, steering with a coefficient of about half the original activation magnitude is what is required to significantly change the behavior of the model.
318
 
319
  For higher values of the steering coefficient, the concept inclusion metric decreases again, indicating that the model is no longer referencing the Eiffel Tower.
320
  Fluency and instruction following plummet to zero, as the model is producing gibberish, which is confirmed by the repetition metric.
321
+ Inspection of the answers shows that the model is producing repetitive patterns like "E E E E E ...".
322
 
323
+ Those metrics show that we face a fundamental trade-off: stronger steering increases concept inclusion but degrades fluency, and finding the balance is the challenge. This is further complicated by the very large standard deviation: **for a given steering coefficient, some prompts lead to good results while others completely fail.** Even though all metrics somehow tell the same story, we have to decide how to select the optimal steering coefficient. We could simply use the mean of the three LLM-judge metrics, but we can easily see that this would lead us to select the unsteered model (low $\alpha$) as the best model, which is not what we want. Instead, we can use **the harmonic mean criterion proposed by AxBench**. Those two ways of aggregating the three LLM-judge metrics are shown below as a function of the steering coefficient.
324
 
325
  <HtmlEmbed src="d3-harmonic-mean.html" data="stats_L15F21576.csv" />
326
 
327
+ First, the results show that the harmonic mean curve is very noisy. Even though we used 50 prompts to evaluate each point, the inherent discreteness of the LLM-judge metrics and the stochasticity of LLM generation lead to a large variance. This is something to keep in mind when trying to optimize steering coefficients.
328
 
329
  Still, from that curve, we can select the optimal $\alpha = 8.5$. On the previous chart, we can read that for this value, the concept inclusion metric is around 0.75, while instruction following is 1.5 and fluency around 1.0.
330
 
 
334
  Note that the harmonic mean we obtained here (about 0.45) is higher than the one reported in AxBench (about 0.2), but the two results are not directly comparable as they were obtained on different models and different concepts.
335
 
336
  <Note title="The steering 'sweet spot' is small." variant="success">
337
+ The optimal steering strength is of the order of half the magnitude of a layer's typical activation. This is consistent with the idea that steering vectors should not overwhelm the model's natural activations. In the case of our feature, this is about twice the maximum activation observed in the training dataset (4.77). However, there is only a very narrow region leading to the best harmonic mean of the LLM-judge metrics.
338
  </Note>
339
 
340
  ### 3.4 Detailed evaluation for the best steering coefficient
 
345
 
346
  We can see that on all metrics, **the baseline prompted model significantly outperforms the steered model.** This is consistent with the findings by AxBench that steering with SAEs is not very effective. However, our numbers are not as dire as theirs. We can see an average score in concept inclusion compared to the reference model (1.03), while maintaining a reasonable level of instruction following (1.35). However, this comes at the price of a fluency drop (0.78 vs. 1.55 for the prompted model), as fluency is impaired by repetitions (0.27) or awkward phrasing.
347
 
348
+ Overall, the harmonic mean of the three LLM-judge metrics is 1.67 for the prompted model, against 0.44 for the steered model.
349
 
350
  <Note title="A word on statistical significance" type="info">
351
  As can be seen on the bar chart, the fact that the evaluation is noisy leads to frighteningly large error bars, especially for the LLM-judge metrics and the harmonic mean. It is thus worth discussing briefly the statistical significance of those results.
352
 
353
+ The relevant quantity is the *effect size*, i.e. the difference between the two means divided by the pooled standard deviation, also known as *Cohen's d*. For a two-sample t-test with a total of $N$ samples across both groups, the critical effect size to reach significance at the level $p < 0.05$ is $d_c = 1.96 \times 2/\sqrt{N}$ (a quick numerical check is given after this note).
354
 
355
  In our case, with $400$ samples per group ($N=800$ total), this leads to a critical effect size of $0.14$. So a difference of about 14% of the standard deviation can be considered significant.
356
  </Note>
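The critical effect size quoted above can be checked with a two-line computation (a back-of-the-envelope sketch, not a full power analysis):

```python
import math

N = 800                                # 400 samples per group, two groups
d_critical = 1.96 * 2 / math.sqrt(N)   # two-sided p < 0.05, large-sample approximation
print(round(d_critical, 2))            # 0.14
```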
 
366
  capture the overall quality of the answer.
367
  However, as observed in our results, they are unfortunately **anticorrelated with concept inclusion**, showing the tradeoff between steering strength and answer quality.
368
 
369
+ The explicit inclusion metric (presence of the word *'eiffel'*) is only partially correlated with the LLM-judge concept inclusion metric (0.45), showing that the model can apparently reference the Eiffel Tower without explicitly mentioning it (we've also seen that sometimes Eiffel was misspelled but that was still considered as a valid reference by the LLM judge).
370
 
371
  We see that the **repetition metric is strongly anticorrelated with fluency and instruction following** (-0.9 for both).
372
 
373
+ Finally, log probability under the reference model is partially linked to fluency and instruction following (since more surprising answers are often less fluent), but also to concept inclusion, reflecting that referencing the Eiffel Tower often leads to more surprising answers.
374
 
375
  From this analysis, we can see that **although the LLM-as-a-judge metrics are the most reliable, the auxiliary metrics can provide useful information about the quality of the answers**.
376
+ This is interesting as it means we can use them as a guide for optimization without always having to rely on costly LLM evaluations, even if the final evaluation still has to be done with the LLM-judge metrics.
 
 
 
 
 
377
 
378
  ## 4. Steering and generation improvements
379
 
380
+ Having found optimal coefficients, we now investigate two complementary improvements that address the failure modes we identified: clamping to ensure consistent activations, and a repetition penalty to prevent the gibberish failure mode.
381
 
382
  First, we tried to clamp the activations rather than using the natural additive scheme.
383
+ Intuitively, this could have two benefits. First, it prevents the model from reaching excessively high activations; in the additive scheme, those could result from steering on top of activations that are already high because of the influence of the previous tokens output by the model. Second, clamping ensures that the feature is always activated at a given level. One hypothesis is that this could prevent the model from activating "suppressor" features that would counteract the effect of steering.
384
 
385
This clamping approach was the one used by Anthropic in their Golden Gate demo, but the AxBench paper reported that in their case it was less effective than the addition scheme. We decided to test it in our setting.
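To clarify the difference between the two schemes, here is a hedged sketch in which the feature's current activation is approximated by the projection of the residual stream onto the unit-normalized decoder direction `v` (an implementation could instead read the activation through the SAE encoder):

```python
import torch

def steer_add(hidden, v, alpha):
    """Additive scheme: always inject alpha * v on top of the current activations."""
    return hidden + alpha * v

def steer_clamp(hidden, v, alpha):
    """Clamping scheme: force the component along v to the value alpha, whatever its current value.

    A close variant only raises the component when it is below alpha:
        new_component = torch.clamp(current, min=alpha)
    """
    current = (hidden * v).sum(dim=-1, keepdim=True)  # projection onto the unit-norm feature direction
    return hidden + (alpha - current) * v
```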
386
 
387
  ### 4.1 Clamping
388
 
389
+ We tested the impact of clamping on the same steering vector at the optimal steering coefficient found previously ($\alpha=8.5$). We evaluated the model on the same set of prompts, with a maximum output length of 512 tokens.
390
 
391
  <HtmlEmbed src="d3-evaluation-configurable.html" data="evaluation_summary.json" config="clamp" />
392
 
393
+ We can see that **clamping has a positive effect on concept inclusion (both from the LLM score and the explicit reference), while not harming the other metrics**. The fact that concept inclusion (but not fluency or instruction following) is improved suggests that **clamping might help counteract some suppressor features that would prevent the Eiffel Tower concept from being fully activated**, but proving this hypothesis would require further investigation.
394
 
395
  We therefore opted for clamping, in line with the choice made by Anthropic. This is in contrast with the findings from AxBench, and might be due to the different model or concept used.
396
 
397
  <Note title="Clamping is more effective than adding." variant="success">
398
+ We found that clamping activations improves concept inclusion without harming fluency. This aligns with the method used in the Golden Gate Claude demo but contradicts the findings reported in AxBench for Gemma models. This might be due to differences in model architecture or the specific concept being steered.
399
  </Note>
400
 
401
+
402
  ### 4.2 Generation parameters
403
 
404
  We have seen that repetition is a major cause of loss of fluency when steering with SAEs.
 
411
(Note that the AxBench paper mentioned the repetition penalty but did not use it, considering it *"not the fairest setting, as it often does not accurately resemble normal user behaviour"*; see their appendix K.)
412
 
413
  <Note title="Lower temperature and repetition penalty improve model fluency and instruction following" variant="success">
414
+ Using a lower temperature (0.5) and applying a modest repetition penalty (1.1) during generation significantly reduces repetitions in the output. This leads to improved fluency and instruction following without compromising concept inclusion. A sketch of these generation settings is given below.
415
  </Note>
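As a sketch, these generation settings map directly onto a standard Hugging Face generation configuration (values as in the note above; the rest of our generation harness is omitted):

```python
from transformers import GenerationConfig

gen_config = GenerationConfig(
    max_new_tokens=512,
    do_sample=True,
    temperature=0.5,         # lower temperature than the default sampling setup
    repetition_penalty=1.1,  # modest penalty against "E E E E ..." style loops
)
# With `model` and `input_ids` defined as in the earlier sketches:
# output = model.generate(input_ids, generation_config=gen_config)
```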
416
 
417
 
 
465
 
466
  The idea behind BO is to build a surrogate model of the target function using a Gaussian Process (GP), and use that surrogate to select promising candidates to evaluate next. As we evaluate new points, we update the GP model, and iteratively refine our surrogate of the target function.
467
 
468
+ For that, we used the [BoTorch library](https://botorch.org), which provides a flexible framework to perform BO using PyTorch. More details are given in the appendix.
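For illustration, here is a minimal BoTorch loop of the kind described above. It is a generic sketch rather than our exact setup: the objective below is a toy stand-in for the (noisy) harmonic-mean score, and the dimensionality, bounds and budgets are placeholders.

```python
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import ExpectedImprovement
from botorch.optim import optimize_acqf
from gpytorch.mlls import ExactMarginalLogLikelihood

def evaluate_steering(coeffs: torch.Tensor) -> torch.Tensor:
    """Toy stand-in: in our setup this would run steered generation and return the harmonic-mean score."""
    return -((coeffs - 0.6) ** 2).sum()

dim = 2                                                             # e.g. one coefficient per steered feature
bounds = torch.stack([torch.zeros(dim), torch.ones(dim)]).double()  # normalized search space

train_X = torch.rand(8, dim, dtype=torch.double)                    # a few random initial points
train_Y = torch.stack([evaluate_steering(x) for x in train_X]).reshape(-1, 1).double()

for _ in range(20):                                                 # BO iterations
    gp = SingleTaskGP(train_X, train_Y)
    fit_gpytorch_mll(ExactMarginalLogLikelihood(gp.likelihood, gp))
    acq = ExpectedImprovement(gp, best_f=train_Y.max())
    candidate, _ = optimize_acqf(acq, bounds=bounds, q=1, num_restarts=5, raw_samples=64)
    new_y = evaluate_steering(candidate.squeeze(0)).reshape(1, 1).double()
    train_X = torch.cat([train_X, candidate])
    train_Y = torch.cat([train_Y, new_y])

print("Best coefficients found:", train_X[train_Y.argmax()])
```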
469
 
470
  ### 5.3 Results of multi-layer optimization
471
 
472
+ We first performed the optimization using only 2 features (from layers 15 and 19), and then 8 features (from layers 11, 15, 19 and 23), following the idea that steering upper-middle layers is likely to be more effective at activating high-level concepts.
473
 
474
  Results are shown below and compared to single-layer steering.
475
 
 
483
 
484
  One possible explanation is our inability to find the true optimum, as the harmonic mean metric is very noisy and hard to optimize in the high-dimensional space.
485
 
486
+ Another plausible explanation could be that **the selected features are actually redundant rather than complementary**, and that steering one of them is sufficient to fully activate the concept. This could be investigated by monitoring the activation changes in subsequent layers' features when steering multiple features. For instance, for the features located on layers 15 and 19, anecdotal evidence from Neuronpedia's top activating examples for both features reveals several common prompts, suggesting redundancy rather than complementarity.
487
 
488
  <Note title="More features don't necessarily mean better steering." variant="success">
489
  Counterintuitively, steering multiple "Eiffel Tower" features at once yielded only marginal benefits over steering a single, well-chosen feature. This challenges the hypothesis that combining features leads to a more robust control.
 
501
 
502
  A way to explain this lack of improvement could be that the selected features are actually redundant rather than complementary, and that steering one of them is sufficient to activate the concept. Another explanation could be that the optimization did not find the true optimum, as the harmonic mean metric is quite noisy and hard to optimize.
503
 
504
+ Overall, our results are in line with the success of the Golden Gate Claude demo, although we don't know all the details of their steering method. Our results also seem less discouraging than those of AxBench, and show that steering with SAEs can be effective when using clamping, a slightly different generation procedure, and possibly a combination of multiple features. However, at this stage, those results are hard to generalize, and our work is not directly comparable to the AxBench results, since they use a different model, different concepts, and different SAEs.
505
 
506
  ### 6.2 Opportunities for future work
507
 
508
+ This investigation opens several avenues for future work, which could not only help find good procedures for steering with SAEs, but also reveal fundamental insights about activation patterns in LLMs. Among them:
509
 
510
+ - **Investigate clamping:** Why do we find that clamping helps, similar to Anthropic, while AxBench found the opposite? We could hypothesize that it prevents extreme activations, but it could also counteract some negative feedback behavior, when other parts of the model activate suppressor features to try to compensate for the added steering vector. Can we draw an analogy with biology, where signaling pathways are often regulated by negative feedback loops? An interesting direction could be to analyze the cases where the model tries to "backtrack", e.g. outputting *"I'm the Eiffel Tower. No, actually I'm not."* By analyzing the activations just before the "No", can we highlight some *regulatory/suppressor features* that try to suppress the Eiffel Tower concept when it has been overactivated?
511
- **Why does steering multiple features achieve only marginal improvement?** Check complementarity vs. redundancy of multiple features by monitoring activation changes in subsequent layers' features.
512
+ - **Failure analysis** on the cases where steering fails (about 20% have at least one zero metric). Is there a pattern?
513
+ - **Check other layers for 1D optimization, as well as other concepts and other models**, to see if some layers are better than others. In particular, try to include earlier and later layers to see if it helps the multi-layer steering.
 
 
514
  - **Vary the temporal steering pattern:** steer only the prompt, or the generated answer only, or use some kind of periodic steering.
515
+ - **Investigate wording in the "prompt engineering" case**. For now, the model seems to behave as if it has to check a box, rather than actually integrating the concept in a natural way. Can we make it better? Does it show up in the activation pattern? For instance, after mentioning the Eiffel Tower, does the model activate regulatory features to prevent further mentions?
 
 
516
 
517
  ---
518
 
519
  ## Appendix
520
 
521
  ### nnsight code
522
+ Example of the code used to perform steering and generation with `nnsight`:
523
  ```python
524
  input_ids = llm.tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True)
525
  with llm.generate() as tracer: