dlouapre HF Staff commited on
Commit
e44eaff
·
1 Parent(s): 09ce9a2

Improving text and notes

Browse files
app/src/content/article.mdx CHANGED
@@ -45,19 +45,19 @@ import ggc_snowhite from './assets/image/golden_gate_claude_snowhite.jpeg'
45
  <Image src={ggc_snowhite} alt="One of the many examples of Golden Gate Claude conversations"
46
  caption='One of the many examples of Golden Gate Claude conversations <a target="_blank" href="https://x.com/JE_Colors1/status/1793747959831843233">Source</a>' />
47
 
48
- Since then, sparse autoencoders (SAEs) have become one of the key tools in the field of mechanistic interpretability [@cunningham2023sparse; @lieberum2024gemma; @gao2024scaling] and steering activations sparked the interest of many (see for instance [the value of steering](https://thezvi.substack.com/i/144959102/the-value-of-steering) by Zvi Mowshowitz, or the work by [GoodFire AI](https://www.goodfire.ai/blog/feature-steering-for-reliable-and-expressive-ai-engineering))
49
 
50
- However, as far as I know, **nobody has tried to reproduce something similar to the Golden Gate Claude demo!** Moreover, recently the AxBench paper [@wu2025axbench] found that steering with SAEs was *one of the least effective methods to steer a model toward a desired concept*. How can we reconcile this with the success of the Golden Gate Claude?
51
 
52
  The aim of this article is to investigate how SAEs can be used to reproduce **a demo similar to Golden Gate Claude, but using a lightweight open-source model**. For this we used *Llama 3.1 8B Instruct*, but since I live in Paris...let’s make it obsessed with the Eiffel Tower!
53
 
54
- By doing this, we will realize that steering a model with vectors coming from SAEs is actually harder than we might have thought. However, we will devise several improvements over naive steering. While we focus on a single, concrete example — the Eiffel Tower — our goal is to establish a methodology for systematically evaluating and optimizing SAE steering, which could then be applied to other models and concepts.
55
 
56
  **Our main findings:**
57
  <Note title="" variant="success">
58
- - **The steering 'sweet spot' is smaller than you think.** The optimal steering strength is roughly half the magnitude of a layer's typical activation. This is consistent with the idea that steering vectors should not overwhelm the model's natural activations.
59
  - **Clamping is more effective than adding.** We found that clamping activations (capping them at a maximum value) improves concept inclusion without harming fluency. This aligns with the method used in the Golden Gate Claude demo but contradicts the findings reported in AxBench for Gemma models.
60
- - **More features don't necessarily mean better steering.** Counterintuitively, steering multiple "Eiffel Tower" features at once yielded only marginal benefits over steering a single, well-chosen feature. This challenges the hypothesis that combining features is the key to robust control.
61
  - **SAE steering shows promise, but prompting is still king.** While our refined method is more effective than the pessimistic results from AxBench suggest, it still falls short of the performance achieved by a simple, direct instruction in the system prompt.
62
  </Note>
63
 
@@ -76,7 +76,7 @@ Steering a model consists in modifying its internal activations *during generati
76
  This differs from fine-tuning, which consists in modifying the weights of a base model during a training phase to obtain a new model with the desired behavior.
77
 
78
  Most of the time, steering involves adding a vector to the internal activations at a given layer, either on the residual stream or on the output of the attention or MLP blocks.
79
- More specifically, if $x^l$ is the vector of activation at layer $l$, steering consists in adding a vector $v$ that is generally scaled by a coefficient $\alpha$,
80
  $$
81
  x^l \to x^l + \alpha v.
82
  $$
@@ -110,9 +110,9 @@ Thanks to the search interface on Neuronpedia, we can look for candidate feature
110
  According to analysis by Anthropic in their [Biology of LLMs paper, section 13](https://transformer-circuits.pub/2025/attribution-graphs/biology.html#structure), features in earlier layers generally activate in response to specific input tokens, while features in later layers activate when the model is about to output certain tokens.
111
  So the common wisdom is that **steering is more efficient when done in middle layers**, as the associated features are believed to be representing higher-level abstract concepts.
112
  Anthropic mentioned that for their Golden Gate demo, they used a feature located in a middle layer, but they didn’t disclose which one exactly since their architecture is not public.
113
- Since Llama 3.1 8B has 32 layers, we decided to look at layer 15. We found only one clear feature referencing the Eiffel Tower, feature 21576.
114
 
115
- The corresponding Neuronpedia page is included below. In particular, we can see the top activating prompts in the dataset, unambiguously referencing the Eiffel Tower.
 
116
 
117
  <iframe src="https://www.neuronpedia.org/llama3.1-8b-it/15-resid-post-aa/21576?embed=true&embedexplanation=true&embedplots=true&embedtest=true" title="Neuronpedia" style="height: 900px; width: 100%;"></iframe>
118
 
@@ -131,15 +131,14 @@ However, things are not as clear with a different input. With a more open prompt
131
  <HtmlEmbed src="d3-first-experiments.html" data="first_experiments.csv" />
132
 
133
 
134
- In their own paper, Anthropic mentioned using values ranging from **5 to 10 times the maximum observed activation**. In our case, the maximum observed activation is 4.77, so that would mean using values between about 25 and 50. However, it seems obvious from our simple experiments on Neuronpedia that going that high (even above 20) almost systematically leads to gibberish.
135
 
136
  It seems that (at least with a small open-source model) **steering with SAEs is harder than we might have thought**.
137
 
138
 
139
-
140
  ### 1.3 The AxBench paper
141
 
142
- Indeed, in January 2025, the [AxBench](https://arxiv.org/abs/2501.17148) paper benchmarked several steering procedures, and indeed found using SAEs to be one of the least effective methods.
143
  Using Gemmascope (SAEs trained on Gemma 2B and 9B), they found that it is almost impossible to steer the model in such a way that it consistently references the target concept, while simultaneously maintaining fluency and instruction following behavior.
144
 
145
  To quote their conclusion:
@@ -165,7 +164,7 @@ However, for this, we will need rigorous metrics to evaluate the quality of our
165
 
166
  To assess the quality of a steered model such as our *Eiffel Tower Llama*, we cannot rely solely on our subjective feelings.
167
  Especially since we will have to choose a good value for steering strength, we need some metrics for evaluation.
168
- First, let's not reinvent the wheel and use the same metrics as AxBench.
169
 
170
  ### 2.1 The AxBench LLM-judge metrics
171
 
@@ -231,7 +230,7 @@ Because of this, we considered **auxiliary metrics that could help us monitor th
231
 
232
  #### 2.3.1 Surprise within the reference model
233
 
234
- Since we want our steered model to output answers that are funny and surprising, we expect those answers to have had a low probability in the reference model.
235
  For that we decided to monitor the negative log probability (per token) under the reference model, which represents the surprise in the reference model. (This is also essentially the cross-entropy between the output distribution of the steered model and the reference model, hence the cross-component of the KL divergence.)
236
 
237
  Although the negative log prob seems an interesting metric to monitor, note that we don't necessarily want to bring it to extreme values. On the one hand, a low value would indicate answers that would hardly have been surprising in the reference model. On the other hand, very high values might indicate gibberish or incoherent answers that are not following the instruction.
@@ -286,7 +285,7 @@ For our model Llama 3.1 8B Instruct, this is shown below for a typical prompt (t
286
 
287
  import activations_magnitude from './assets/image/activations_magnitude.png'
288
 
289
- <Image src={activations_magnitude} alt="Activation norms grow roughly linearly with layer depth, suggesting steering coefficients should scale proportionally" caption="Activation norms grow roughly linearly with layer depth, suggesting steering coefficients should scale proportionally" />
290
 
291
As we can see, activation norms grow roughly linearly across layers, with the norm approximately equal to the layer index.
292
  If we want to look for a steering coefficient that is typically less than the original activation vector norm at layer $l$,
@@ -338,6 +337,10 @@ This conclusion is in line with the results from AxBench showing that steering w
338
 
339
  Note that the harmonic mean we obtained here (about 0.45) is higher than the one reported in AxBench (about 0.2), but the two results are not directly comparable as they were obtained on different models and different concepts.
340
 
 
 
 
 
341
  ### 3.4 Detailed evaluation for the best steering coefficient
342
 
343
  Using the optimal steering coefficient $\alpha=8.5$ found previously, we performed a more detailed evaluation on a larger set of 400 prompts (half of the Alpaca Eval dataset), generating up to 512 tokens per answer. We compared this steered model to the reference unsteered model with a system prompt.
@@ -348,8 +351,12 @@ We can see that on all metrics, **the baseline prompted model significantly outp
348
 
349
  Overall, the harmonic mean of the three LLM-judge metrics is 1.67 for the prompted model, against 0.344 for the steered model.
350
 
351
- <Note type="info">
352
- As can be seen on the bar chart, the fact that the evaluation is noisy leads to frighteningly large error bars, especially for the LLM-judge metrics and the harmonic mean. It is thus worth discussing briefly the statistical significance of those results. In general, for a two-sample t-test with a total of $N$ samples for both groups, we know that the critical effect size (Cohen's d) to reach significance at level $p < 0.05$ is $d =(1.96) \frac{2}{\sqrt{N}}$. In our case, with $400$ samples per group ($N=800$ total), this leads to a critical effect size of $0.14$. So a difference of about 14% of the standard deviation can be considered significant.
 
 
 
 
353
  </Note>
354
 
355
  ### 3.5 Correlations between metrics
@@ -398,10 +405,14 @@ We can see that **clamping has a positive effect on concept inclusion (both from
398
 
399
  We therefore opted for clamping, in line with the choice made by Anthropic. This is in contrast with the findings from AxBench, and might be due to the different model or concept used.
400
 
 
 
 
 
401
  ### 4.2 Generation parameters
402
 
403
  We have seen that repetition is a major cause of loss of fluency when steering with SAEs.
404
- To mitigate this, we tried applying a lower temperature, and apply a repetition penalty during generation.
405
  This is a simple technique that consists of penalizing the logit of tokens that have already been generated, preventing the model from repeating itself.
406
We used a penalty factor of 1.1 via the `repetition_penalty` parameter of the generation process in 🤗Transformers (which implements the repetition penalty described in the [CTRL paper](https://arxiv.org/abs/1909.05858)).
407
 
@@ -409,6 +420,9 @@ As we can see, applying a repetition penalty reduces as expected the 3-gram repe
409
 
410
(Note that the AxBench paper mentions the repetition penalty but does not use it, considering it *"not the fairest setting, as it often does not accurately resemble normal user behaviour"*; see their appendix K.)
411
 
 
 
 
412
 
413
 
414
  ## 5. Multi-Layer optimization
@@ -429,18 +443,18 @@ Among those 19 features, we selected all the features located in the intermediat
429
 
430
  ### 5.2 Optimization methodology
431
 
432
- Finding the optimal steering coefficients for multiple features is a challenging optimization problem.
433
- First, the parameter space grows with the number of features, making grid search or random search quickly intractable.
434
- Second, the target function (the harmonic mean of LLM-judge metrics) is noisy and non-differentiable, making gradient-based optimization impossible.
435
- Finally, evaluating the target function is costly, as it requires generating answers from the steered model and evaluating them with an LLM judge.
436
 
437
- To tackle those challenges, we decided to rely on **Bayesian optimization** to search for the optimal steering coefficients, and we devised an auxiliary cost function to guide the optimization when the harmonic mean is zero.
438
 
439
  #### 5.2.1 Cost function
440
 
441
  Following the AxBench paper, we decided to look for steering coefficients that would maximize the harmonic mean of the three LLM-judge metrics. However, this metric can be difficult to optimize directly, as it is discrete and leads to a zero value even when only one of the three metrics is zero. This might make it hard to explore the parameter space.
442
 
443
- To mitigate that, we decided to define an auxiliary cost function that would be used when the harmonic mean is zero. Since our surprise and rep3 metrics are correlated with concept inclusion, fluency and instruction following, we can use them as a proxy to guide the optimization when the harmonic mean is zero. We considered an auxiliary cost function of the form
444
  $$
445
  \mathrm{cost} = |\mathrm{surprise} - s_0| + k\ \text{rep3}
446
  $$
@@ -471,11 +485,19 @@ Results are shown below and compared to single-layer steering.
471
 
472
  <HtmlEmbed src="d3-evaluation3-multi.html" data="evaluation_summary.json" />
473
 
474
- As we can see on the chart, steering 2 or even 8 features simultaneously leads to **only marginal improvements** compared to steering only one feature. Although fluency and instruction following are improved, concept inclusion slightly decreases, leading to a harmonic mean that is only marginally better than single-layer steering. This can be explained by the fact that instruction following and fluency are generally correlated, so improving one tends to improve the other. Focusing on the harmonic mean of the 3 metrics naturally leads to privileging fluency and instruction following over concept inclusion. Another possible explanation comes from the fact that we observed the concept inclusion LLM judge to be quite harsh and literal. Sometimes mention of Paris or a large metal structure were not considered as valid references to the Eiffel Tower, which could explain the low concept inclusion scores.
475
 
476
- Overall, those disappointing results contradict our initial hypothesis that steering multiple complementary features would help better represent the concept and maintain fluency. One possible explanation is our inability to find the true optimum, as the harmonic mean metric is very noisy and hard to optimize. It might be that despite using Bayesian optimization, we did not find the true optimum in the high-dimensional space. Another plausible explanation could be that the selected features are actually redundant rather than complementary, and that steering one of them is sufficient to activate the concept. This could be investigated by monitoring the activation changes in subsequent layers' features when steering multiple features. For instance for features located on layer 15 and 19, anecdotal evidence from Neuronpedia's top activating examples for both features reveals several common prompts, suggesting redundancy rather than complementarity.
477
 
 
478
 
 
 
 
 
 
 
 
479
 
480
  ## 6. Conclusion & Discussion
481
 
 
45
  <Image src={ggc_snowhite} alt="One of the many examples of Golden Gate Claude conversations"
46
  caption='One of the many examples of Golden Gate Claude conversations <a target="_blank" href="https://x.com/JE_Colors1/status/1793747959831843233">Source</a>' />
47
 
48
+ Since then, sparse autoencoders (SAEs) have become one of the key tools in the field of mechanistic interpretability [@cunningham2023sparse; @lieberum2024gemma; @gao2024scaling], and steering activations has sparked the interest of many. See for instance [the value of steering](https://thezvi.substack.com/i/144959102/the-value-of-steering) by Zvi Mowshowitz, or [Feature Steering for Reliable and Expressive AI Engineering](https://www.goodfire.ai/blog/feature-steering-for-reliable-and-expressive-ai-engineering) by GoodFire AI.
49
 
50
+ However, as far as I know, **nobody has tried to reproduce something similar to the Golden Gate Claude demo.** Moreover, recently the AxBench paper [@wu2025axbench] found that steering with SAEs was *one of the least effective methods to steer a model toward a desired concept*. How can we reconcile this with the success of the Golden Gate Claude?
51
 
52
  The aim of this article is to investigate how SAEs can be used to reproduce **a demo similar to Golden Gate Claude, but using a lightweight open-source model**. For this we used *Llama 3.1 8B Instruct*, but since I live in Paris...let’s make it obsessed with the Eiffel Tower!
53
 
54
+ By doing this, we will realize that steering a model with activation vectors extracted from SAEs is actually harder than we might have thought. However, we will devise several improvements over naive steering. While we focus on a single, concrete example — the Eiffel Tower — our goal is to establish a methodology for systematically evaluating and optimizing SAE steering, which could then be applied to other models and concepts.
55
 
56
  **Our main findings:**
57
  <Note title="" variant="success">
58
+ - **The steering 'sweet spot' is small.** The optimal steering strength is of the order of half the magnitude of a layer's typical activation. This is consistent with the idea that steering vectors should not overwhelm the model's natural activations.
59
  - **Clamping is more effective than adding.** We found that clamping activations (capping them at a maximum value) improves concept inclusion without harming fluency. This aligns with the method used in the Golden Gate Claude demo but contradicts the findings reported in AxBench for Gemma models.
60
+ - **More features don't necessarily mean better steering.** Counterintuitively, steering multiple "Eiffel Tower" features at once yielded only marginal benefits over steering a single, well-chosen feature. This challenges the hypothesis that combining features leads to more robust control.
61
  - **SAE steering shows promise, but prompting is still king.** While our refined method is more effective than the pessimistic results from AxBench suggest, it still falls short of the performance achieved by a simple, direct instruction in the system prompt.
62
  </Note>
63
 
 
76
  This differs from fine-tuning, which consists in modifying the weights of a base model during a training phase to obtain a new model with the desired behavior.
77
 
78
  Most of the time, steering involves adding a vector to the internal activations at a given layer, either on the residual stream or on the output of the attention or MLP blocks.
79
+ More specifically, if $x^l$ is the activation vector at layer $l$, steering consists in adding a vector $v$ that is scaled by a coefficient $\alpha$,
80
  $$
81
  x^l \to x^l + \alpha v.
82
  $$
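As a minimal sketch of what this intervention looks like in code, here is a 🤗 Transformers forward hook that adds a scaled direction to the residual stream after one decoder layer. The model name, prompt, and placeholder direction `v` are illustrative; in practice `v` would be a unit-norm SAE decoder direction for the chosen feature.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"
LAYER, ALPHA = 15, 8.5

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Placeholder steering direction; in practice, the unit-norm decoder column of the SAE feature.
v = torch.randn(model.config.hidden_size)
v = v / v.norm()

def add_steering_vector(module, inputs, output):
    # Decoder layers return the residual-stream hidden states (possibly inside a tuple).
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * v.to(hidden.device, hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(add_steering_vector)

messages = [{"role": "user", "content": "Describe your ideal weekend."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
generated = model.generate(input_ids, max_new_tokens=128, do_sample=True)
print(tokenizer.decode(generated[0, input_ids.shape[1]:], skip_special_tokens=True))

handle.remove()  # restore the unsteered model
```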
 
110
  According to analysis by Anthropic in their [Biology of LLMs paper, section 13](https://transformer-circuits.pub/2025/attribution-graphs/biology.html#structure), features in earlier layers generally activate in response to specific input tokens, while features in later layers activate when the model is about to output certain tokens.
111
  So the common wisdom is that **steering is more efficient when done in middle layers**, as the associated features are believed to be representing higher-level abstract concepts.
112
  Anthropic mentioned that for their Golden Gate demo, they used a feature located in a middle layer, but they didn’t disclose which one exactly since their architecture is not public.
 
113
 
114
+
115
+ Since Llama 3.1 8B has 32 layers, we decided to look at layer 15. In the SAE data published on Neuronpedia by Andi Arditi, we found only one clear feature referencing the Eiffel Tower, feature 21576. The corresponding Neuronpedia page is included below. In particular, we can see the top-activating prompts from the dataset, which unambiguously reference the Eiffel Tower.
116
 
117
  <iframe src="https://www.neuronpedia.org/llama3.1-8b-it/15-resid-post-aa/21576?embed=true&embedexplanation=true&embedplots=true&embedtest=true" title="Neuronpedia" style="height: 900px; width: 100%;"></iframe>
118
 
 
131
  <HtmlEmbed src="d3-first-experiments.html" data="first_experiments.csv" />
132
 
133
 
134
+ In their paper, Anthropic mentioned using values ranging from **5 to 10 times the maximum observed activation**. In our case, the maximum observed activation is 4.77, so that would mean using values between about 25 and 50. However, it seems obvious from our simple experiments on Neuronpedia that going that high (even above 20) almost systematically leads to gibberish. It is not clear to us why Anthropic could use such high values without breaking the model's generation.
135
 
136
  It seems that (at least with a small open-source model) **steering with SAEs is harder than we might have thought**.
137
 
138
 
 
139
  ### 1.3 The AxBench paper
140
 
141
+ Indeed, in January 2025, the AxBench paper [@wu2025axbench] benchmarked several steering procedures and found using SAEs to be one of the least effective methods.
142
  Using Gemmascope (SAEs trained on Gemma 2B and 9B), they found that it is almost impossible to steer the model in such a way that it consistently references the target concept, while simultaneously maintaining fluency and instruction following behavior.
143
 
144
  To quote their conclusion:
 
164
 
165
  To assess the quality of a steered model such as our *Eiffel Tower Llama*, we cannot rely solely on our subjective feelings.
166
  Especially since we will have to choose a good value for steering strength, we need some metrics for evaluation.
167
+ First, let's not reinvent the wheel, and use the same metrics as AxBench.
168
 
169
  ### 2.1 The AxBench LLM-judge metrics
170
 
 
230
 
231
  #### 2.3.1 Surprise within the reference model
232
 
233
+ Since we want our steered model to output answers that are funny and surprising, we expect those answers to have had *a low probability in the reference model*.
234
  For that we decided to monitor the negative log probability (per token) under the reference model, which represents the surprise in the reference model. (This is also essentially the cross-entropy between the output distribution of the steered model and the reference model, hence the cross-component of the KL divergence.)
235
 
236
  Although the negative log prob seems an interesting metric to monitor, note that we don't necessarily want to bring it to extreme values. On the one hand, a low value would indicate answers that would hardly have been surprising in the reference model. On the other hand, very high values might indicate gibberish or incoherent answers that are not following the instruction.
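As a rough sketch, this surprise metric can be computed by scoring the steered model's answer tokens with the reference model. The names `ref_model`, `prompt_ids` and `answer_ids` are ours and purely illustrative, not code from the project.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def surprise_per_token(ref_model, prompt_ids, answer_ids):
    """Mean negative log probability of the answer tokens under the reference model."""
    input_ids = torch.cat([prompt_ids, answer_ids]).unsqueeze(0).to(ref_model.device)
    logits = ref_model(input_ids).logits[0]
    # Logits at position t predict token t+1, so keep the positions that predict answer tokens.
    start = prompt_ids.shape[0]
    log_probs = F.log_softmax(logits[start - 1 : -1].float(), dim=-1)
    token_logps = log_probs.gather(-1, answer_ids.unsqueeze(-1).to(log_probs.device)).squeeze(-1)
    return -token_logps.mean().item()
```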
 
285
 
286
  import activations_magnitude from './assets/image/activations_magnitude.png'
287
 
288
+ <Image src={activations_magnitude} alt="Left: Activation norm per token for each of the 32 layers. Right: Average activation norm on a given layer. Average norms grow roughly linearly with layer depth, suggesting steering coefficients should scale proportionally" caption="Left: Activation norm per token for each of the 32 layers. Right: Average activation norm on a given layer. Average norms grow roughly linearly with layer depth, suggesting steering coefficients should scale proportionally" />
289
 
290
As we can see, activation norms grow roughly linearly across layers, with the norm approximately equal to the layer index.
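These per-layer norms are easy to reproduce, for instance by requesting hidden states on a typical prompt. This is a sketch assuming `model` and `tokenizer` are already loaded as above; the prompt is illustrative.

```python
import torch

@torch.no_grad()
def mean_activation_norms(model, tokenizer, prompt):
    """Average L2 norm of the residual stream at each layer, for a single prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    hidden_states = model(**inputs, output_hidden_states=True).hidden_states
    # hidden_states is a tuple of (num_layers + 1) tensors of shape [1, seq_len, hidden_size]
    return [h[0].norm(dim=-1).mean().item() for h in hidden_states]

for layer, norm in enumerate(mean_activation_norms(model, tokenizer, "Give me three tips to stay healthy.")):
    print(f"layer {layer:2d}: mean activation norm ≈ {norm:.1f}")
```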
291
  If we want to look for a steering coefficient that is typically less than the original activation vector norm at layer $l$,
 
337
 
338
  Note that the harmonic mean we obtained here (about 0.45) is higher than the one reported in AxBench (about 0.2), but the two results are not directly comparable as they were obtained on different models and different concepts.
339
 
340
+ <Note title="The steering 'sweet spot' is small." variant="success">
341
+ The optimal steering strength is of the order of half the magnitude of a layer's typical activation. This is consistent with the idea that steering vectors should not overwhelm the model's natural activations. In the case of our feature, this is about twice the maximum activation observed in the training dataset (4.77).
342
+ </Note>
343
+
344
  ### 3.4 Detailed evaluation for the best steering coefficient
345
 
346
  Using the optimal steering coefficient $\alpha=8.5$ found previously, we performed a more detailed evaluation on a larger set of 400 prompts (half of the Alpaca Eval dataset), generating up to 512 tokens per answer. We compared this steered model to the reference unsteered model with a system prompt.
 
351
 
352
  Overall, the harmonic mean of the three LLM-judge metrics is 1.67 for the prompted model, against 0.344 for the steered model.
353
 
354
+ <Note title="A word on statistical significance" type="info">
355
+ As can be seen on the bar chart, the fact that the evaluation is noisy leads to frighteningly large error bars, especially for the LLM-judge metrics and the harmonic mean. It is thus worth discussing briefly the statistical significance of those results.
356
+
357
+ The relevant quantity is the *effect size*, i.e. the difference between the two means divided by the pooled standard deviation, also known as *Cohen's d*. In general, for a two-sample t-test with a total of $N$ samples for both groups, we know that the critical effect size to reach significance at level $p < 0.05$ is $d_c = (1.96) \times 2/\sqrt{N}$.
358
+
359
+ In our case, with $400$ samples per group ($N=800$ total), this leads to a critical effect size of $0.14$. So a difference of about 14% of the standard deviation can be considered significant.
360
  </Note>
361
 
362
  ### 3.5 Correlations between metrics
 
405
 
406
  We therefore opted for clamping, in line with the choice made by Anthropic. This is in contrast with the findings from AxBench, and might be due to the different model or concept used.
407
 
408
+ <Note title="Clamping is more effective than adding." variant="success">
409
+ We found that clamping activations (capping them at a maximum value) improves concept inclusion without harming fluency. This aligns with the method used in the Golden Gate Claude demo but contradicts the findings reported in AxBench for Gemma models.
410
+ </Note>
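To make the two interventions concrete, here is a hedged sketch in residual space, under one common formulation of clamping (pinning the feature at the target value). Here `d` is the feature's decoder direction and `encoder_w`/`encoder_b` stand in for the corresponding SAE encoder row; all names are illustrative, not a specific SAE library API.

```python
import torch

def steer(hidden, d, alpha, encoder_w, encoder_b=0.0, mode="clamp"):
    """Steer the residual stream along a single SAE feature direction `d`.

    mode="add":   x -> x + alpha * d
    mode="clamp": x -> x + (alpha - a) * d, where a is the current feature activation,
                  so the feature ends up pinned at the target value alpha.
    """
    if mode == "add":
        return hidden + alpha * d
    # Current feature activation: ReLU of the encoder pre-activation for this feature.
    a = torch.relu(hidden @ encoder_w + encoder_b)        # shape [batch, seq_len]
    return hidden + (alpha - a).unsqueeze(-1) * d
```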
411
+
412
  ### 4.2 Generation parameters
413
 
414
  We have seen that repetition is a major cause of loss of fluency when steering with SAEs.
415
+ To mitigate this, we tried applying a lower temperature (0.5) and a repetition penalty during generation.
416
  This is a simple technique that consists of penalizing the logit of tokens that have already been generated, preventing the model from repeating itself.
417
We used a penalty factor of 1.1 via the `repetition_penalty` parameter of the generation process in 🤗Transformers (which implements the repetition penalty described in the [CTRL paper](https://arxiv.org/abs/1909.05858)).
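In 🤗 Transformers this corresponds to a generation call along these lines, reusing `model` and `input_ids` from the earlier sketch; only the sampling settings matter here.

```python
generated = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.5,          # lower temperature reduces degenerate sampling
    repetition_penalty=1.1,   # CTRL-style penalty on already-generated tokens
)
```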
418
 
 
420
 
421
(Note that the AxBench paper mentions the repetition penalty but does not use it, considering it *"not the fairest setting, as it often does not accurately resemble normal user behaviour"*; see their appendix K.)
422
 
423
+ <Note title="Lower temperature and repetition penalty improve model fluency and instruction following" variant="success">
424
+ Using a lower temperature (0.5) and applying a modest repetition penalty during generation significantly reduces repetitions in the output. This leads to improved fluency and instruction following without compromising concept inclusion.
425
+ </Note>
426
 
427
 
428
  ## 5. Multi-Layer optimization
 
443
 
444
  ### 5.2 Optimization methodology
445
 
446
+ Finding the optimal steering coefficients for multiple features is a challenging optimization problem:
447
+ - First, the parameter space grows with the number of features, making grid search quickly intractable.
448
+ - Second, the target function (the harmonic mean of LLM-judge metrics) is noisy and non-differentiable, making gradient-based optimization impossible.
449
+ - Finally, evaluating the target function is costly, as it requires generating answers from the steered model and evaluating them with an LLM judge.
450
 
451
+ To tackle those challenges, we decided to rely on **Bayesian optimization** to search for the optimal steering coefficients, and we devised an auxiliary cost function to guide the optimization when the harmonic mean is zero and hence non-informative.
452
 
453
  #### 5.2.1 Cost function
454
 
455
  Following the AxBench paper, we decided to look for steering coefficients that would maximize the harmonic mean of the three LLM-judge metrics. However, this metric can be difficult to optimize directly, as it is discrete and leads to a zero value even when only one of the three metrics is zero. This might make it hard to explore the parameter space.
456
 
457
+ To mitigate that, we defined an auxiliary cost function to fall back on when the harmonic mean is zero. Since our *surprise* and *rep3* metrics are correlated with concept inclusion, fluency and instruction following, we can use them as a proxy to guide the optimization in that regime. We considered an auxiliary cost function of the form
458
  $$
459
  \mathrm{cost} = |\mathrm{surprise} - s_0| + k\ \text{rep3}
460
  $$
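As a sketch of how the search itself can be run, an off-the-shelf Bayesian optimizer such as Optuna can combine the two objectives: minimize the auxiliary cost when the judge score is zero, and maximize the harmonic mean otherwise. Here `evaluate_steering`, the layer list, the coefficient range, and the constants `S0` and `K` are placeholders, not the project's actual code.

```python
import optuna

S0, K = 3.0, 1.0  # placeholder values for the target surprise s_0 and the weight k

def objective(trial):
    # One steering coefficient per selected feature (layers and ranges are illustrative).
    alphas = {layer: trial.suggest_float(f"alpha_{layer}", 0.0, 20.0) for layer in (11, 15, 19)}
    # evaluate_steering is a placeholder: steer the model with these coefficients,
    # generate answers, and return the LLM-judge harmonic mean plus surprise and rep3.
    harmonic_mean, surprise, rep3 = evaluate_steering(alphas)
    if harmonic_mean > 0:
        return -harmonic_mean                 # any non-zero harmonic mean beats the fallback
    return abs(surprise - S0) + K * rep3      # auxiliary cost when the judge score is zero

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
print(study.best_params)
```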
 
485
 
486
  <HtmlEmbed src="d3-evaluation3-multi.html" data="evaluation_summary.json" />
487
 
488
+ As we can see on the chart, steering 2 or even 8 features simultaneously leads to **only marginal improvements** compared to steering only one feature. Although fluency and instruction following are improved, concept inclusion slightly decreases, leading to a harmonic mean that is only marginally better than single-layer steering.
489
 
490
+ This can be explained by the fact that instruction following and fluency are generally correlated, so improving one tends to improve the other. Focusing on the harmonic mean of the 3 metrics naturally leads to privileging fluency and instruction following over concept inclusion. Another possible explanation comes from the fact that we observed the concept inclusion LLM judge to be quite harsh and literal: sometimes a mention of Paris or of a large metal structure was not considered a valid reference to the Eiffel Tower, which could explain the low concept inclusion scores.
491
 
492
+ Overall, **those disappointing results contradict our initial hypothesis that steering multiple complementary features would help better represent the concept and maintain fluency**.
493
 
494
+ One possible explanation is our inability to find the true optimum, as the harmonic mean metric is very noisy and hard to optimize in the high-dimensional space.
495
+
496
+ Another plausible explanation could be that the selected features are actually redundant rather than complementary, and that steering one of them is sufficient to fully activate the concept. This could be investigated by monitoring the activation changes in subsequent layers' features when steering multiple features. For instance, for the features located on layers 15 and 19, anecdotal evidence from Neuronpedia's top activating examples for both features reveals several common prompts, suggesting redundancy rather than complementarity.
497
+
498
+ <Note title="More features don't necessarily mean better steering." variant="success">
499
+ Counterintuitively, steering multiple "Eiffel Tower" features at once yielded only marginal benefits over steering a single, well-chosen feature. This challenges the hypothesis that combining features leads to more robust control.
500
+ </Note>
501
 
502
  ## 6. Conclusion & Discussion
503
 
app/src/content/bibliography.bib CHANGED
@@ -163,4 +163,5 @@
163
  author={Gao, Leo and la Tour, Tom Dupr{\'e} and Tillman, Henk and Goh, Gabriel and Troll, Rajan and Radford, Alec and Sutskever, Ilya and Leike, Jan and Wu, Jeffrey},
164
  journal={arXiv preprint arXiv:2406.04093},
165
  year={2024}
166
- }
 
 
163
  author={Gao, Leo and la Tour, Tom Dupr{\'e} and Tillman, Henk and Goh, Gabriel and Troll, Rajan and Radford, Alec and Sutskever, Ilya and Leike, Jan and Wu, Jeffrey},
164
  journal={arXiv preprint arXiv:2406.04093},
165
  year={2024}
166
+ }
167
+