Improving text and notes
- app/src/content/article.mdx +47 -25
- app/src/content/bibliography.bib +2 -1
app/src/content/article.mdx
CHANGED
|
@@ -45,19 +45,19 @@ import ggc_snowhite from './assets/image/golden_gate_claude_snowhite.jpeg'
|
|
| 45 |
<Image src={ggc_snowhite} alt="One of the many examples of Golden Gate Claude conversations"
|
| 46 |
caption='One of the many examples of Golden Gate Claude conversations <a target="_blank" href="https://x.com/JE_Colors1/status/1793747959831843233">Source</a>' />
|
| 47 |
|
| 48 |
-
Since then, sparse autoencoders (SAEs) have become one of the key tools in the field of mechanistic interpretability [@cunningham2023sparse; @lieberum2024gemma; @gao2024scaling] and steering activations sparked the interest of many
|
| 49 |
|
| 50 |
-
However, as far as I know, **nobody has tried to reproduce something similar to the Golden Gate Claude demo
|
| 51 |
|
| 52 |
The aim of this article is to investigate how SAEs can be used to reproduce **a demo similar to Golden Gate Claude, but using a lightweight open-source model**. For this we used *Llama 3.1 8B Instruct*, but since I live in Paris...let’s make it obsessed with the Eiffel Tower!
|
| 53 |
|
| 54 |
-
By doing this, we will realize that steering a model with vectors
|
| 55 |
|
| 56 |
**Our main findings:**
|
| 57 |
<Note title="" variant="success">
|
| 58 |
-
- **The steering 'sweet spot' is
|
| 59 |
- **Clamping is more effective than adding.** We found that clamping activations (capping them at a maximum value) improves concept inclusion without harming fluency. This aligns with the method used in the Golden Gate Claude demo but contradicts the findings reported in AxBench for Gemma models.
|
| 60 |
-
- **More features don't necessarily mean better steering.** Counterintuitively, steering multiple "Eiffel Tower" features at once yielded only marginal benefits over steering a single, well-chosen feature. This challenges the hypothesis that combining features
|
| 61 |
- **SAE steering shows promise, but prompting is still king.** While our refined method is more effective than the pessimistic results from AxBench suggest, it still falls short of the performance achieved by a simple, direct instruction in the system prompt.
|
| 62 |
</Note>
|
| 63 |
|
|
@@ -76,7 +76,7 @@ Steering a model consists in modifying its internal activations *during generati
|
|
| 76 |
This differs from fine-tuning, which consists in modifying the weights of a base model during a training phase to obtain a new model with the desired behavior.
|
| 77 |
|
| 78 |
Most of the time, steering involves adding a vector to the internal activations at a given layer, either on the residual stream or on the output of the attention or MLP blocks.
|
| 79 |
-
More specifically, if $x^l$ is the vector of activation at layer $l$, steering consists in adding a vector $v$ that is
|
| 80 |
$$
|
| 81 |
x^l \to x^l + \alpha v.
|
| 82 |
$$
|
|
@@ -110,9 +110,9 @@ Thanks to the search interface on Neuronpedia, we can look for candidate feature
|
|
| 110 |
According to analysis by Anthropic in their [Biology of LLMs paper, section 13](https://transformer-circuits.pub/2025/attribution-graphs/biology.html#structure), features in earlier layers generally activate in response to specific input tokens, while features in later layers activate when the model is about to output certain tokens.
|
| 111 |
So the common wisdom is that **steering is more efficient when done in middle layers**, as the associated features are believed to be representing higher-level abstract concepts.
|
| 112 |
Anthropic mentioned that for their Golden Gate demo, they used a feature located in a middle layer, but they didn’t disclose which one exactly since their architecture is not public.
|
| 113 |
-
Since Llama 3.1 8B has 32 layers, we decided to look at layer 15. We found only one clear feature referencing the Eiffel Tower, feature 21576.
|
| 114 |
|
| 115 |
-
|
|
|
|
| 116 |
|
| 117 |
<iframe src="https://www.neuronpedia.org/llama3.1-8b-it/15-resid-post-aa/21576?embed=true&embedexplanation=true&embedplots=true&embedtest=true" title="Neuronpedia" style="height: 900px; width: 100%;"></iframe>
|
| 118 |
|
|
@@ -131,15 +131,14 @@ However, things are not as clear with a different input. With a more open prompt
|
|
| 131 |
<HtmlEmbed src="d3-first-experiments.html" data="first_experiments.csv" />
|
| 132 |
|
| 133 |
|
| 134 |
-
In their own paper, Anthropic mentioned using values ranging from **5 to 10 times the maximum observed activation**. In our case, the maximum observed activation is 4.77, so that would mean using values between about 25 and 50. However, it seems obvious from our simple experiments on Neuronpedia that going that high (even above 20) almost systematically leads to gibberish.
|
| 135 |
|
| 136 |
It seems that (at least with a small open-source model) **steering with SAEs is harder than we might have thought**.
|
| 137 |
|
| 138 |
|
| 139 |
-
|
| 140 |
### 1.3 The AxBench paper
|
| 141 |
|
| 142 |
-
Indeed, in January 2025, the
|
| 143 |
Using Gemmascope (SAEs trained on Gemma 2B and 9B), they found that it is almost impossible to steer the model in such a way that it consistently references the target concept, while simultaneously maintaining fluency and instruction following behavior.
|
| 144 |
|
| 145 |
To quote their conclusion:
|
|
@@ -165,7 +164,7 @@ However, for this, we will need rigorous metrics to evaluate the quality of our
|
|
| 165 |
|
| 166 |
To assess the quality of a steered model such as our *Eiffel Tower Llama*, we cannot rely solely on our subjective feelings.
|
| 167 |
Especially since we will have to choose a good value for steering strength, we need some metrics for evaluation.
|
| 168 |
-
First, let's not reinvent the wheel and use the same metrics as AxBench.
|
| 169 |
|
| 170 |
### 2.1 The AxBench LLM-judge metrics
|
| 171 |
|
|
@@ -231,7 +230,7 @@ Because of this, we considered **auxiliary metrics that could help us monitor th
|
|
| 231 |
|
| 232 |
#### 2.3.1 Surprise within the reference model
|
| 233 |
|
| 234 |
-
Since we want our steered model to output answers that are funny and surprising, we expect those answers to have had a low probability in the reference model
|
| 235 |
For that we decided to monitor the negative log probability (per token) under the reference model, which represents the surprise in the reference model. (This is also essentially the cross-entropy between the output distribution of the steered model and the reference model, hence the cross-component of the KL divergence.)
|
| 236 |
|
| 237 |
Although the negative log prob seems an interesting metric to monitor, note that we don't necessarily want to bring it to extreme values. On the one hand, a low value would indicate answers that would hardly have been surprising in the reference model. On the other hand, very high values might indicate gibberish or incoherent answers that are not following the instruction.
|
|
@@ -286,7 +285,7 @@ For our model Llama 3.1 8B Instruct, this is shown below for a typical prompt (t
|
|
| 286 |
|
| 287 |
import activations_magnitude from './assets/image/activations_magnitude.png'
|
| 288 |
|
| 289 |
-
<Image src={activations_magnitude} alt="Activation norms grow roughly linearly with layer depth, suggesting steering coefficients should scale proportionally" caption="Activation norms grow roughly linearly with layer depth, suggesting steering coefficients should scale proportionally" />
|
| 290 |
|
| 291 |
As we can see, activation norms roughly grow linearly across layers, with a norm being approximately equal to the layer index.
|
| 292 |
If we want to look for a steering coefficient that is typically less than the original activation vector norm at layer $l$,
|
|
@@ -338,6 +337,10 @@ This conclusion is in line with the results from AxBench showing that steering w
|
|
| 338 |
|
| 339 |
Note that the harmonic mean we obtained here (about 0.45) is higher than the one reported in AxBench (about 0.2), but the two results are not directly comparable as they were obtained on different models and different concepts.
|
| 340 |
|
| 341 |
### 3.4 Detailed evaluation for the best steering coefficient
|
| 342 |
|
| 343 |
Using the optimal steering coefficient $\alpha=8.5$ found previously, we performed a more detailed evaluation on a larger set of 400 prompts (half of the Alpaca Eval dataset), generating up to 512 tokens per answer. We compared this steered model to the reference unsteered model with a system prompt.
|
|
@@ -348,8 +351,12 @@ We can see that on all metrics, **the baseline prompted model significantly outp
|
|
| 348 |
|
| 349 |
Overall, the harmonic mean of the three LLM-judge metrics is 1.67 for the prompted model, against 0.344 for the steered model.
|
| 350 |
|
| 351 |
-
<Note type="info">
|
| 352 |
-
As can be seen on the bar chart, the fact that the evaluation is noisy leads to frighteningly large error bars, especially for the LLM-judge metrics and the harmonic mean. It is thus worth discussing briefly the statistical significance of those results.
|
| 353 |
</Note>
|
| 354 |
|
| 355 |
### 3.5 Correlations between metrics
|
|
@@ -398,10 +405,14 @@ We can see that **clamping has a positive effect on concept inclusion (both from
|
|
| 398 |
|
| 399 |
We therefore opted for clamping, in line with the choice made by Anthropic. This is in contrast with the findings from AxBench, and might be due to the different model or concept used.
|
| 400 |
|
| 401 |
### 4.2 Generation parameters
|
| 402 |
|
| 403 |
We have seen that repetition is a major cause of loss of fluency when steering with SAEs.
|
| 404 |
-
To mitigate this, we tried applying a lower temperature, and apply a repetition penalty during generation.
|
| 405 |
This is a simple technique that consists of penalizing the logit of tokens that have already been generated, preventing the model from repeating itself.
|
| 406 |
We used a penalty factor of 1.1 via the `repetition_penalty` parameter of the generation process in 🤗 Transformers (which implements the repetition penalty described in the [CTRL paper](https://arxiv.org/abs/1909.05858)).
|
| 407 |
|
|
@@ -409,6 +420,9 @@ As we can see, applying a repetition penalty reduces as expected the 3-gram repe
|
|
| 409 |
|
| 410 |
(Note that the AxBench paper mentioned the repetition penalty but without using it, considering it as *"not the fairest setting, as it often does not accurately resemble normal user behaviour"*, see their appendix K)
|
| 411 |
|
| 412 |
|
| 413 |
|
| 414 |
## 5. Multi-Layer optimization
|
|
@@ -429,18 +443,18 @@ Among those 19 features, we selected all the features located in the intermediat
|
|
| 429 |
|
| 430 |
### 5.2 Optimization methodology
|
| 431 |
|
| 432 |
-
Finding the optimal steering coefficients for multiple features is a challenging optimization problem
|
| 433 |
-
First, the parameter space grows with the number of features, making grid search
|
| 434 |
-
Second, the target function (the harmonic mean of LLM-judge metrics) is noisy and non-differentiable, making gradient-based optimization impossible.
|
| 435 |
-
Finally, evaluating the target function is costly, as it requires generating answers from the steered model and evaluating them with an LLM judge.
|
| 436 |
|
| 437 |
-
To tackle those challenges, we decided to rely on **Bayesian optimization** to search for the optimal steering coefficients, and we devised an auxiliary cost function to guide the optimization when the harmonic mean is zero.
|
| 438 |
|
| 439 |
#### 5.2.1 Cost function
|
| 440 |
|
| 441 |
Following the AxBench paper, we decided to look for steering coefficients that would maximize the harmonic mean of the three LLM-judge metrics. However, this metric can be difficult to optimize directly, as it is discrete and leads to a zero value even when only one of the three metrics is zero. This might make it hard to explore the parameter space.
|
| 442 |
|
| 443 |
-
To mitigate that, we decided to define an auxiliary cost function that would be used when the harmonic mean is zero. Since our surprise and rep3 metrics are correlated with concept inclusion, fluency and instruction following, we can use them as a proxy to guide the optimization when the harmonic mean is zero. We considered an auxiliary cost function of the form
|
| 444 |
$$
|
| 445 |
\mathrm{cost} = |\mathrm{surprise} - s_0| + k\ \text{rep3}
|
| 446 |
$$
|
|
@@ -471,11 +485,19 @@ Results are shown below and compared to single-layer steering.
|
|
| 471 |
|
| 472 |
<HtmlEmbed src="d3-evaluation3-multi.html" data="evaluation_summary.json" />
|
| 473 |
|
| 474 |
-
As we can see on the chart, steering 2 or even 8 features simultaneously leads to **only marginal improvements** compared to steering only one feature. Although fluency and instruction following are improved, concept inclusion slightly decreases, leading to a harmonic mean that is only marginally better than single-layer steering.
|
| 475 |
|
| 476 |
-
|
| 477 |
|
| 478 |
|
| 479 |
|
| 480 |
## 6. Conclusion & Discussion
|
| 481 |
|
|
|
|
| 45 |
<Image src={ggc_snowhite} alt="One of the many examples of Golden Gate Claude conversations"
|
| 46 |
caption='One of the many examples of Golden Gate Claude conversations <a target="_blank" href="https://x.com/JE_Colors1/status/1793747959831843233">Source</a>' />
|
| 47 |
|
| 48 |
+
Since then, sparse autoencoders (SAEs) have become one of the key tools in the field of mechanistic interpretability [@cunningham2023sparse; @lieberum2024gemma; @gao2024scaling], and steering activations has sparked the interest of many. See for instance [the value of steering](https://thezvi.substack.com/i/144959102/the-value-of-steering) by Zvi Mowshowitz, or [Feature Steering for Reliable and Expressive AI Engineering](https://www.goodfire.ai/blog/feature-steering-for-reliable-and-expressive-ai-engineering) by GoodFire AI.
|
| 49 |
|
| 50 |
+
However, as far as I know, **nobody has tried to reproduce something similar to the Golden Gate Claude demo.** Moreover, the recent AxBench paper [@wu2025axbench] found that steering with SAEs was *one of the least effective methods to steer a model toward a desired concept*. How can we reconcile this with the success of Golden Gate Claude?
|
| 51 |
|
| 52 |
The aim of this article is to investigate how SAEs can be used to reproduce **a demo similar to Golden Gate Claude, but using a lightweight open-source model**. For this we used *Llama 3.1 8B Instruct*, but since I live in Paris...let’s make it obsessed with the Eiffel Tower!
|
| 53 |
|
| 54 |
+
By doing this, we will realize that steering a model with activation vectors extracted from SAEs is actually harder than we might have thought. However, we will devise several improvements over naive steering. While we focus on a single, concrete example — the Eiffel Tower — our goal is to establish a methodology for systematically evaluating and optimizing SAE steering, which could then be applied to other models and concepts.
|
| 55 |
|
| 56 |
**Our main findings:**
|
| 57 |
<Note title="" variant="success">
|
| 58 |
+
- **The steering 'sweet spot' is small.** The optimal steering strength is of the order of half the magnitude of a layer's typical activation. This is consistent with the idea that steering vectors should not overwhelm the model's natural activations.
|
| 59 |
- **Clamping is more effective than adding.** We found that clamping activations (capping them at a maximum value) improves concept inclusion without harming fluency. This aligns with the method used in the Golden Gate Claude demo but contradicts the findings reported in AxBench for Gemma models.
|
| 60 |
+
- **More features don't necessarily mean better steering.** Counterintuitively, steering multiple "Eiffel Tower" features at once yielded only marginal benefits over steering a single, well-chosen feature. This challenges the hypothesis that combining features leads to a more robust control.
|
| 61 |
- **SAE steering shows promise, but prompting is still king.** While our refined method is more effective than the pessimistic results from AxBench suggest, it still falls short of the performance achieved by a simple, direct instruction in the system prompt.
|
| 62 |
</Note>
|
| 63 |
|
|
|
|
| 76 |
This differs from fine-tuning, which consists in modifying the weights of a base model during a training phase to obtain a new model with the desired behavior.
|
| 77 |
|
| 78 |
Most of the time, steering involves adding a vector to the internal activations at a given layer, either on the residual stream or on the output of the attention or MLP blocks.
|
| 79 |
+
More specifically, if $x^l$ is the vector of activations at layer $l$, steering consists in adding a vector $v$ scaled by a coefficient $\alpha$,
|
| 80 |
$$
|
| 81 |
x^l \to x^l + \alpha v.
|
| 82 |
$$
|
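In code, this intervention amounts to a simple forward hook on the chosen decoder layer. Below is a minimal sketch (not the exact implementation used here) with 🤗 Transformers; the layer index, the coefficient `alpha`, and the placeholder direction `v` (which in practice would be the SAE decoder row of the chosen feature) are illustrative.

```python
# Minimal sketch of additive steering x^l -> x^l + alpha * v via a forward hook.
# Assumptions: layer index, alpha and the placeholder vector v are illustrative;
# in practice v would be the SAE decoder direction of the chosen feature.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

layer_idx, alpha = 15, 8.5
v = torch.randn(model.config.hidden_size)  # placeholder steering direction

def steering_hook(module, inputs, output):
    # Llama decoder layers return a tuple whose first element is the residual-stream hidden states
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * v.to(hidden.device, hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[layer_idx].register_forward_hook(steering_hook)

messages = [{"role": "user", "content": "Give me three ideas for a weekend trip."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=128)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))

handle.remove()  # restore the unsteered model
```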
|
|
|
| 110 |
According to analysis by Anthropic in their [Biology of LLMs paper, section 13](https://transformer-circuits.pub/2025/attribution-graphs/biology.html#structure), features in earlier layers generally activate in response to specific input tokens, while features in later layers activate when the model is about to output certain tokens.
|
| 111 |
So the common wisdom is that **steering is more efficient when done in middle layers**, as the associated features are believed to be representing higher-level abstract concepts.
|
| 112 |
Anthropic mentioned that for their Golden Gate demo, they used a feature located in a middle layer, but they didn’t disclose which one exactly since their architecture is not public.
|
|
|
|
| 113 |
|
| 114 |
+
|
| 115 |
+
Since Llama 3.1 8B has 32 layers, we decided to look at layer 15. In the SAE data published on Neuronpedia by Andi Arditi, we found only one clear feature referencing the Eiffel Tower, feature 21576. The corresponding Neuronpedia page is included below. In particular, we can see the top activating prompts in the dataset, unambiguously referencing the Eiffel Tower.
|
| 116 |
|
| 117 |
<iframe src="https://www.neuronpedia.org/llama3.1-8b-it/15-resid-post-aa/21576?embed=true&embedexplanation=true&embedplots=true&embedtest=true" title="Neuronpedia" style="height: 900px; width: 100%;"></iframe>
|
| 118 |
|
|
|
|
| 131 |
<HtmlEmbed src="d3-first-experiments.html" data="first_experiments.csv" />
|
| 132 |
|
| 133 |
|
| 134 |
+
In their own paper, Anthropic mentioned using values ranging from **5 to 10 times the maximum observed activation**. In our case, the maximum observed activation is 4.77, so that would mean using values between about 25 and 50. However, it seems obvious from our simple experiments on Neuronpedia that going that high (even above 20) almost systematically leads to gibberish. It is not clear to us why Anthropic could use such high values without breaking the model's generation.
|
| 135 |
|
| 136 |
It seems that (at least with a small open-source model) **steering with SAEs is harder than we might have thought**.
|
| 137 |
|
| 138 |
|
|
|
|
| 139 |
### 1.3 The AxBench paper
|
| 140 |
|
| 141 |
+
Indeed, in January 2025, the AxBench paper [@wu2025axbench] benchmarked several steering procedures and found using SAEs to be one of the least effective methods.
|
| 142 |
Using Gemmascope (SAEs trained on Gemma 2B and 9B), they found that it is almost impossible to steer the model in such a way that it consistently references the target concept, while simultaneously maintaining fluency and instruction following behavior.
|
| 143 |
|
| 144 |
To quote their conclusion:
|
|
|
|
| 164 |
|
| 165 |
To assess the quality of a steered model such as our *Eiffel Tower Llama*, we cannot rely solely on our subjective feelings.
|
| 166 |
Especially since we will have to choose a good value for steering strength, we need some metrics for evaluation.
|
| 167 |
+
First, let's not reinvent the wheel, and use the same metrics as AxBench.
|
| 168 |
|
| 169 |
### 2.1 The AxBench LLM-judge metrics
|
| 170 |
|
|
|
|
| 230 |
|
| 231 |
#### 2.3.1 Surprise within the reference model
|
| 232 |
|
| 233 |
+
Since we want our steered model to output answers that are funny and surprising, we expect those answers to have had *a low probability in the reference model*.
|
| 234 |
For that we decided to monitor the negative log probability (per token) under the reference model, which represents the surprise in the reference model. (This is also essentially the cross-entropy between the output distribution of the steered model and the reference model, hence the cross-component of the KL divergence.)
|
| 235 |
|
| 236 |
Although the negative log prob seems an interesting metric to monitor, note that we don't necessarily want to bring it to extreme values. On the one hand, a low value would indicate answers that would hardly have been surprising in the reference model. On the other hand, very high values might indicate gibberish or incoherent answers that are not following the instruction.
|
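As a rough sketch of how this can be computed (function and variable names are illustrative; the evaluation code used here may differ), the per-token surprise is simply the mean negative log-probability that the *reference* model assigns to the answer produced by the steered model:

```python
import torch
import torch.nn.functional as F

def surprise_per_token(ref_model, input_ids, answer_start):
    """Mean NLL (nats/token) of the answer tokens under the unsteered reference model.

    input_ids: (1, seq) prompt followed by the steered model's answer;
    answer_start: index of the first answer token in input_ids.
    """
    with torch.no_grad():
        logits = ref_model(input_ids).logits            # (1, seq, vocab)
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)    # position t predicts token t+1
    targets = input_ids[:, 1:]
    token_logprob = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return -token_logprob[:, answer_start - 1:].mean().item()
```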
|
|
|
| 285 |
|
| 286 |
import activations_magnitude from './assets/image/activations_magnitude.png'
|
| 287 |
|
| 288 |
+
<Image src={activations_magnitude} alt="Left: Activation norm per token for each of the 32 layers. Right: Average activation norm on a given layer. Average norms grow roughly linearly with layer depth, suggesting steering coefficients should scale proportionally" caption="Left: Activation norm per token for each of the 32 layers. Right: Average activation norm on a given layer. Average norms grow roughly linearly with layer depth, suggesting steering coefficients should scale proportionally" />
|
| 289 |
|
| 290 |
As we can see, activation norms roughly grow linearly across layers, with a norm being approximately equal to the layer index.
|
| 291 |
If we want to look for a steering coefficient that is typically less than the original activation vector norm at layer $l$,
|
|
|
|
| 337 |
|
| 338 |
Note that the harmonic mean we obtained here (about 0.45) is higher than the one reported in AxBench (about 0.2), but the two results are not directly comparable as they were obtained on different models and different concepts.
|
| 339 |
|
| 340 |
+
<Note title="The steering 'sweet spot' is small." variant="success">
|
| 341 |
+
The optimal steering strength is of the order of half the magnitude of a layer's typical activation. This is consistent with the idea that steering vectors should not overwhelm the model's natural activations. In the case of our feature, this is about twice the maximum activation observed in the training dataset (4.77).
|
| 342 |
+
</Note>
|
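To make the orders of magnitude concrete with the numbers quoted in this article (typical layer-15 activation norm of about 15, maximum observed feature activation 4.77, and the optimal coefficient $\alpha = 8.5$):

$$
\alpha = 8.5 \;\approx\; 0.57\,\lVert x^{15} \rVert \;\approx\; 1.8 \times a_{\max}.
$$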
| 343 |
+
|
| 344 |
### 3.4 Detailed evaluation for the best steering coefficient
|
| 345 |
|
| 346 |
Using the optimal steering coefficient $\alpha=8.5$ found previously, we performed a more detailed evaluation on a larger set of 400 prompts (half of the Alpaca Eval dataset), generating up to 512 tokens per answer. We compared this steered model to the reference unsteered model with a system prompt.
|
|
|
|
| 351 |
|
| 352 |
Overall, the harmonic mean of the three LLM-judge metrics is 1.67 for the prompted model, against 0.344 for the steered model.
|
| 353 |
|
| 354 |
+
<Note title="A word on statistical significance" type="info">
|
| 355 |
+
As can be seen on the bar chart, the fact that the evaluation is noisy leads to frighteningly large error bars, especially for the LLM-judge metrics and the harmonic mean. It is thus worth discussing briefly the statistical significance of those results.
|
| 356 |
+
|
| 357 |
+
The relevant quantity is the *effect size*, i.e. the difference between the two means divided by the pooled standard deviation, also known as *Cohen's d*. In general, for a two-sample t-test with a total of $N$ samples across both groups, the critical effect size to reach significance at level $p < 0.05$ is $d_c = 1.96 \times 2/\sqrt{N}$.
|
| 358 |
+
|
| 359 |
+
In our case, with $400$ samples per group ($N=800$ total), this leads to a critical effect size of $0.14$. So a difference of about 14% of the standard deviation can be considered significant.
|
| 360 |
</Note>
|
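Spelling out the arithmetic behind the 0.14 threshold above (two equal groups of $n = 400$, so $N = 800$):

$$
d_c = 1.96\,\sqrt{\tfrac{1}{n_1} + \tfrac{1}{n_2}} = 1.96\,\sqrt{\tfrac{2}{400}} = 1.96 \times \tfrac{2}{\sqrt{800}} \approx 0.14 .
$$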
| 361 |
|
| 362 |
### 3.5 Correlations between metrics
|
|
|
|
| 405 |
|
| 406 |
We therefore opted for clamping, in line with the choice made by Anthropic. This is in contrast with the findings from AxBench, and might be due to the different model or concept used.
|
| 407 |
|
| 408 |
+
<Note title="Clamping is more effective than adding." variant="success">
|
| 409 |
+
We found that clamping activations (capping them at a maximum value) improves concept inclusion without harming fluency. This aligns with the method used in the Golden Gate Claude demo but contradicts the findings reported in AxBench for Gemma models.
|
| 410 |
+
</Note>
|
| 411 |
+
|
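A rough sketch of the difference between the two interventions, in the same hook style as before (the clamp value and the feature-activation computation are placeholders, not the exact code used here; a common implementation of clamping pins the feature's activation at the target value):

```python
# Additive steering: unconditionally shift the residual stream along v.
def add_steering(hidden, v, alpha):
    return hidden + alpha * v

# Clamping: force the feature's activation to a fixed value, i.e. only add
# the *missing* amount along v (the Golden Gate Claude style intervention).
def clamp_steering(hidden, v, feature_activation, clamp_value):
    # feature_activation: (batch, seq) current SAE activation of the feature,
    # obtained from the SAE encoder on `hidden` (placeholder, not shown here)
    delta = clamp_value - feature_activation      # how far we are from the target value
    return hidden + delta.unsqueeze(-1) * v       # broadcast over the hidden dimension
```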
| 412 |
### 4.2 Generation parameters
|
| 413 |
|
| 414 |
We have seen that repetition is a major cause of loss of fluency when steering with SAEs.
|
| 415 |
+
To mitigate this, we tried applying a lower temperature (0.5) and a repetition penalty during generation.
|
| 416 |
This is a simple technique that consists of penalizing the logit of tokens that have already been generated, preventing the model from repeating itself.
|
| 417 |
We used a penalty factor of 1.1 via the `repetition_penalty` parameter of the generation process in 🤗 Transformers (which implements the repetition penalty described in the [CTRL paper](https://arxiv.org/abs/1909.05858)).
|
| 418 |
|
|
|
|
| 420 |
|
| 421 |
(Note that the AxBench paper mentioned the repetition penalty but without using it, considering it as *"not the fairest setting, as it often does not accurately resemble normal user behaviour"*, see their appendix K)
|
| 422 |
|
| 423 |
+
<Note title="Lower temperature and repetition penalty improve model fluency and instruction following" variant="success">
|
| 424 |
+
Using a lower temperature (0.5) and applying a modest repetition penalty during generation significantly reduces repetitions in the output. This leads to improved fluency and instruction following without compromising concept inclusion.
|
| 425 |
+
</Note>
|
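Concretely, with 🤗 Transformers this is just a matter of passing the sampling arguments to `generate` (a usage sketch reusing the `model`, `tok` and `inputs` objects from the earlier hook example):

```python
out = model.generate(
    inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.5,          # lower temperature, as discussed above
    repetition_penalty=1.1,   # CTRL-style penalty on already-generated tokens
)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```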
| 426 |
|
| 427 |
|
| 428 |
## 5. Multi-Layer optimization
|
|
|
|
| 443 |
|
| 444 |
### 5.2 Optimization methodology
|
| 445 |
|
| 446 |
+
Finding the optimal steering coefficients for multiple features is a challenging optimization problem:
|
| 447 |
+
- First, the parameter space grows with the number of features, making grid search quickly intractable.
|
| 448 |
+
- Second, the target function (the harmonic mean of LLM-judge metrics) is noisy and non-differentiable, making gradient-based optimization impossible.
|
| 449 |
+
- Finally, evaluating the target function is costly, as it requires generating answers from the steered model and evaluating them with an LLM judge.
|
| 450 |
|
| 451 |
+
To tackle those challenges, we decided to rely on **Bayesian optimization** to search for the optimal steering coefficients, and we devised an auxiliary cost function to guide the optimization when the harmonic mean is zero and hence non-informative.
|
| 452 |
|
| 453 |
#### 5.2.1 Cost function
|
| 454 |
|
| 455 |
Following the AxBench paper, we decided to look for steering coefficients that would maximize the harmonic mean of the three LLM-judge metrics. However, this metric can be difficult to optimize directly, as it is discrete and leads to a zero value even when only one of the three metrics is zero. This might make it hard to explore the parameter space.
|
| 456 |
|
| 457 |
+
To mitigate that, we decided to define an auxiliary cost function that would be used when the harmonic mean is zero. Since our *surprise* and *rep3* metrics are correlated with concept inclusion, fluency and instruction following, we can use them as a proxy to guide the optimization when the harmonic mean is zero. We considered an auxiliary cost function of the form
|
| 458 |
$$
|
| 459 |
\mathrm{cost} = |\mathrm{surprise} - s_0| + k\ \text{rep3}
|
| 460 |
$$
|
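As a sketch of the resulting optimization loop (the choice of Optuna as the Bayesian-optimization backend, the layer list, the constants `S0` and `K`, and the helper functions are all illustrative placeholders, not the actual code used here):

```python
import optuna

LAYERS = [15, 19]        # illustrative subset of the selected Eiffel Tower features
S0, K = 3.0, 5.0         # illustrative constants of the auxiliary cost above

def objective(trial):
    alphas = {l: trial.suggest_float(f"alpha_{l}", 0.0, 20.0) for l in LAYERS}
    answers = generate_steered_answers(alphas)      # placeholder: steer + generate
    hm = harmonic_mean_of_judge_scores(answers)     # placeholder: LLM-judge metrics
    if hm > 0:
        return -hm                                  # maximize the harmonic mean
    # otherwise fall back to the auxiliary cost, which is always >= 0 and therefore
    # ranks worse than any configuration with a non-zero harmonic mean
    return abs(surprise(answers) - S0) + K * rep3(answers)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
print(study.best_params)
```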
|
|
|
| 485 |
|
| 486 |
<HtmlEmbed src="d3-evaluation3-multi.html" data="evaluation_summary.json" />
|
| 487 |
|
| 488 |
+
As we can see on the chart, steering 2 or even 8 features simultaneously leads to **only marginal improvements** compared to steering only one feature. Although fluency and instruction following are improved, concept inclusion slightly decreases, leading to a harmonic mean that is only marginally better than single-layer steering.
|
| 489 |
|
| 490 |
+
This can be explained by the fact that instruction following and fluency are generally correlated, so improving one tends to improve the other. Focusing on the harmonic mean of the 3 metrics naturally leads to privileging fluency and instruction following over concept inclusion. Another possible explanation is that the concept inclusion LLM judge is quite harsh and literal: sometimes a mention of Paris or of a large metal structure was not considered a valid reference to the Eiffel Tower, which could explain the low concept inclusion scores.
|
| 491 |
|
| 492 |
+
Overall, **those disappointing results contradict our initial hypothesis that steering multiple complementary features would help better represent the concept and maintain fluency**.
|
| 493 |
|
| 494 |
+
One possible explanation is our inability to find the true optimum, as the harmonic mean metric is very noisy and hard to optimize in the high-dimensional space.
|
| 495 |
+
|
| 496 |
+
Another plausible explanation could be that the selected features are actually redundant rather than complementary, and that steering one of them is sufficient to fully activate the concept. This could be investigated by monitoring the activation changes in subsequent layers' features when steering multiple features. For instance, for the features located on layers 15 and 19, anecdotal evidence from Neuronpedia's top activating examples reveals several prompts common to both, suggesting redundancy rather than complementarity.
|
| 497 |
+
|
| 498 |
+
<Note title="More features don't necessarily mean better steering." variant="success">
|
| 499 |
+
Counterintuitively, steering multiple "Eiffel Tower" features at once yielded only marginal benefits over steering a single, well-chosen feature. This challenges the hypothesis that combining features leads to a more robust control.
|
| 500 |
+
</Note>
|
| 501 |
|
| 502 |
## 6. Conclusion & Discussion
|
| 503 |
|
app/src/content/bibliography.bib
CHANGED
|
@@ -163,4 +163,5 @@
|
|
| 163 |
author={Gao, Leo and la Tour, Tom Dupr{\'e} and Tillman, Henk and Goh, Gabriel and Troll, Rajan and Radford, Alec and Sutskever, Ilya and Leike, Jan and Wu, Jeffrey},
|
| 164 |
journal={arXiv preprint arXiv:2406.04093},
|
| 165 |
year={2024}
|
| 166 |
-
}
|
|
| 163 |
author={Gao, Leo and la Tour, Tom Dupr{\'e} and Tillman, Henk and Goh, Gabriel and Troll, Rajan and Radford, Alec and Sutskever, Ilya and Leike, Jan and Wu, Jeffrey},
|
| 164 |
journal={arXiv preprint arXiv:2406.04093},
|
| 165 |
year={2024}
|
| 166 |
+
}
|
| 167 |
+
|