dlouapre (HF Staff) committed
Commit 5510963 · Parent(s): 74d7648

Improved images and first complete draft

app/src/content/article.mdx CHANGED
@@ -30,69 +30,80 @@ On May 2024, Anthropic released a demo called [Golden Gate Claude](https://www.a
30
  This experiment was meant to showcase the possibility of steering the behavior of a model using activation vectors obtained from Sparse Autoencoders (SAEs) trained on the internal activations of a large language model.
31
  Although this demo led to hilarious conversations that have been widely shared through social media, it was shut down after 24 hours.
32
 
33
-
34
- import ggc_snowhite from 'assets/image/golden_gate_claude_snowhite.jpeg'
35
 
36
  <Image src={ggc_snowhite} alt="Sample image with optimization" />
37
  [Source](https://x.com/JE_Colors1/status/1793747959831843233)
38
 
39
- Since then, SAEs have become one of the key tools in the field of mechanistic interpretability, but as far as we know, nobody tried to reproduce something similar the Golden Gate demo.
40
- The aim of this article is thus to show how we can use sparse autoencoders to reproduce a similar demo on a lightweight open source model : Llama 3.1 8B Instruct.
41
- But since I live in Paris, let’s make it obsessed about the Eiffel Tower !
42
 
43
- Doing this, we will realize that steering with SAEs is harder than we might have thought, and we will devise an efficient method to do so, and improves significantly on naive steering.
44
 
45
  ### Neuronpedia
46
 
47
- To experience steering a model, the best starting point is [Neuronpedia](https://www.neuronpedia.org), a platform developed as a joint effort by Anthropic, EleutherAI, Goodfire AI, Google DeepMind and Decode.
48
- Neuronpedia is made to share research results and allow the possibility to experiment and steer open source models.
49
- So it looks like a good place to try to create an « Eiffel Tower » chatbot.
50
 
51
- Using Llama 3.1 8B Instruct, and [SAEs trained by Andy Arditi](https://huggingface.co/andyrdt), we can look in Neuronpedia for features representing the Eiffel Tower.
52
- Many such features can be found in different layers (we found at least 19 candidate features ranging from layer 3 to layer 27).
53
- Since common wisdom for steering is to target middle layers, we decided to start from feature 21576 in layer 15 (Llama 3.1 8B has 32 layers.)
54
- On the training dataset, the maximum activation observed for that feature was 4.77.
 
55
 
56
- <iframe src="https://www.neuronpedia.org/llama3.1-8b-it/15-resid-post-aa/21576?embed=true&embedexplanation=true&embedplots=true&embedtest=true" title="Neuronpedia" style="height: 300px; width: 920px;"></iframe>
57
 
58
- On the Neuronpedia interface, you can steer a feature and experience a conversation with the corresponding model.
59
- However, it becomes quickly clear that finding the proper coefficient for steering is not obivious.
 
60
  Low values lead to no visible effect, while high values quickly produce repetitive gibberish.
61
- There seem to exist only a narrow spot where the model behaves as we would want, but this spots seem to depend on the nature of the prompt.
 
 
62
 
63
- For instance below, we can see that steering with coefficient 6.0 leads to a good outcome on the « Who are you? » prompt,
64
- but no effect on « Give me some ideas for starting a business ». To get mention of the Eiffel Tower with that prompt, you have to boost the steering coefficient up to 11.0. But with such a value, you get gibberish on the « Who are you? » prompt.
65
 
66
- import neuronpedia_examples from 'assets/image/neuronpedia_examples.png'
67
 
68
- <Image src={neuronpedia_examples} alt="Sample image with optimization" />
 
 
 
 
69
 
70
In their paper, Anthropic reported using values ranging from 5 to 10 times the maximum observed activation.
71
  But it seems obvious from our simple experiments on Neuronpedia that going that high (above 20.0) would systematically lead to gibberish.
72
 
73
  It seems that — at least with a small open source model — steering with SAEs is harder than we might have thought.
74
- Indeed in January 2025, the AxBench paper benchmarked several steering methods and found using SAEs as one of the worst.
 
75
Using Gemma Scope (SAEs trained on Gemma 2 2B and 9B), they found that it is almost impossible to steer the model in such a way that it simultaneously references the target concept while maintaining fluency and instruction-following behavior.
76
 
 
77
 
78
 
79
  ## Defining metrics
80
 
81
- To judge the quality of a steered model like The Eiffel Tower Llama, we cannot only really on our subjective feelings. Especially, since we will have to optimize for steering strength, we need some metrics for evaluation.
 
82
 
83
  ### AxBench
84
 
85
- The [AxBench paper](https://arxiv.org/abs/2501.17148) proposed to judge the performance of a steering technique using an LLM judge, in charge of rating the output of the model along three criteria : concept inclusion, instruction following, and fluency.
 
86
 
87
- For that, they prompted gpt4o-mini to act as a judge and assess independently whether the provided answer:
 
 
 
88
 
89
- - was referencing the steered concept (in our case, the Eiffel Tower)
90
- - was a reasonable answer to the instruction
91
- - exhibited a high level of fluency.
 
92
 
93
- For each task, they prompted an LLM which was instructed to reason over the case and provide a grade between 0, 1 and 2. We decided to use an identical approach using the more recent get-ups. Below we can see an example of the prompt in the case of concept inclusion
94
-
95
- ```
96
  [System]
97
  Please act as an impartial judge and evaluate whether the specified concept is presented in the text fragment provided below.
98
  Focus solely on whether the concept is clearly incorporated, without regard for grammar or logical coherence.
@@ -107,11 +118,15 @@ Rate the concept’s relevance on a scale from 0 to 2, where 0 indicates the con
107
  [Text Fragment End]
108
  ```
109
 
110
- Similar prompts are used for fluency and instruction following, leading to three LLM-judge metrics.
111
 
112
  ### Prompts
113
 
114
- For reproducibility and robustness, we conduced every evaluation on multiple prompts and multiple samples (with T=0.5). For that, we curated a list of 25 conversational prompts, that were prone to elicit the desired behavior. Example of such prompts are
 
 
 
 
115
 
116
  *"Hi ! Who are you ? Tell me more about yourself and what excites you in life."*
117
 
@@ -121,74 +136,189 @@ For reproducibility and robustness, we conduced every evaluation on multiple pro
121
 
122
  *"Give me a short pitch for a science fiction movie."*
123
 
124
- The idea with this was to start from a diverse set of prompts, while being representative of the intended use of the steered model. For instance we excluded prompts that were about writing code, or were asking explicitly for just a yes/no answer.
 
125
 
126
- Importantly, we decided to use **no system prompt**. This was also the choice of steering on Neuronpedia, and we want to show that the results we obtained are not dependent on the choice of a particular system prompt.
 
 
 
127
 
128
  ### Quantitative metrics
129
 
130
- Although LLM-judge metrics provide a recognized assessment of the quality of the answers, we also wanted to consider auxiliary metrics that could be used for optimization.
131
 
132
- #### Minus log prob
133
 
134
- Since we want our steered model to output answers that are funny and surprising, we expect those answer to have had a low probability in the reference model. We then decided to monitor the (minus) log probability (per token) under the reference model. The exp of this metric is the perplexity under the reference model, quantifying the number of bits of surprise the reference model would experience. This is also related to (the cross component of) the KL divergence between the output distribution of the steered model and the reference model.
 
 
135
 
136
- Note however that we didn’t have an a priori on a suitable value. On one hand, a low value would indicate answers that would have hardly been surprising in the reference model, while high values might indicate gibberish or incoherent answers.
137
 
138
- #### n-gram Repetition
 
 
139
 
140
- On top of that, since steering too hard might induce repetitive gibberish, we measured the fraction of unique 3-grams in the answers. For short answers, values above 0.15 generally tends to correspond to annoying repetitions that imparts the fluency of the answer.
141
 
142
- #### Explicit concept inclusion
 
 
 
 
 
143
 
144
- Finally, and as an objective auxiliary metric to monitor, we simply looked for the occurence of the word « eiffel » in the answer (case-insensitive).
145
 
 
 
 
146
 
147
 
148
  ## Sweeping steering coefficients
149
 
150
- The naive steering scheme involves adding a steering vector multiplied by a coefficient $\alpha$ to the activations.
151
- We thus have to choose a suitable value for $\alpha$.
152
- To do so, we performed a sweep over a range of values for $\alpha$ and evaluated the resulting model using the metrics described in the previous section.
153
- We used the same set of 25 prompts as before, and generated 4 samples per prompt.
154
 
155
- ### 1D sweeps
156
 
157
- import sweep_1D_analysis from 'assets/image/sweep_1D_analysis.png'
 
 
 
158
 
159
  <Image src={sweep_1D_analysis} alt="1D sweep of steering coefficient" caption="1D sweep of steering coefficient $\alpha$ for a single steering vector." />
160
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
161
  ### Correlations between metrics
162
 
163
- import metrics_correlation from 'assets/image/metrics_correlation_matrix.png'
 
 
164
 
165
  <Image src={metrics_correlation} alt="Correlation matrix between metrics" caption="Correlation matrix between metrics." />
166
 
 
 
 
 
 
 
 
 
 
 
 
 
 
167
 
 
 
168
 
169
  ## Improvements
170
 
 
 
 
 
 
171
  ### Clamping
172
 
 
 
 
 
 
 
 
 
173
  ### Repetition penalty
174
 
 
 
 
 
 
 
 
 
 
 
 
 
 
175
  ## Multi-Layer optimization
176
 
 
 
 
 
 
177
  ### Layer selection
178
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
179
  ### Bayesian optimization
180
- Motivation : noise, high dimension, no gradient, blackbox, expensive function
181
 
182
- Choice of cost function
 
 
 
 
183
 
184
  ### Gradient descent
185
 
186
- mu and sigma
187
 
188
- Choice of solution, beta
189
 
190
  ### Results
191
 
 
 
 
 
 
 
 
 
192
 
193
  ## Discussion
194
 
 
30
  This experiment was meant to showcase the possibility of steering the behavior of a model using activation vectors obtained from Sparse Autoencoders (SAEs) trained on the internal activations of a large language model.
31
  Although this demo led to hilarious conversations that have been widely shared through social media, it was shut down after 24 hours.
32
 
33
+ import ggc_snowhite from './assets/image/golden_gate_claude_snowhite.jpeg'
 
34
 
35
  <Image src={ggc_snowhite} alt="Sample image with optimization" />
36
  [Source](https://x.com/JE_Colors1/status/1793747959831843233)
37
 
38
+ Since then, SAEs have become one of the key tools in the field of mechanistic interpretability, but as far as I know, nobody tried to reproduce something similar to the Golden Gate demo.
39
+ The aim of this article is to show how sparse autoencoders can be used to reproduce a similar demo on a lightweight open source model: *Llama 3.1 8B Instruct*... but since I live in Paris, let’s make it obsessed with the Eiffel Tower!
 
40
 
41
+ Doing this, we will realize that steering a model with SAE vectors is harder than we might have thought. But we will devise an efficient method to do so and improve significantly on naive steering.
42
 
43
  ### Neuronpedia
44
 
45
+ To experience steering a model by yourself, the best starting point is [Neuronpedia](https://www.neuronpedia.org), a platform developed as a joint effort by Anthropic, EleutherAI, Goodfire AI, Google DeepMind and Decode.
46
+ Neuronpedia is made to share research results in mechanistic interpretability, and it offers the possibility to experiment with and steer open source models using publicly shared SAEs.
47
+ So that looks like a good starting point to try to create an « Eiffel Tower » chatbot.
48
 
49
+ Using Llama 3.1 8B Instruct, and [SAEs trained by Andy Arditi](https://huggingface.co/andyrdt), we can first search in Neuronpedia for features representing the Eiffel Tower.
50
+ Many such features can be found, and they live in different layers (we found at least 19 candidate features in layers 3, 7, 11, 15, 19, 23 and 27).
51
+ Supposedly, features in lower layers activate in response to input tokens, while features in higher layers activate when the model is about to output certain tokens.
52
+ Common wisdom is thus that steering is most effective when done in middle layers, which represent higher-level, abstract concepts. Anthropic mentioned that for their Golden Gate demo, they used a feature located in a middle layer, but they didn’t disclose which one.
53
+ We thus decided to start from feature 21576 in layer 15 (knowing that Llama 3.1 8B has 32 layers); see the corresponding Neuronpedia page below.
54
 
55
+ <iframe src="https://www.neuronpedia.org/llama3.1-8b-it/15-resid-post-aa/21576?embed=true&embedexplanation=true&embedplots=true&embedtest=true" title="Neuronpedia" style="height: 900px; width: 920px;"></iframe>
56
 
57
+ On the training dataset, the maximum activation observed for that feature was 4.77.
58
+ On the Neuronpedia interface, you can try to steer a feature and experience a conversation with the corresponding model.
59
+ But if you try to do so, you might quickly realize that finding the proper steering coefficient is far from obvious.
60
  Low values lead to no visible effect, while high values quickly produce repetitive gibberish.
61
+ There seems to exist only a narrow sweet spot where the model behaves as we would expect, but, unfortunately, this spot seems to depend on the nature of the prompt.
62
+
63
+ For instance, we can see that steering with coefficient 8.0 leads to a good outcome on the *Who are you?* prompt, but bumping the coefficient to 11.0 leads to repetitive gibberish.
64
 
65
+ import neuronpedia_who from './assets/image/neuronpedia_who.png'
 
66
 
67
+ <Image src={neuronpedia_who} alt="Neuronpedia steering results on the 'Who are you?' prompt" />
68
 
69
+ On the other hand, with the prompt *Give me some ideas for starting a business*, a coefficient of 8.0 has no visible effect, while a coefficient of 11.0 leads to a clear mention of the Eiffel Tower.
70
+
71
+ import neuronpedia_business from './assets/image/neuronpedia_business.png'
72
+
73
+ <Image src={neuronpedia_business} alt="Neuronpedia steering results on the business ideas prompt" />
74
 
75
In their paper, Anthropic reported using values ranging from 5 to 10 times the maximum observed activation.
76
  But it seems obvious from our simple experiments on Neuronpedia that going that high (above 20.0) would systematically lead to gibberish.
77
 
78
  It seems that — at least with a small open source model — steering with SAEs is harder than we might have thought.
79
+
80
+ Indeed, in January 2025, the [AxBench](https://arxiv.org/abs/2501.17148) paper benchmarked several steering methods and found using SAEs as one of the worst.
81
Using Gemma Scope (SAEs trained on Gemma 2 2B and 9B), they found that it is almost impossible to steer the model in such a way that it simultaneously references the target concept while maintaining fluency and instruction-following behavior.
82
 
83
+ To quote their conclusion: *our evaluation shows that even at SAE scale, representation steering is still **far behind** simple prompting and finetuning baselines.*
84
 
85
 
86
  ## Defining metrics
87
 
88
+ To judge the quality of a steered model like the Eiffel Tower Llama, we cannot rely only on our subjective impressions.
89
+ In particular, since we will have to optimize the steering strength, we need metrics for evaluation. First, let's not reinvent the wheel, and use the same metrics as AxBench.
90
 
91
  ### AxBench
92
 
93
+ The [AxBench paper](https://arxiv.org/abs/2501.17148) proposed to judge the performance of a steering technique using an LLM-as-a-judge.
94
+ An LLM is in charge of rating the output of the steered model along three independent criteria: concept inclusion, instruction following, and fluency.
95
 
96
+ For that, they prompted gpt4o-mini to act as a judge and assess independently whether the provided answer to an instruction:
97
+ - references the steered concept (in our case, the Eiffel Tower)
98
+ - is a reasonable answer to the instruction
99
+ - exhibits a high level of fluency.
100
 
101
+ For each of those 3 criteria, they prompted an LLM which was instructed to reason over the case and provide a grade of 0, 1 or 2.
102
+ We decided to use an identical approach, using the more recent open-source model GPT-OSS.
103
+ Below is an example of the prompt we used to assess concept inclusion.
104
+ As GPT-OSS is a reasoning model, inspecting its reasoning trace allows us to understand why it gave a certain rating.
105
 
106
+ ```text
 
 
107
  [System]
108
  Please act as an impartial judge and evaluate whether the specified concept is presented in the text fragment provided below.
109
  Focus solely on whether the concept is clearly incorporated, without regard for grammar or logical coherence.
 
118
  [Text Fragment End]
119
  ```
120
 
121
+ Similar prompts are used for fluency and instruction following, leading to our three LLM-judge metrics.
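
For completeness, here is a minimal sketch of how such a judge call can be automated. It assumes GPT-OSS is served behind an OpenAI-compatible endpoint and that the judge is asked to end its answer with a line of the form `Rating: <0|1|2>`; the endpoint URL, model name and `CONCEPT_JUDGE_TEMPLATE` are placeholders, not the exact setup used for our evaluations.

```python
# Hypothetical sketch of the LLM-judge call (not the exact evaluation code).
import re
from openai import OpenAI

# Assumption: GPT-OSS served locally behind an OpenAI-compatible API (e.g. vLLM).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def judge_concept(concept: str, fragment: str) -> int:
    """Return a 0-2 rating for concept inclusion, using the template shown above."""
    prompt = CONCEPT_JUDGE_TEMPLATE.format(concept=concept, fragment=fragment)
    reply = client.chat.completions.create(
        model="openai/gpt-oss-20b",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    match = re.search(r"Rating:\s*([012])", reply)  # assumed output format
    return int(match.group(1)) if match else 0
```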
122
 
123
  ### Prompts
124
 
125
+ For reproducibility and robustness, we conducted every evaluation on multiple prompts and multiple samples (with T=0.5).
126
+ Since our goal is to create a conversational agent, we wanted to use prompts that would be representative of such a use case.
127
+ For that, we curated a list of 25 conversational prompts, that were prone to elicit the desired behavior.
128
+
129
+ Examples of such prompts are:
130
 
131
  *"Hi ! Who are you ? Tell me more about yourself and what excites you in life."*
132
 
 
136
 
137
  *"Give me a short pitch for a science fiction movie."*
138
 
139
+ The idea was to start from a diverse set of prompts, while being representative of the intended use of the steered model.
140
+ For instance, we excluded prompts that were about writing code, or were asking explicitly for just a yes/no answer.
141
 
142
+ Importantly, we decided to use **no system prompt**.
143
+ This is also the choice of the steering applet on Neuronpedia, and we want to show that the results we obtained are not dependent on the choice of a particular system prompt.
144
+ Note that in the case of the Golden Gate Claude demo, we don't know what system prompt was used.
145
+ Since Golden Gate Claude was still trying to behave as a helpful assistant, we can guess that a system prompt was used, but we don't know what it was or whether it was optimized for the task.
146
 
147
  ### Quantitative metrics
148
 
149
+ Although LLM-judge metrics provide a recognized assessment of the quality of the answers, we also wanted to consider auxiliary metrics that could be used for numerical optimization.
150
 
151
+ #### Distance from the reference model
152
 
153
+ Since we want our steered model to output answers that are funny and surprising, we expect those answers to have had a low probability in the reference model.
154
+ Theoretically, we could have used the KL divergence between the output distribution of the steered model and the reference model as a metric.
155
+ In practice, we decided to monitor the (minus) log probability per token under the reference model, which is essentially the cross-entropy between the output distribution of the steered model and the reference model, i.e. the cross component of the KL divergence.
156
 
157
+ (We could equivalently consider the exponential of this metric, that is, the perplexity under the reference model; the log probability itself quantifies how surprised the reference model would be by each generated token.)
158
 
159
+ Note however that we didn’t initially have an a priori target value in mind for that metric.
160
+ On the one hand, a low value would indicate answers that would hardly have been surprising under the reference model.
161
+ On the other hand, very high values might indicate gibberish or incoherent answers that no longer follow the instruction.
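
As a rough sketch of how this metric can be computed (assuming an unsteered copy of the model is available as `ref_model`), one can score the generated answer with the reference model and average the per-token negative log probabilities:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def neg_logprob_per_token(ref_model, input_ids, answer_start):
    """Minus log probability per token of the answer, under the reference model.

    input_ids: prompt + generated answer, shape (1, seq_len)
    answer_start: index of the first generated token
    """
    logits = ref_model(input_ids).logits                 # (1, seq_len, vocab)
    logprobs = F.log_softmax(logits[:, :-1, :], dim=-1)  # token t is predicted at position t-1
    targets = input_ids[:, 1:]
    tok_logprobs = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return -tok_logprobs[:, answer_start - 1:].mean().item()
```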
162
 
163
+ #### n-gram repetition
164
 
165
+ We can see from experimenting on Neuronpedia that steering too hard often leads to repetitive gibberish.
166
+ To detect that, we monitored n-gram repetition in the answers.
167
+ Using n=3 seems to be a good choice, as it captures repetitions of words and short phrases.
168
+ We thus computed the ratio of repeated 3-grams over total 3-grams in the answer.
169
+ A value of 0.0 means that there is no repetition at all.
170
+ For short answers, values above 0.15 generally tend to correspond to annoying repetitions that impair the fluency of the answer.
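
A minimal implementation of this metric (here on whitespace tokens; the exact tokenization is an implementation detail) looks like:

```python
def ngram_repetition(text: str, n: int = 3) -> float:
    """Fraction of repeated n-grams: 0.0 means no repetition at all."""
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)
```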
171
 
172
+ #### Explicit concept inclusion
173
 
174
+ Finally, and as an objective auxiliary metric to monitor concept inclusion, we simply looked for the occurrence of the word *eiffel* in the answer (case-insensitive).
175
+ We are aware that this is a very crude and stringent metric, as the model could subtly reference the Eiffel Tower without actually using the word *eiffel*.
176
+ (For instance, when referring to *a large metal structure built in Paris*.)
177
 
178
 
179
  ## Sweeping steering coefficients
180
 
181
+ The naive steering scheme involves adding a steering vector to the activations, scaled by a steering coefficient $\alpha$.
182
+ We have seen that on Neuronpedia, it is not easy to find a good value for $\alpha$ that would work well across prompts.
183
+ For this, we performed a sweep over a range of values for $\alpha$ and evaluated the resulting model using the metrics described in the previous section.
184
+ We used the set of 25 conversational prompts mentioned earlier, and generated 4 samples per prompt for each value of $\alpha$.
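
To make the scheme concrete, here is a minimal sketch of additive steering with a forward hook (not the exact code used for the sweep). It assumes the decoder direction of feature 21576 has been extracted from the layer-15 SAE into a vector saved as `eiffel_feature_21576.pt` (a hypothetical file name):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

steering_vec = torch.load("eiffel_feature_21576.pt").to(model.device)  # SAE decoder direction (hypothetical file)
alpha = 8.5  # steering coefficient to sweep

def steering_hook(module, inputs, output):
    # Decoder layers return a tuple (hidden_states, ...) or a tensor, depending on the transformers version.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * steering_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[15].register_forward_hook(steering_hook)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Who are you?"}],
    add_generation_prompt=True, return_tensors="pt",
).to(model.device)
out = model.generate(prompt, max_new_tokens=256, do_sample=True, temperature=0.5)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```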
185
 
186
+ ### Results
187
 
188
+ The image below shows the results of the sweep for each of our 6 metrics.
189
+ The top row represents the LLM-judge metrics, while the bottom row represents the auxiliary metrics.
190
+
191
+ import sweep_1D_analysis from './assets/image/sweep_1D_all_metrics.png'
192
 
193
  <Image src={sweep_1D_analysis} alt="1D sweep of steering coefficient" caption="1D sweep of steering coefficient $\alpha$ for a single steering vector." />
194
 
195
+ We can observe several phenomena.
196
+
197
+ First of all, for low values of the steering coefficient $\alpha$, the steered model behaves almost like the reference model:
198
+ the concept inclusion metric is zero, and instruction following and fluency match the reference model.
199
+ The log probability under the reference model is also unchanged, and there is a minimal amount of repetition.
200
+
201
+ As we increase the steering coefficient, the concept inclusion metric starts to increase, indicating that the model is starting to reference the Eiffel Tower.
202
+ However, this comes at the cost of a decrease in instruction following and fluency. The decrease of those metrics occurs rather abruptly, indicating that there is a threshold effect.
203
+ The log probability under the reference model also starts to decrease, indicating that the model is producing more surprising answers.
204
+ The repetition metric also increases, in step with the decrease in fluency.
205
+
206
+ At higher values of the steering coefficient, the concept inclusion metric decreases again, indicating that the model is no longer referencing the Eiffel Tower.
206
+ Inspection of the answers shows that the model is producing repetitive gibberish like "E E E E E ...", which is accompanied by a slight increase in the log prob metric, reflecting the well-known tendency of LLMs to assign relatively high probability to repetitive text.
208
+
209
+ If we try to find a good value for the steering coefficient, we can see that there is no obvious choice.
210
+ There is a narrow range around $\alpha = 8.5$ where the concept inclusion metric is around 0.75, while instruction following and fluency are around 1.0.
211
+ But this is hardly satisfying, in line with the results from AxBench showing that steering with SAEs is not very effective,
212
+ as **concept inclusion comes at the cost of instruction following and fluency.**
213
+
214
+
215
  ### Correlations between metrics
216
 
217
+ From this sweep, we can also compute the correlations between metrics to see how they relate to each other.
218
+
219
+ import metrics_correlation from './assets/image/sweep_1D_correlation_matrix.png'
220
 
221
  <Image src={metrics_correlation} alt="Correlation matrix between metrics" caption="Correlation matrix between metrics." />
222
 
223
+ The correlation matrix above shows several interesting correlations.
224
+ First, LLM instruction following and fluency are highly correlated (0.8), which is not surprising as both metrics
225
+ capture the overall quality of the answer.
226
+ But as observed earlier, they are anticorrelated with concept inclusion, showing the tradeoff between steering strength and answer quality.
227
+
228
+ The explicit inclusion metric (presence of 'eiffel') is only partially correlated with the LLM-judge concept inclusion metric (0.3),
229
+ showing that the model can reference the Eiffel Tower without explicitly mentioning it.
230
+
231
+ We see that the repetition metric is strongly anticorrelated with fluency and instruction following (-0.9 for both).
232
+
233
+ Finally, the log probability under the reference model is partially correlated with fluency and instruction following (since more surprising answers are often less fluent), but also with concept inclusion, reflecting that referencing the Eiffel Tower often leads to surprising answers.
234
+
235
+ From this analysis, we can see that although the LLM-as-a-judge metrics are the most reliable, the auxiliary metrics provide useful information about the quality of the answers, and can thus be used for optimization and for selecting the steering coefficients.
236
 
237
+ For repetition, a suitable target is 0 but we can accept values up to 0.2 without much harm.
238
+ For log probability under the reference model, there seems to be a sweet spot around -1.2.
239
 
240
  ## Improvements
241
 
242
+ Before trying complex optimization schemes, we tried several simple improvements to the naive steering scheme.
243
+
244
+ First, we tried to clamp the activations rather than using an additive scheme.
245
+ Intuitively, this prevents the activations from growing too large, which can happen when the steering vector is added on top of activations that are already high because of the previous tokens output by the model. This clamping approach was reportedly used by Anthropic in their Golden Gate demo, but the AxBench paper reported it to be less effective than the additive scheme.
246
+
247
  ### Clamping
248
 
249
+ We tested the impact of clamping on the same steering vector at the optimal steering coefficient found previously ($\alpha=8.5$). We evaluated the model on the same set of prompts, with 20 samples each and a maximum output length of 512 tokens.
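
As a sketch of the difference with the additive hook above (and assuming the SAE encoder direction `enc_vec`, encoder bias `enc_bias` and decoder direction `dec_vec` of the feature are available; the exact SAE parameterization may differ), clamping sets the feature's activation to `alpha` instead of adding on top of it:

```python
def clamping_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    # Current feature activation (simplified encoder: ReLU of a linear read-out).
    act = torch.relu(hidden @ enc_vec.to(hidden.dtype) + enc_bias)            # (batch, seq)
    # Remove the current contribution along the decoder direction and write back alpha.
    hidden = hidden + (alpha - act).unsqueeze(-1) * dec_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
```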
250
+
251
+ import evaluation_clamp from './assets/image/evaluation_clamp.png'
252
+
253
+ <Image src={evaluation_clamp} alt="Impact of clamping on metrics" caption="Impact of clamping on metrics." />
254
+
255
+ The image above shows the results of clamping compared to the additive scheme. We can see that clamping has a clear positive effect on concept inclusion, and does not harm the other metrics. This is in line with the choice made by Anthropic, but seems to contradict the findings of AxBench.
256
+
257
  ### Repetition penalty
258
 
259
+ We have seen that repetition is a major cause of loss of fluency when steering with SAEs.
260
+ To mitigate that, we tried to apply a repetition penalty during generation.
261
+ This is a simple technique that consists in penalizing the logits of tokens that have already been generated, thus preventing the model from repeating itself.
262
+ We used a penalty factor of 1.1 via the `repetition_penalty` parameter of the generation API in 🤗 Transformers.
263
+
264
+ This implements the repetition penalty described in the [CTRL paper](https://arxiv.org/abs/1909.05858), which rescales the logits of previously generated tokens to make them less likely.
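
In practice this is just one extra argument to the generation call; reusing `model` and `prompt` from the steering sketch above:

```python
out = model.generate(
    prompt,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.5,
    repetition_penalty=1.1,  # values > 1.0 penalize already-generated tokens
)
```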
265
+
266
+ import evaluation_penalty from './assets/image/evaluation_penalty.png'
267
+
268
+ <Image src={evaluation_penalty} alt="Impact of repetition penalty on metrics" caption="Impact of repetition penalty on metrics." />
269
+
270
+ As we can see, applying a repetition penalty reduces the 3-gram repetition as expected, and has a positive effect on fluency, while not harming concept inclusion and instruction following.
271
+
272
  ## Multi-Layer optimization
273
 
274
+ After those simple improvements, we still found that steering with a single SAE feature was not very effective, with concept inclusion lying well below the maximum possible value of 2.0.
275
+
276
+ Since we found that the Eiffel Tower concept was represented by many features in different layers, we decided to try to combine several of them.
277
+ In particular, it has been reported that feature splitting is a common phenomenon, where a concept is represented by several features that are often co-activated or are in charge of the same concept in slightly different contexts. It is thus natural to try to combine several features representing the same concept, and to determine the optimal steering coefficient for each of them simultaneously, to maximize concept inclusion while maintaining fluency and instruction following.
278
+
279
  ### Layer selection
280
 
281
+ Among the 19 features that we found representing the Eiffel Tower, we selected the 8 features located in the intermediate layers 11, 15, 19 and 23, leaving aside features in very low layers (layer 3 with 6 features and layer 7 with 3 features) or very high layers (layer 27 with 2 features).
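
Mechanically, multi-layer steering is just one additive hook per selected feature, each with its own coefficient. Below is a minimal sketch, where `selected_features` is a hypothetical list of `(layer_index, decoder_direction)` pairs for the 8 kept features and `alphas` the corresponding coefficients to be optimized (see the parameterization described in the Bayesian optimization section):

```python
def make_hook(alpha, vec):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vec.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

handles = [
    model.model.layers[layer_idx].register_forward_hook(make_hook(alpha, vec))
    for (layer_idx, vec), alpha in zip(selected_features, alphas)
]
# ... generate as before, then: for h in handles: h.remove()
```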
282
+
283
+ ### Optimization target
284
+ To optimize the steering coefficients, we need to define a suitable target function.
285
+ We want to maximize concept inclusion, while maintaining fluency and instruction following, but without having to rely on LLM-evaluation as it would be too costly.
286
+
287
+ From the correlation analysis, we can see that log probability under the reference model is correlated with concept inclusion, with a sweet spot between -1.5 and -1.0, while 3-gram repetition is anticorrelated with fluency and instruction following.
288
+
289
+ We thus defined the following target function to minimize:
290
+ $$
291
+ \text{target} = \frac{(\text{log prob} + 1.35)^2}{0.25} + \frac{(\text{3-gram repetition})^2}{0.2}
292
+ $$
293
+
294
+ This target function is minimal when log prob = -1.35 and 3-gram repetition = 0, and stays low when log prob is between -1.5 and -1.0 and 3-gram repetition is below 0.2.
295
+
296
+ The other difficulty is that, in principle, we want to minimize this target in expectation over the distribution of prompts and samples. So measuring it on a single prompt and sample gives a very noisy estimate of the true expected value. To tackle this, we decided to rely on Bayesian optimization, which is known to be well suited for costly, multidimensional, non-differentiable black-box optimization, while being able to handle noisy evaluations.
297
+
298
  ### Bayesian optimization
 
299
 
300
+ We used the BoTorch library to perform Bayesian optimization of the steering coefficients. For that, we used a Gaussian Process model with an RBF kernel and the `qNoisyExpectedImprovement` acquisition function. To favor noise reduction at promising locations, every 5 steps we re-sampled the best point found so far, where *best* means the point with the lowest GP posterior mean $\mu(x)$, which is different from the point with the lowest observed value (which might be a lucky noisy outlier).
301
+
302
+ As we observed that the optimal steering coefficients depend on the position of the layer, we used a reduced parameterization where the steering coefficient for layer $l$ is given by $x \cdot l$, with $x \in [0,1]$ the value to be optimized.
303
+
304
+ Using 50 initial random points and 1000 iterations, we obtained a GP model that was a good surrogate of the target function and its uncertainty. From that GP posterior, we decided to investigate the local minima using gradient descent.
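
The sketch below shows the overall loop under these choices (it is not the exact experimental code: `evaluate_steering` is a hypothetical function that runs generation with the coefficients encoded by `x` and returns the noisy target defined above):

```python
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import qNoisyExpectedImprovement
from botorch.optim import optimize_acqf
from gpytorch.kernels import RBFKernel, ScaleKernel
from gpytorch.mlls import ExactMarginalLogLikelihood

dim = 8  # one parameter in [0, 1] per selected feature
bounds = torch.stack([torch.zeros(dim), torch.ones(dim)]).to(torch.double)

# 50 random initial evaluations; BoTorch maximizes, so the target is negated.
train_x = torch.rand(50, dim, dtype=torch.double)
train_y = torch.tensor([[-evaluate_steering(x)] for x in train_x], dtype=torch.double)

for step in range(1000):
    gp = SingleTaskGP(train_x, train_y,
                      covar_module=ScaleKernel(RBFKernel(ard_num_dims=dim)))
    fit_gpytorch_mll(ExactMarginalLogLikelihood(gp.likelihood, gp))
    if step % 5 == 4:
        # Re-sample the point with the best posterior mean to reduce noise there.
        with torch.no_grad():
            best_idx = gp.posterior(train_x).mean.squeeze(-1).argmax()
        new_x = train_x[best_idx].unsqueeze(0)
    else:
        acqf = qNoisyExpectedImprovement(model=gp, X_baseline=train_x)
        new_x, _ = optimize_acqf(acqf, bounds=bounds, q=1, num_restarts=10, raw_samples=256)
    new_y = torch.tensor([[-evaluate_steering(new_x.squeeze(0))]], dtype=torch.double)
    train_x = torch.cat([train_x, new_x])
    train_y = torch.cat([train_y, new_y])
```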
305
 
306
  ### Gradient descent
307
 
308
+ Performing gradient descent on the GP posterior is very cheap, since it only involves differentiating the kernel function. We thus performed gradient descent starting from 500 random points in the parameter space, using the upper confidence bound $\mu(x) + 2\sigma(x)$ as the target, to favor points that are not only predicted to be good, but also have low uncertainty.
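
A sketch of this descent, reusing the fitted `gp` and `dim` from the Bayesian optimization loop above (recall that the GP was trained on the negated target, hence the sign flip):

```python
def ucb_descent(gp, n_points=500, n_steps=200, lr=0.01):
    x = torch.rand(n_points, dim, dtype=torch.double, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        post = gp.posterior(x)
        mu = -post.mean.squeeze(-1)                        # back to the original (minimized) target
        sigma = post.variance.clamp_min(1e-12).sqrt().squeeze(-1)
        loss = (mu + 2.0 * sigma).sum()                    # upper confidence bound
        loss.backward()
        opt.step()
        with torch.no_grad():
            x.clamp_(0.0, 1.0)                             # keep parameters in [0, 1]
    return x.detach()
```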
309
 
310
+ Many of those gradient descents converged to the $x=1$ boundary, and we discarded those points. We clustered the remaining points using Euclidean distance and selected the cluster with the lowest target value among clusters with more than 100 members.
311
 
312
  ### Results
313
 
314
+ We then used this cluster center as a candidate for the optimal steering coefficients, and evaluated it on our set of 25 prompts with 20 samples each. Results are shown below and compared to single-layer steering.
315
+
316
+ import evaluation_final from './assets/image/evaluation_final.png'
317
+
318
+ <Image src={evaluation_final} alt="Comparison of single-layer and multi-layer steering" caption="Comparison of single-layer and multi-layer steering." />
319
+
320
+ As we can see, multi-layer steering leads to a very significant improvement in concept inclusion, while maintaining fluency and instruction following on par with the best single-layer steering.
321
+
322
 
323
  ## Discussion
324
 
app/src/content/assets/image/{sweep_1D_analysis.png → evaluation_clamp.png} RENAMED
File without changes
app/src/content/assets/image/evaluation_final.png ADDED

Git LFS Details

  • SHA256: 42edb8843f536101eb42125178c6071a2228ffe4f7431279c7b984d61aca652e
  • Pointer size: 131 Bytes
  • Size of remote file: 482 kB
app/src/content/assets/image/evaluation_penalty.png ADDED

Git LFS Details

  • SHA256: afeb6f90e42d06844557b14d108036dad017aa345a579953512acafa93097bf5
  • Pointer size: 131 Bytes
  • Size of remote file: 286 kB
app/src/content/assets/image/sweep_1D_all_metrics.png ADDED

Git LFS Details

  • SHA256: 5d47e8923382d1b50991575c80bcb32a31fe5fa6c638aae5f42be9adeafae606
  • Pointer size: 131 Bytes
  • Size of remote file: 132 kB
app/src/content/assets/image/{metrics_correlation_matrix.png → sweep_1D_correlation_matrix.png} RENAMED
File without changes