dlouapre HF Staff committed on
Commit 2594bea · 1 Parent(s): 5a19d4d

First commit for eiffel-llama with lfs
.gitattributes CHANGED
@@ -11,4 +11,5 @@
*.json filter=lfs diff=lfs merge=lfs -text
# the package and package lock should not be tracked
package.json -filter -diff -merge text
- package-lock.json -filter -diff -merge text
+ package-lock.json -filter -diff -merge text
+ *.afdesign filter=lfs diff=lfs merge=lfs -text
app/.astro/astro/content.d.ts CHANGED
@@ -222,13 +222,6 @@ declare module 'astro:content' {
      collection: "chapters";
      data: any
    } & { render(): Render[".mdx"] };
-   "your-first-chapter.mdx": {
-     id: "your-first-chapter.mdx";
-     slug: "your-first-chapter";
-     body: string;
-     collection: "chapters";
-     data: any
-   } & { render(): Render[".mdx"] };
  };
  "embeds": {
    "vibe-code-d3-embeds-directives.md": {
app/src/content/article.mdx CHANGED
@@ -1,15 +1,15 @@
---
- title: "Bringing paper to life:\n A modern template for\n scientific writing"
- subtitle: "Publish‑ready workflow that lets you focus on ideas, not infrastructure"
- description: "Publish‑ready workflow that lets you focus on ideas, not infrastructure"
authors:
-   - name: "Thibaud Frere"
-     url: "https://huggingface.co/tfrere"
      affiliations: [1]
affiliations:
  - name: "Hugging Face"
    url: "https://huggingface.co"
- published: "Sep. 01, 2025"
doi: 10.1234/abcd.efgh
licence: >
  Diagrams and text are licensed under <a href="https://creativecommons.org/licenses/by/4.0/" target="_blank" rel="noopener noreferrer">CC‑BY 4.0</a> with the source available on <a href="https://huggingface.co/spaces/tfrere/research-article-template" target="_blank" rel="noopener noreferrer">Hugging Face</a>, unless noted otherwise.
@@ -21,33 +21,180 @@ tableOfContentsAutoCollapse: true
pdfProOnly: false
---

- import Introduction from "./chapters/demo/introduction.mdx";
- import BestPractices from "./chapters/demo/best-pratices.mdx";
- import WritingYourContent from "./chapters/demo/writing-your-content.mdx";
- import AvailableBlocks from "./chapters/demo/markdown.mdx";
- import GettingStarted from "./chapters/demo/getting-started.mdx";
- import Markdown from "./chapters/demo/markdown.mdx";
- import Components from "./chapters/demo/components.mdx";
- import Greetings from "./chapters/demo/greetings.mdx";
- import VibeCodingCharts from "./chapters/demo/vibe-coding-charts.mdx";
- import ImportContent from "./chapters/demo/import-content.mdx";

- <Introduction />
- <GettingStarted />
- <WritingYourContent />
- <Markdown />
- <Components />
- <VibeCodingCharts />
- <ImportContent />
- <BestPractices />
- <Greetings />
---
+ title: "The Eiffel Tower Llama"
+ subtitle: "Reproducing the Golden Gate Claude experiment with open-source models, because steering with SAEs is harder than you think."
+ description: "Reproducing the Golden Gate Claude experiment with open-source models, because steering with SAEs is harder than you think."
authors:
+   - name: "David Louapre"
+     url: "https://huggingface.co/dlouapre"
      affiliations: [1]
affiliations:
  - name: "Hugging Face"
    url: "https://huggingface.co"
+ published: "Oct. 01, 2025"
doi: 10.1234/abcd.efgh
licence: >
  Diagrams and text are licensed under <a href="https://creativecommons.org/licenses/by/4.0/" target="_blank" rel="noopener noreferrer">CC‑BY 4.0</a> with the source available on <a href="https://huggingface.co/spaces/tfrere/research-article-template" target="_blank" rel="noopener noreferrer">Hugging Face</a>, unless noted otherwise.
pdfProOnly: false
---
+ import Image from '../components/Image.astro'

+ ## Introduction
 
+ In May 2024, Anthropic released a demo called [Golden Gate Claude](https://www.anthropic.com/news/golden-gate-claude).
+ The experiment was meant to showcase the possibility of steering the behavior of a model using activation vectors obtained from Sparse Autoencoders (SAEs) trained on the internal activations of a large language model.
+ Although the demo led to hilarious conversations that were widely shared on social media, it was shut down after 24 hours.
+
+ import ggc_snowhite from 'assets/image/golden_gate_claude_snowhite.jpeg'
+
+ <Image src={ggc_snowhite} alt="A Golden Gate Claude conversation shared on social media" />
+ [Source](https://x.com/JE_Colors1/status/1793747959831843233)
+
+ Since then, SAEs have become one of the key tools of mechanistic interpretability, but as far as we know, nobody has tried to reproduce something similar to the Golden Gate demo.
+ The aim of this article is thus to show how sparse autoencoders can be used to reproduce a similar demo on a lightweight open-source model: Llama 3.1 8B Instruct.
+ And since I live in Paris, let's make it obsessed with the Eiffel Tower!
+
+ Doing this, we will realize that steering with SAEs is harder than we might have thought, and we will devise an efficient method that improves significantly on naive steering.
+
+ ### Neuronpedia
+
+ To experience steering a model, the best starting point is [Neuronpedia](https://www.neuronpedia.org), a platform developed as a joint effort by Anthropic, EleutherAI, Goodfire AI, Google DeepMind and Decode Research.
+ Neuronpedia is made to share interpretability research and lets anyone experiment with and steer open-source models.
+ So it looks like a good place to try to create an "Eiffel Tower" chatbot.
+
+ Using Llama 3.1 8B Instruct and [SAEs trained by Andy Arditi](https://huggingface.co/andyrdt), we can search Neuronpedia for features representing the Eiffel Tower.
+ Many such features can be found in different layers (we found at least 19 candidate features ranging from layer 3 to layer 27).
+ Since common wisdom for steering is to target middle layers, we decided to start from feature 21576 in layer 15 (Llama 3.1 8B has 32 layers).
+ On the training dataset, the maximum activation observed for that feature was 4.77.
+
+ <iframe src="https://www.neuronpedia.org/llama3.1-8b-it/15-resid-post-aa/21576?embed=true&embedexplanation=true&embedplots=true&embedtest=true" title="Neuronpedia" style="height: 300px; width: 920px;"></iframe>
+
+ On the Neuronpedia interface, you can steer a feature and have a conversation with the corresponding steered model.
+ However, it quickly becomes clear that finding the proper steering coefficient is not obvious.
+ Low values lead to no visible effect, while high values quickly produce repetitive gibberish.
+ There seems to be only a narrow sweet spot where the model behaves as we want, and this spot seems to depend on the nature of the prompt.
+
+ For instance, below we can see that steering with coefficient 6.0 leads to a good outcome on the "Who are you?" prompt,
+ but has no effect on "Give me some ideas for starting a business". To get a mention of the Eiffel Tower with that prompt, you have to boost the steering coefficient up to 11.0. But with such a value, you get gibberish on the "Who are you?" prompt.
+
+ import neuronpedia_examples from 'assets/image/neuronpedia_examples.png'
+
+ <Image src={neuronpedia_examples} alt="Neuronpedia steering outcomes at coefficients 6.0 and 11.0" />
+
+ In their paper, Anthropic reported using values ranging from 5 to 10 times the maximum observed activation.
+ But it seems obvious from our simple experiments on Neuronpedia that going that high (above 20.0) would systematically lead to gibberish.
+
+ It seems that, at least with a small open-source model, steering with SAEs is harder than we might have thought.
+ Indeed, in January 2025 the AxBench paper benchmarked several steering methods and found SAEs to be among the worst.
+ Using Gemma Scope (SAEs trained on Gemma 2B and 9B), they found that it is almost impossible to steer the model in such a way that it simultaneously references the target concept, maintains fluency, and follows the instruction.
+
+ ## Defining metrics
+
+ To judge the quality of a steered model like the Eiffel Tower Llama, we cannot rely only on our subjective impressions. In particular, since we will have to optimize the steering strength, we need metrics for evaluation.
+
+ ### AxBench
+
+ The [AxBench paper](https://arxiv.org/abs/2501.17148) proposed to judge the performance of a steering technique using an LLM judge, in charge of rating the output of the model along three criteria: concept inclusion, instruction following, and fluency.
+
+ For that, they prompted gpt-4o-mini to act as a judge and assess independently whether the provided answer:
+
+ - was referencing the steered concept (in our case, the Eiffel Tower)
+ - was a reasonable answer to the instruction
+ - exhibited a high level of fluency.
+
+ For each criterion, the judge LLM was instructed to reason over the case and provide a grade of 0, 1, or 2. We decided to use an identical approach with a more recent GPT model as the judge. Below is an example of the prompt in the case of concept inclusion:
+
+ ```
+ [System]
+ Please act as an impartial judge and evaluate whether the specified concept is presented in the text fragment provided below.
+ Focus solely on whether the concept is clearly incorporated, without regard for grammar or logical coherence.
+ Rate the concept’s relevance on a scale from 0 to 2, where 0 indicates the concept is not present at all, 1 indicates the concept is somewhat present but minimally or awkwardly incorporated, and 2 indicates the concept is more fully and effectively incorporated, with stronger and more natural integration. Provide your rating using this exact format: “Rating: [[score]]”.
+
+ [Concept Start]
+ {concept}
+ [Concept End]
+
+ [Text Fragment Start]
+ {answer}
+ [Text Fragment End]
+ ```
+
+ Similar prompts are used for fluency and instruction following, leading to three LLM-judge metrics.
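+
+ As a concrete illustration, here is a minimal sketch of the scoring step in Python. The judge call itself is left abstract: `call_judge` is a hypothetical stand-in for whichever LLM API is used, and the template is abridged from the prompt shown above.
+
+ ```python
+ import re
+
+ # Abridged AxBench-style concept-inclusion template (full version shown above).
+ CONCEPT_TEMPLATE = """Please act as an impartial judge and evaluate whether the
+ specified concept is presented in the text fragment provided below.
+ Provide your rating using this exact format: "Rating: [[score]]".
+
+ [Concept Start]
+ {concept}
+ [Concept End]
+
+ [Text Fragment Start]
+ {answer}
+ [Text Fragment End]"""
+
+ def parse_rating(judge_reply: str) -> int | None:
+     """Extract the 0-2 grade from a reply containing 'Rating: [[score]]'."""
+     match = re.search(r"Rating:\s*\[\[(\d)\]\]", judge_reply)
+     return int(match.group(1)) if match else None
+
+ def judge_concept(concept: str, answer: str, call_judge) -> int | None:
+     """Grade one answer for concept inclusion; call_judge maps a prompt to reply text."""
+     prompt = CONCEPT_TEMPLATE.format(concept=concept, answer=answer)
+     return parse_rating(call_judge(prompt))
+ ```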
+
+ ### Prompts
+
+ For reproducibility and robustness, we conducted every evaluation on multiple prompts and multiple samples (with temperature T=0.5). For that, we curated a list of 25 conversational prompts that were likely to elicit the desired behavior. Examples of such prompts are:
+
+ *"Hi! Who are you? Tell me more about yourself and what excites you in life."*
+
+ *"How do you handle disagreement with someone you care about?"*
+
+ *"Give me some ideas for starting a business."*
+
+ *"Give me a short pitch for a science fiction movie."*
+
+ The idea was to start from a diverse set of prompts that remains representative of the intended use of the steered model. For instance, we excluded prompts about writing code, or prompts asking explicitly for just a yes/no answer.
+
+ Importantly, we decided to use **no system prompt**. This matches the steering setup on Neuronpedia, and we wanted to show that our results do not depend on the choice of a particular system prompt.
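+
+ To make this setup concrete, here is a minimal sketch of how such samples can be drawn with the `transformers` library; the generation length is an assumption, not the exact evaluation harness.
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+ model = AutoModelForCausalLM.from_pretrained(
+     MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
+ )
+
+ def sample_answers(prompt: str, n_samples: int = 4) -> list[str]:
+     """Draw several answers per prompt: no system turn, temperature 0.5."""
+     messages = [{"role": "user", "content": prompt}]  # deliberately no system prompt
+     inputs = tokenizer.apply_chat_template(
+         messages, add_generation_prompt=True, return_tensors="pt"
+     ).to(model.device)
+     outputs = model.generate(
+         inputs,
+         do_sample=True,
+         temperature=0.5,
+         max_new_tokens=256,
+         num_return_sequences=n_samples,
+     )
+     # Strip the prompt tokens, keep only the generated continuation.
+     return [
+         tokenizer.decode(out[inputs.shape[1]:], skip_special_tokens=True)
+         for out in outputs
+     ]
+ ```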
+
+ ### Quantitative metrics
+
+ Although LLM-judge metrics provide a recognized assessment of the quality of the answers, we also wanted auxiliary metrics that could be used for optimization.
+
+ #### Minus log prob
+
+ Since we want our steered model to output answers that are funny and surprising, we expect those answers to have had a low probability under the reference model. We therefore decided to monitor the (minus) log probability per token under the reference model. The exponential of this metric is the perplexity under the reference model, quantifying how surprising the answer would be to the unsteered model. It is also related to the cross-entropy component of the KL divergence between the output distribution of the steered model and that of the reference model.
+
+ Note however that we did not have an a priori idea of a suitable value. On the one hand, a low value would indicate answers that would hardly have been surprising under the reference model; on the other hand, high values might indicate gibberish or incoherent answers.
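+
+ A minimal sketch of this metric in PyTorch, assuming the prompt and answer are already tokenized and `ref_model` is the unsteered model:
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ @torch.no_grad()
+ def minus_log_prob_per_token(ref_model, prompt_ids, answer_ids):
+     """Mean negative log-probability of the answer tokens under the reference model."""
+     ids = torch.cat([prompt_ids, answer_ids]).unsqueeze(0).to(ref_model.device)
+     logits = ref_model(ids).logits[0]
+     # logits[t] predicts token t+1: score every token given its prefix...
+     log_probs = F.log_softmax(logits[:-1].float(), dim=-1)
+     token_lp = log_probs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
+     # ...then keep only the positions belonging to the answer.
+     answer_lp = token_lp[len(prompt_ids) - 1:]
+     return (-answer_lp.mean()).item()
+ ```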
+
+ #### n-gram Repetition
+
+ On top of that, since steering too hard might induce repetitive gibberish, we measured the fraction of repeated 3-grams in the answers (one minus the fraction of unique 3-grams). For short answers, values above 0.15 generally correspond to annoying repetitions that impair the fluency of the answer.
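+
+ Both this metric and the keyword check of the next section fit in a few lines; a minimal sketch, assuming word-level 3-grams:
+
+ ```python
+ def repeated_trigram_fraction(text: str) -> float:
+     """Fraction of word-level 3-grams that repeat an earlier 3-gram."""
+     words = text.lower().split()
+     trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
+     if not trigrams:
+         return 0.0
+     return 1.0 - len(set(trigrams)) / len(trigrams)
+
+ def mentions_concept(text: str, keyword: str = "eiffel") -> bool:
+     """Explicit concept inclusion: case-insensitive keyword lookup."""
+     return keyword in text.lower()
+ ```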
+
+ #### Explicit concept inclusion
+
+ Finally, as an objective auxiliary metric to monitor, we simply looked for the occurrence of the word "eiffel" in the answer (case-insensitive).
+
+ ## Sweeping steering coefficients
+
+ The naive steering scheme involves adding a steering vector, multiplied by a coefficient $\alpha$, to the activations: $h \leftarrow h + \alpha \, d$, where $d$ is the decoder direction of the chosen SAE feature.
+ We thus have to choose a suitable value for $\alpha$.
+ To do so, we performed a sweep over a range of values for $\alpha$ and evaluated the resulting model using the metrics described in the previous section.
+ We used the same set of 25 prompts as before, and generated 4 samples per prompt.
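+
+ A minimal sketch of this naive scheme with a forward hook on the layer-15 decoder block, reusing `model` and `sample_answers` from the earlier sketch; `sae` is a hypothetical loaded SAE whose `W_dec` rows are assumed to be the feature directions:
+
+ ```python
+ LAYER = 15   # steer the residual stream after this decoder block
+ ALPHA = 6.0  # steering coefficient alpha to sweep
+ steer_dir = sae.W_dec[21576]  # assumed layout: one decoder row per feature
+
+ def steering_hook(module, inputs, output):
+     # Llama decoder blocks return a tuple; hidden states are the first element.
+     hidden = output[0] if isinstance(output, tuple) else output
+     hidden = hidden + ALPHA * steer_dir.to(hidden.device, hidden.dtype)
+     return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
+
+ handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
+ try:
+     answers = sample_answers("Hi! Who are you?")
+ finally:
+     handle.remove()  # always restore the unsteered model
+ ```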
+
+ ### 1D sweeps
+
+ import sweep_1D_analysis from 'assets/image/sweep_1D_analysis.png'
+
+ <Image src={sweep_1D_analysis} alt="1D sweep of steering coefficient" caption="1D sweep of steering coefficient $\alpha$ for a single steering vector." />
+
+ ### Correlations between metrics
+
+ import metrics_correlation from 'assets/image/metrics_correlation_matrix.png'
+
+ <Image src={metrics_correlation} alt="Correlation matrix between metrics" caption="Correlation matrix between metrics." />
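+
+ Such a matrix can be computed directly from the per-answer metric table; a minimal sketch, assuming the evaluation loop stores one record per generated sample in a pandas DataFrame (column names are illustrative):
+
+ ```python
+ import pandas as pd
+
+ # One record per (prompt, sample); values below are placeholders.
+ records = [
+     {"alpha": 6.0, "concept": 2, "instruct": 2, "fluency": 2,
+      "minus_log_prob": 1.8, "trigram_rep": 0.02, "mentions_eiffel": 1},
+     # ... more rows from the sweep
+ ]
+ df = pd.DataFrame(records)
+
+ # Pearson correlation between every pair of numeric metrics.
+ print(df.corr(numeric_only=True).round(2))
+ ```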
+
+ ## Improvements
+
+ ### Clamping
+
+ ### Repetition penalty
+
+ ## Multi-Layer optimization
+
+ ### Layer selection
+
+ ### Bayesian optimization
+
+ Motivation: noise, high dimension, no gradient, black-box, expensive function
+
+ Choice of cost function
+
+ ### Gradient descent
+
+ mu and sigma
+
+ Choice of solution, beta
+
+ ### Results
+
+ ## Discussion
+
+ Reason: more coherent behavior,
+ Biology analogy, pathways
+
+ Can we do BO with an LLM as a judge?
+
+ Sparsity constraint on features: good or
app/src/content/assets/image/golden_gate_claude_snowhite.jpeg ADDED

Git LFS Details

  • SHA256: bbe75a1d3e4a0974acce3593904fb5a1357fe37d952d8a7df5d4837f36c241dc
  • Pointer size: 130 Bytes
  • Size of remote file: 68.1 kB
app/src/content/assets/image/metrics_correlation_matrix.png ADDED

Git LFS Details

  • SHA256: f393049e9c17325edc78477c0d4a075cfb54fbeb1304e398cd6f78797fec85a8
  • Pointer size: 130 Bytes
  • Size of remote file: 67.4 kB
app/src/content/assets/image/neuronpedia_examples.afdesign ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:346d3972b8696c7fd26fd88a8a31ef32c9d46f5fe885500a24b66af67826c680
+ size 5276251
app/src/content/assets/image/neuronpedia_examples.png ADDED

Git LFS Details

  • SHA256: 05d580fe8724f0dc7951d7b65dc071a754a8aaffac104a86820e0fc0deca3dba
  • Pointer size: 131 Bytes
  • Size of remote file: 506 kB
app/src/content/assets/image/sweep_1D_analysis.png ADDED

Git LFS Details

  • SHA256: a3704ed97e498af957531da3c1ca311e6e7b29f1aab45faf83284537d9e2ab8c
  • Pointer size: 131 Bytes
  • Size of remote file: 128 kB
app/src/content/chapters/your-first-chapter.mdx DELETED
@@ -1,2 +0,0 @@
- # this is an example chapter
-