dlouapre (HF Staff) committed · Commit 3049243 · 1 Parent(s): 550cda1

Improvements

Files changed (1)
  1. app/src/content/article.mdx +31 -21
app/src/content/article.mdx CHANGED
@@ -1,8 +1,8 @@
  ---
  title: "The Eiffel Tower Llama"
- subtitle: "Reproducing the Golden Gate Claude experiment with open-source models, because steering with SAEs is harder than you think."
+ subtitle: "Reproducing the Golden Gate Claude experiment with open-source models, and establishing a methodology for it."

- description: "Reproducing the Golden Gate Claude experiment with open-source models, because steering with SAEs is harder than you think."
+ description: "Reproducing the Golden Gate Claude experiment with open-source models, and establishing a methodology for it."
  authors:
    - name: "David Louapre"
      url: "https://huggingface.co/dlouapre"
@@ -10,7 +10,7 @@ authors:
  affiliations:
    - name: "Hugging Face"
      url: "https://huggingface.co"
- published: "Oct. 01, 2025"
+ published: "Nov. 18, 2025"
  doi: 10.1234/abcd.efgh
  licence: >
    Diagrams and text are licensed under <a href="https://creativecommons.org/licenses/by/4.0/" target="_blank" rel="noopener noreferrer">CC‑BY 4.0</a> with the source available on <a href="https://huggingface.co/spaces/tfrere/research-article-template" target="_blank" rel="noopener noreferrer">Hugging Face</a>, unless noted otherwise.
@@ -47,15 +47,16 @@ import ggc_snowhite from './assets/image/golden_gate_claude_snowhite.jpeg'

  Since then, sparse autoencoders (SAEs) have become one of the key tools in the field of mechanistic interpretability [@cunningham2023sparse; @lieberum2024gemma; @gao2024scaling], and steering activations has sparked the interest of many (see for instance [the value of steering](https://thezvi.substack.com/i/144959102/the-value-of-steering) by Zvi Mowshowitz, or the work by [GoodFire AI](https://www.goodfire.ai/blog/feature-steering-for-reliable-and-expressive-ai-engineering)).

- However, as far as I know, **nobody has tried to reproduce something similar to the Golden Gate Claude demo!** Moreover, recently the AxBench paper [@wu2025axbench] found that steering with SAEs was *one of the least effective methods to steer a model toward a desired concept*. How can we reconcile this with the success of the Golden Gate Claude demo?
+ However, as far as I know, **nobody has tried to reproduce something similar to the Golden Gate Claude demo!** Moreover, the recent AxBench paper [@wu2025axbench] found that steering with SAEs was *one of the least effective methods to steer a model toward a desired concept*. How can we reconcile this with the success of the Golden Gate Claude?

  The aim of this article is to investigate how SAEs can be used to reproduce **a demo similar to Golden Gate Claude, but using a lightweight open-source model**. For this we used *Llama 3.1 8B Instruct*, but since I live in Paris... let's make it obsessed with the Eiffel Tower!

- By doing this, we will realize that steering a model with vectors coming from SAEs is actually harder than we might have thought. However, we will devise several improvements over naive steering. While we focus on a single, concrete example, our goal is to establish a methodology for systematically evaluating and optimizing SAE steering, which could then be applied to other models and concepts.
+ By doing this, we will realize that steering a model with vectors coming from SAEs is actually harder than we might have thought. However, we will devise several improvements over naive steering. While we focus on a single, concrete example (the Eiffel Tower), our goal is to establish a methodology for systematically evaluating and optimizing SAE steering, which could then be applied to other models and concepts.

- <Note title="Our Main Findings" variant="success">
- - **The steering 'sweet spot' is smaller than you think.** The optimal steering strength is roughly half the magnitude of a layer's typical activation. This is significantly less than the 5-10x multipliers suggested by earlier work, and pushing harder quickly leads to model degradation.
- - **Clamping is more effective than adding.** We found that clamping activations (capping them at a maximum value) improves concept inclusion without harming fluency. This aligns with the method used in the Golden Gate Claude demo but directly contradicts the findings reported in AxBench.
+ **Our main findings:**
+ <Note title="" variant="success">
+ - **The steering 'sweet spot' is smaller than you think.** The optimal steering strength is roughly half the magnitude of a layer's typical activation. This is consistent with the idea that steering vectors should not overwhelm the model's natural activations.
+ - **Clamping is more effective than adding.** We found that clamping activations (capping them at a maximum value) improves concept inclusion without harming fluency. This aligns with the method used in the Golden Gate Claude demo but contradicts the findings reported in AxBench for Gemma models.
  - **More features don't necessarily mean better steering.** Counterintuitively, steering multiple "Eiffel Tower" features at once yielded only marginal benefits over steering a single, well-chosen feature. This challenges the hypothesis that combining features is the key to robust control.
  - **SAE steering shows promise, but prompting is still king.** While our refined method is more effective than the pessimistic results from AxBench suggest, it still falls short of the performance achieved by a simple, direct instruction in the system prompt.
  </Note>
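To make the first two findings concrete, here is a minimal PyTorch sketch of the two intervention styles; the tensor names, shapes, and standalone setup are illustrative assumptions, not the article's actual implementation.

```python
import torch

# Illustrative tensors (hypothetical names, not from the article):
# `resid` is a residual-stream activation [batch, seq, d_model];
# `d` is the unit-norm SAE decoder direction of the chosen feature.
d_model = 4096
resid = torch.randn(1, 8, d_model)
d = torch.nn.functional.normalize(torch.randn(d_model), dim=0)

# "Sweet spot" heuristic from the findings: steer at roughly half
# the layer's typical activation magnitude.
alpha = 0.5 * resid.norm(dim=-1).mean()

# Additive steering: shift every position along the feature direction,
# on top of whatever the feature already does.
resid_added = resid + alpha * d

# Clamping: set the feature's activation (its projection onto d) to a
# fixed target value, instead of adding on top of it.
proj = resid @ d                                  # [batch, seq]
resid_clamped = resid + (alpha - proj).unsqueeze(-1) * d
```

After the clamped update, the projection of `resid_clamped` onto `d` equals `alpha` at every position regardless of how strongly the feature already fired, which is one intuition for why clamping might degrade fluency less than unconditional addition.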
@@ -182,9 +183,16 @@ We decided to use an identical approach, using the more recent open-source model

  ```text
  [System]
- Please act as an impartial judge and evaluate whether the specified concept is presented in the text fragment provided below.
- Focus solely on whether the concept is clearly incorporated, without regard for grammar or logical coherence.
- Rate the concept’s relevance on a scale from 0 to 2, where 0 indicates the concept is not present at all, 1 indicates the concept is somewhat present but minimally or awkwardly incorporated, and 2 indicates the concept is more fully and effectively incorporated, with stronger and more natural integration. Provide your rating using this exact format: “Rating: [[score]]”.
+ Please act as an impartial judge and evaluate whether the specified
+ concept is presented in the text fragment provided below.
+ Focus solely on whether the concept is clearly incorporated, without
+ regard for grammar or logical coherence.
+ Rate the concept’s relevance on a scale from 0 to 2, where 0
+ indicates the concept is not present at all, 1 indicates the concept
+ is somewhat present but minimally or awkwardly incorporated, and 2
+ indicates the concept is more fully and effectively incorporated,
+ with stronger and more natural integration.
+ Provide your rating using this exact format: “Rating: [[score]]”.

  [Concept Start]
  {concept}
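The rigid "Rating: [[score]]" format is what makes the judge's reply machine-parseable. A minimal parsing sketch (an illustration, not code from the article or AxBench):

```python
import re

def parse_judge_rating(reply: str) -> int | None:
    """Extract the 0-2 score from a reply following 'Rating: [[score]]'."""
    match = re.search(r"Rating:\s*\[\[([0-2])\]\]", reply)
    return int(match.group(1)) if match else None

# Usage sketch: malformed replies return None and can be retried.
assert parse_judge_rating("The concept is central. Rating: [[2]]") == 2
assert parse_judge_rating("No rating given.") is None
```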
@@ -483,18 +491,20 @@ A way to explain this lack of improvement could be that the selected features ar

  Overall, our results seem less discouraging than those of AxBench, and show that steering with SAEs can be effective, using clamping, a slightly different generation procedure, and possibly combining multiple features. However, at this stage, those results are hard to generalize, and our work is not directly comparable to the AxBench results, since they use a different model, different concepts, different SAEs (GemmaScope vs Andy Arditi's), and different prompts. This is more in line with the success of the Golden Gate Claude demo, although we don't know all the details of their steering method.

- <Note variant="info" title="Possible next steps">
+ ### 6.2 Opportunities for future work
+
+ This investigation opens several avenues for future work, among them:
+
  - **Failure analysis** on the cases where steering fails (about 20% have at least one zero metric). Is there a pattern?
  - **Why does steering multiple features achieve only marginal improvement?** Check complementarity vs redundancy of multiple features by monitoring activation changes in subsequent layers' features.
- - Check other layers for 1D optimization, see if some layers are better than others. Or results that are qualitatively different.
- - Try to include earlier (3) and later (27) layers, see if it helps
- - Try other concepts, see if results are similar
- - Try with larger models, see if results are better
- - Vary the temporal steering pattern: steer prompt only, or answer only, or periodic steering
- - Investigate clamping: why do we find that clamping helps, similar to Anthropic, while AxBench found the opposite? We could hypothesize it prevents extreme activations, but it could also counteract some negative feedback behavior, when other parts of the model try to compensate for the added steering vector. (analogy with biology, where signaling pathways are often regulated by negative feedback loops)
- - Analyze the cases where the model try to "backtrack", e.g. "I'm the Eiffel Tower. No, actaully I'm not." By analyzing the activations just before the "No", can we highlight some "regulatory" features that try to suppress the Eiffel Tower concept when it has been overactivated?
- - In the "prompt engineering" case, investigate the impact of prompt wording. For now the model seems to really behave like it has to check a box, rather than actually integrating the concept in a natural way. Can we make it better ? Does it shows up in the activation pattern ? For instance after mentionning the Eiffel tower, does the model activate "suppressing" features to prevent further mentions ?
- </Note>
+ - **Check other layers for 1D optimization**: see if some layers are better than others, or give qualitatively different results.
+ - **Try to include earlier (L3) and later (L27) layers** to see if it helps the multi-layer steering.
+ - **Explore this methodology on other concepts** to see if results are similar or different.
+ - **Test other models** to see if results are similar or different.
+ - **Vary the temporal steering pattern:** steer only the prompt, or only the generated answer, or use some kind of periodic steering (sketched below).
+ - **Investigate clamping:** why do we find that clamping helps, similar to Anthropic, while AxBench found the opposite? We could hypothesize that it prevents extreme activations, but it could also counteract some negative-feedback behavior, where other parts of the model try to compensate for the added steering vector. Is there an analogy with biology, where signaling pathways are often regulated by negative feedback loops?
+ - **Analyze the cases where the model tries to "backtrack"**, e.g. *"I'm the Eiffel Tower. No, actually I'm not."* By analyzing the activations just before the "No", can we highlight some *regulatory features* that try to suppress the Eiffel Tower concept when it has been overactivated?
+ - **Investigate wording in the "prompt engineering" case.** For now the model behaves as if it just has to check a box, rather than integrating the concept in a natural way. Can we make it better? Does this show up in the activation pattern? For instance, after mentioning the Eiffel Tower, does the model activate regulatory features to prevent further mentions?

  ---
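For the "temporal steering pattern" item above, here is a sketch of how a steering hook could be toggled between the prompt (prefill) and answer (decoding) phases; `model`, `steer_fn`, and `LAYER` are hypothetical placeholders, not the article's code.

```python
import torch

class GatedSteering:
    """Forward hook that applies `steer_fn` only while `enabled` is True."""
    def __init__(self, steer_fn):
        self.steer_fn = steer_fn
        self.enabled = True

    def __call__(self, module, inputs, output):
        # Returning a non-None value from a forward hook replaces the
        # module's output, so steering can be switched on and off freely.
        return self.steer_fn(output) if self.enabled else output

# Usage sketch: steer only the generated answer, not the prompt.
# gate = GatedSteering(steer_fn)
# handle = model.model.layers[LAYER].register_forward_hook(gate)
# gate.enabled = False   # prefill the prompt unsteered
# gate.enabled = True    # then steer while decoding token by token
# handle.remove()
```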
 
 
510