working on requirements
- PROJECT.md +214 -0
- pyproject.toml +13 -0
- requirements.txt +0 -485
- uv.lock +0 -0
PROJECT.md
ADDED
@@ -0,0 +1,214 @@
# Project Overview: Steered LLM Generation with SAE Features

## What This Project Does

This project demonstrates **activation steering** of large language models using Sparse Autoencoder (SAE) features. It modifies the internal activations of Llama 3.1 8B Instruct during text generation to control the model's behavior and output characteristics.

## Core Concept

Sparse Autoencoders (SAEs) decompose neural network activations into interpretable features. By extracting specific feature vectors from SAEs and adding them to the model's hidden states during generation, we can "steer" the model toward desired behaviors without fine-tuning.
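In code terms, steering is just a scaled vector addition on a hidden state. A minimal, self-contained sketch (illustrative only; the shapes and names below are assumptions, not the project's exact code):

```python
import torch

hidden_dim = 4096                          # Llama 3.1 8B hidden size
h = torch.randn(hidden_dim)                # a hidden state at some layer position
d_feature = torch.randn(hidden_dim)        # stand-in for one SAE decoder column
d_feature = d_feature / d_feature.norm()   # normalize the feature direction
strength = 1.0

h_steered = h + strength * d_feature       # nudge the activation toward the feature
```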
## Architecture

```
User Input → Tokenizer → Model with Forward Hooks → Steered Generation → Output
                                    ↑
                             Steering Vectors
                          (from pre-trained SAEs)
```

## Key Components

### 1. **Steering Vectors** (`steering.py`, `extract_steering_vectors.py`)

**Source**: SAE decoder weights from `andyrdt/saes-llama-3.1-8b-instruct`

**Extraction Process**:
- SAEs are trained to reconstruct model activations: `x ≈ decoder @ encoder(x)`
- Each decoder column represents a feature direction in activation space
- We extract specific columns (features) that produce desired behaviors
- Vectors are normalized and stored in `steering_vectors.pt` (see the sketch below)

**Functions**:
- `load_saes()`: Downloads SAE files from HuggingFace Hub and extracts features
- `load_saes_from_file()`: Fast loading from pre-extracted vectors (preferred)
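To make the extraction step concrete, here is a hedged sketch of pulling one decoder column, normalizing it, and saving it in the format the demo could load. The decoder orientation `[hidden_dim, num_features]`, the `{(layer, feature_idx): vector}` storage layout, and the dummy sizes are assumptions for illustration, not the exact code in `extract_steering_vectors.py`:

```python
import torch

def extract_steering_vector(decoder: torch.Tensor, feature_idx: int) -> torch.Tensor:
    """Take one decoder column (a feature direction) and normalize it."""
    vector = decoder[:, feature_idx].clone()   # shape: [hidden_dim]
    return vector / vector.norm()

# Dummy decoder with small sizes so the sketch runs anywhere;
# the real SAE decoder would be roughly [4096, 131072] for a 128k-feature SAE.
decoder = torch.randn(64, 1024)

layer, feature_idx = 11, 999   # e.g. (11, 74457) in the real config; 999 fits the dummy decoder
vectors = {(layer, feature_idx): extract_steering_vector(decoder, feature_idx)}
torch.save(vectors, "steering_vectors.pt")
```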

### 2. **Steering Implementation** (`steering.py`)

**Two Backends**:

#### A. **NNsight Backend** (for research/analysis)
- Uses `generate_steered_answer()` with NNsight's intervention API
- Modifies activations during generation using context managers
- Good for: experimentation, debugging, understanding interventions

#### B. **Transformers Backend** (for production/deployment)
- Uses `stream_steered_answer_hf()` with PyTorch forward hooks
- Direct hook registration on transformer layers
- Good for: deployment, streaming, efficiency

**Steering Mechanism** (`create_steering_hook()`):

```python
def hook(module, input, output):
    hidden_states = output[0]  # Shape: [batch, seq_len, hidden_dim]
    seq_len = hidden_states.shape[1]

    # `layer_components` and `clamp_intensity` are closed over by create_steering_hook()
    for steering_component in layer_components:
        vector = steering_component['vector']      # Direction to steer, shape [hidden_dim]
        strength = steering_component['strength']  # How much to steer

        # Add steering to each token in the sequence: [1, seq_len, hidden_dim]
        amount = (strength * vector).unsqueeze(0).expand(seq_len, -1).unsqueeze(0)

        if clamp_intensity:
            # Remove the existing projection onto `vector` to prevent over-steering
            projection = (hidden_states @ vector).unsqueeze(-1) * vector
            amount = amount - projection

        hidden_states = hidden_states + amount

    return (hidden_states,) + output[1:]
```

**Key Insight**: Hooks are applied at specific layers during the forward pass, modifying activations before they propagate to subsequent layers.

### 3. **Configuration** (`demo.yaml`)

```yaml
features:
  - [layer, feature_idx, strength]
  # Example: [11, 74457, 1.03]
  # Applies feature 74457 from layer 11 with strength 1.03
```

**Parameters**:
- `layer`: Which transformer layer to apply steering (0-31 for Llama 8B)
- `feature_idx`: Which SAE feature to use (0-131071 for 128k SAE)
- `strength`: Multiplicative factor for steering intensity
- `clamp_intensity`: If true, removes existing projection before adding steering
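As an illustration of how such a config could be turned into per-layer steering components for the hook, here is a hedged sketch; the function name and the assumed `{(layer, feature_idx): vector}` layout of `steering_vectors.pt` are hypothetical, not necessarily what `steering.py` does:

```python
import torch
import yaml

def load_steering_components(config_path="demo.yaml", vectors_path="steering_vectors.pt"):
    """Group configured features by layer: {layer: [{'vector': ..., 'strength': ...}, ...]}."""
    with open(config_path) as f:
        config = yaml.safe_load(f)
    vectors = torch.load(vectors_path)   # assumed mapping (layer, feature_idx) -> tensor

    components_by_layer = {}
    for layer, feature_idx, strength in config["features"]:
        components_by_layer.setdefault(layer, []).append({
            "vector": vectors[(layer, feature_idx)],
            "strength": strength,
        })
    return components_by_layer
```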

### 4. **Applications**

#### A. **Console Demo** (`demo.py`)
- Interactive chat interface in terminal
- Supports both NNsight and Transformers backends (configurable via `BACKEND`)
- Real-time streaming with transformers backend
- Color-coded output for better UX

#### B. **Web App** (`app.py`)
- Gradio interface for web deployment
- Streaming generation with `TextIteratorStreamer`
- Multi-turn conversation support
- ZeroGPU compatible for HuggingFace Spaces

## Implementation Details

### Device Management

**ZeroGPU Compatible**:
```python
# Model loaded with device_map="auto"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Steering vectors on CPU initially (Spaces mode)
load_device = "cpu" if SPACES_AVAILABLE else device

# Hooks automatically move vectors to GPU during inference
vector = vector.to(dtype=hidden_states.dtype, device=hidden_states.device)
```
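On ZeroGPU Spaces, the GPU-bound entry point is typically wrapped with the `spaces.GPU` decorator so a device is attached only while that call runs. A minimal sketch; the function name is hypothetical and the real `app.py` may structure this differently:

```python
import spaces

@spaces.GPU  # request a ZeroGPU device only for the duration of this call
def steered_generate(message, history):
    # register steering hooks, run model.generate(..., streamer=...), yield text
    ...
```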

### Streaming Generation

Uses threading to enable real-time token streaming:
```python
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
thread = Thread(target=lambda: model.generate(..., streamer=streamer))
thread.start()

for token_text in streamer:
    yield token_text  # Send to UI as tokens arrive
```

### Hook Registration

```python
# Register hooks on specific layers
for layer_idx in layers_to_steer:
    hook_fn = create_steering_hook(layer_idx, steering_components)
    handle = model.model.layers[layer_idx].register_forward_hook(hook_fn)
    hook_handles.append(handle)

# Generate with steering
model.generate(...)

# Clean up
for handle in hook_handles:
    handle.remove()
```

## Technical Advantages

1. **No Fine-tuning Required**: Steers pre-trained models without retraining
2. **Interpretable**: SAE features are more interpretable than raw activations
3. **Composable**: Multiple steering vectors can be combined
4. **Efficient**: Only modifies forward pass, no backward pass needed
5. **Dynamic**: Different steering per generation, configurable at runtime

## Limitations

1. **SAE Dependency**: Requires pre-trained SAEs for the target model
2. **Manual Feature Selection**: Finding effective features requires experimentation
3. **Strength Tuning**: Steering strength needs calibration per feature
4. **Computational Overhead**: Small overhead from hook execution during generation

## File Structure

```
eiffel-demo/
├── app.py                        # Gradio web interface
├── demo.py                       # Console chat interface
├── steering.py                   # Core steering implementation
├── extract_steering_vectors.py   # SAE feature extraction
├── demo.yaml                     # Configuration (features, params)
├── steering_vectors.pt           # Pre-extracted vectors (generated)
├── print_utils.py                # Terminal formatting utilities
├── requirements.txt              # Dependencies
├── README.md                     # User documentation
└── PROJECT.md                    # This file
```

## Dependencies

**Core**:
- `transformers`: Model loading and generation
- `torch`: Neural network operations
- `gradio`: Web interface
- `nnsight`: Alternative intervention framework (optional)
- `sae-lens`: SAE utilities (for extraction only)

**Deployment**:
- `spaces`: HuggingFace Spaces ZeroGPU support
- `hf-transfer`: Fast model downloads

## Usage Flow

1. **Setup**: Extract steering vectors once
   ```bash
   python extract_steering_vectors.py
   ```

2. **Configure**: Edit `demo.yaml` to select features and strengths

3. **Run**: Launch console or web interface
   ```bash
   python demo.py   # Console
   python app.py    # Web app
   ```

4. **Deploy**: Upload to HuggingFace Spaces with ZeroGPU

## References

- SAE Repository: `andyrdt/saes-llama-3.1-8b-instruct`
- Base Model: `meta-llama/Llama-3.1-8B-Instruct`
- Technique: Activation steering via learned SAE features
pyproject.toml
ADDED
@@ -0,0 +1,13 @@
[project]
name = "eiffel-demo"
version = "0.1.0"
description = "Steered LLM demo using SAE features with Gradio interface"
requires-python = ">=3.11"
dependencies = [
    "torch>=2.8.0",
    "transformers>=4.56.2",
    "gradio>=4.0.0",
    "pyyaml>=6.0",
    "accelerate>=0.20.0",
    "spaces==0.28.3"
]
requirements.txt
DELETED
@@ -1,485 +0,0 @@
# This file was autogenerated by uv via the following command:
#    uv pip compile pyproject.toml -o requirements.txt
accelerate==1.11.0
    # via
    #   eiffel-demo (pyproject.toml)
    #   nnsight
    #   transformer-lens
aiofiles==24.1.0
    # via gradio
aiohappyeyeballs==2.6.1
    # via aiohttp
aiohttp==3.13.2
    # via fsspec
aiosignal==1.4.0
    # via aiohttp
annotated-doc==0.0.3
    # via fastapi
annotated-types==0.7.0
    # via pydantic
anyio==4.11.0
    # via
    #   gradio
    #   httpx
    #   starlette
astor==0.8.1
    # via nnsight
asttokens==3.0.0
    # via stack-data
attrs==25.4.0
    # via aiohttp
babe==0.0.7
    # via sae-lens
beartype==0.14.1
    # via transformer-lens
better-abc==0.0.3
    # via transformer-lens
bidict==0.23.1
    # via python-socketio
brotli==1.1.0
    # via gradio
certifi==2025.10.5
    # via
    #   httpcore
    #   httpx
    #   requests
    #   sentry-sdk
charset-normalizer==3.4.4
    # via requests
click==8.3.0
    # via
    #   nltk
    #   typer
    #   uvicorn
    #   wandb
cloudpickle==3.1.2
    # via nnsight
config2py==0.1.42
    # via py2store
datasets==4.4.0
    # via
    #   sae-lens
    #   transformer-lens
decorator==5.2.1
    # via ipython
dill==0.4.0
    # via
    #   datasets
    #   multiprocess
docstring-parser==0.17.0
    # via simple-parsing
dol==0.3.31
    # via
    #   config2py
    #   graze
    #   py2store
einops==0.8.1
    # via transformer-lens
executing==2.2.1
    # via stack-data
fancy-einsum==0.0.3
    # via transformer-lens
fastapi==0.121.0
    # via gradio
ffmpy==0.6.4
    # via gradio
filelock==3.20.0
    # via
    #   datasets
    #   huggingface-hub
    #   torch
    #   transformers
frozenlist==1.8.0
    # via
    #   aiohttp
    #   aiosignal
fsspec==2025.10.0
    # via
    #   datasets
    #   gradio-client
    #   huggingface-hub
    #   torch
gitdb==4.0.12
    # via gitpython
gitpython==3.1.45
    # via wandb
gradio==5.49.1
    # via eiffel-demo (pyproject.toml)
gradio-client==1.13.3
    # via gradio
graze==0.1.39
    # via babe
groovy==0.1.2
    # via gradio
h11==0.16.0
    # via
    #   httpcore
    #   uvicorn
    #   wsproto
hf-transfer==0.1.9
    # via eiffel-demo (pyproject.toml)
hf-xet==1.2.0
    # via huggingface-hub
httpcore==1.0.9
    # via httpx
httpx==0.28.1
    # via
    #   datasets
    #   gradio
    #   gradio-client
    #   safehttpx
huggingface-hub==0.36.0
    # via
    #   accelerate
    #   datasets
    #   gradio
    #   gradio-client
    #   tokenizers
    #   transformers
i2==0.1.58
    # via config2py
idna==3.11
    # via
    #   anyio
    #   httpx
    #   requests
    #   yarl
importlib-resources==6.5.2
    # via py2store
ipython==9.6.0
    # via nnsight
ipython-pygments-lexers==1.1.1
    # via ipython
jaxtyping==0.3.3
    # via transformer-lens
jedi==0.19.2
    # via ipython
jinja2==3.1.6
    # via
    #   gradio
    #   torch
joblib==1.5.2
    # via nltk
markdown-it-py==4.0.0
    # via rich
markupsafe==3.0.3
    # via
    #   gradio
    #   jinja2
matplotlib-inline==0.2.1
    # via ipython
mdurl==0.1.2
    # via markdown-it-py
mpmath==1.3.0
    # via sympy
multidict==6.7.0
    # via
    #   aiohttp
    #   yarl
multiprocess==0.70.18
    # via datasets
narwhals==2.10.1
    # via plotly
networkx==3.5
    # via torch
nltk==3.9.2
    # via sae-lens
nnsight==0.5.10
    # via eiffel-demo (pyproject.toml)
numpy==1.26.4
    # via
    #   accelerate
    #   datasets
    #   gradio
    #   pandas
    #   patsy
    #   plotly-express
    #   scipy
    #   statsmodels
    #   transformer-lens
    #   transformers
nvidia-cublas-cu12==12.8.4.1
    # via
    #   nvidia-cudnn-cu12
    #   nvidia-cusolver-cu12
    #   torch
nvidia-cuda-cupti-cu12==12.8.90
    # via torch
nvidia-cuda-nvrtc-cu12==12.8.93
    # via torch
nvidia-cuda-runtime-cu12==12.8.90
    # via torch
nvidia-cudnn-cu12==9.10.2.21
    # via torch
nvidia-cufft-cu12==11.3.3.83
    # via torch
nvidia-cufile-cu12==1.13.1.3
    # via torch
nvidia-curand-cu12==10.3.9.90
    # via torch
nvidia-cusolver-cu12==11.7.3.90
    # via torch
nvidia-cusparse-cu12==12.5.8.93
    # via
    #   nvidia-cusolver-cu12
    #   torch
nvidia-cusparselt-cu12==0.7.1
    # via torch
nvidia-nccl-cu12==2.27.5
    # via torch
nvidia-nvjitlink-cu12==12.8.93
    # via
    #   nvidia-cufft-cu12
    #   nvidia-cusolver-cu12
    #   nvidia-cusparse-cu12
    #   torch
nvidia-nvshmem-cu12==3.3.20
    # via torch
nvidia-nvtx-cu12==12.8.90
    # via torch
orjson==3.11.4
    # via gradio
packaging==25.0
    # via
    #   accelerate
    #   datasets
    #   gradio
    #   gradio-client
    #   huggingface-hub
    #   plotly
    #   statsmodels
    #   transformers
    #   wandb
pandas==2.3.3
    # via
    #   babe
    #   datasets
    #   gradio
    #   plotly-express
    #   statsmodels
    #   transformer-lens
parso==0.8.5
    # via jedi
patsy==1.0.2
    # via
    #   plotly-express
    #   statsmodels
pexpect==4.9.0
    # via ipython
pillow==11.3.0
    # via gradio
platformdirs==4.5.0
    # via wandb
plotly==6.3.1
    # via
    #   plotly-express
    #   sae-lens
plotly-express==0.4.1
    # via sae-lens
prompt-toolkit==3.0.52
    # via ipython
propcache==0.4.1
    # via
    #   aiohttp
    #   yarl
protobuf==6.33.0
    # via wandb
psutil==7.1.3
    # via accelerate
ptyprocess==0.7.0
    # via pexpect
pure-eval==0.2.3
    # via stack-data
py2store==0.1.22
    # via babe
pyarrow==22.0.0
    # via datasets
pydantic==2.11.10
    # via
    #   fastapi
    #   gradio
    #   nnsight
    #   wandb
pydantic-core==2.33.2
    # via pydantic
pydub==0.25.1
    # via gradio
pygments==2.19.2
    # via
    #   ipython
    #   ipython-pygments-lexers
    #   rich
python-dateutil==2.9.0.post0
    # via pandas
python-dotenv==1.2.1
    # via sae-lens
python-engineio==4.12.3
    # via python-socketio
python-multipart==0.0.20
    # via gradio
python-socketio==5.14.3
    # via nnsight
pytz==2025.2
    # via pandas
pyyaml==6.0.3
    # via
    #   eiffel-demo (pyproject.toml)
    #   accelerate
    #   datasets
    #   gradio
    #   huggingface-hub
    #   sae-lens
    #   transformers
    #   wandb
regex==2025.11.3
    # via
    #   nltk
    #   transformers
requests==2.32.5
    # via
    #   datasets
    #   graze
    #   huggingface-hub
    #   python-socketio
    #   transformers
    #   wandb
rich==14.2.0
    # via
    #   nnsight
    #   transformer-lens
    #   typer
ruff==0.14.3
    # via gradio
sae-lens==6.21.0
    # via eiffel-demo (pyproject.toml)
safehttpx==0.1.7
    # via gradio
safetensors==0.6.2
    # via
    #   accelerate
    #   sae-lens
    #   transformers
scipy==1.16.3
    # via
    #   plotly-express
    #   statsmodels
semantic-version==2.10.0
    # via gradio
sentencepiece==0.2.1
    # via transformer-lens
sentry-sdk==2.43.0
    # via wandb
shellingham==1.5.4
    # via typer
simple-parsing==0.1.7
    # via sae-lens
simple-websocket==1.1.0
    # via python-engineio
six==1.17.0
    # via python-dateutil
smmap==5.0.2
    # via gitdb
sniffio==1.3.1
    # via anyio
stack-data==0.6.3
    # via ipython
starlette==0.49.3
    # via
    #   fastapi
    #   gradio
statsmodels==0.14.5
    # via plotly-express
sympy==1.14.0
    # via torch
tenacity==9.1.2
    # via sae-lens
tokenizers==0.22.1
    # via transformers
toml==0.10.2
    # via nnsight
tomlkit==0.13.3
    # via gradio
torch==2.9.0
    # via
    #   eiffel-demo (pyproject.toml)
    #   accelerate
    #   nnsight
    #   transformer-lens
tqdm==4.67.1
    # via
    #   datasets
    #   huggingface-hub
    #   nltk
    #   transformer-lens
    #   transformers
traitlets==5.14.3
    # via
    #   ipython
    #   matplotlib-inline
transformer-lens==2.16.1
    # via sae-lens
transformers==4.57.1
    # via
    #   eiffel-demo (pyproject.toml)
    #   nnsight
    #   sae-lens
    #   transformer-lens
    #   transformers-stream-generator
transformers-stream-generator==0.0.5
    # via transformer-lens
triton==3.5.0
    # via torch
typeguard==4.4.4
    # via transformer-lens
typer==0.20.0
    # via gradio
typing-extensions==4.15.0
    # via
    #   aiosignal
    #   anyio
    #   fastapi
    #   gradio
    #   gradio-client
    #   huggingface-hub
    #   ipython
    #   pydantic
    #   pydantic-core
    #   sae-lens
    #   simple-parsing
    #   starlette
    #   torch
    #   transformer-lens
    #   typeguard
    #   typer
    #   typing-inspection
    #   wandb
typing-inspection==0.4.2
    # via pydantic
tzdata==2025.2
    # via pandas
urllib3==2.5.0
    # via
    #   requests
    #   sentry-sdk
uvicorn==0.38.0
    # via gradio
wadler-lindig==0.1.7
    # via jaxtyping
wandb==0.22.3
    # via transformer-lens
wcwidth==0.2.14
    # via prompt-toolkit
websocket-client==1.9.0
    # via python-socketio
websockets==15.0.1
    # via gradio-client
wsproto==1.2.0
    # via simple-websocket
xxhash==3.6.0
    # via datasets
yarl==1.22.0
    # via aiohttp

# HuggingFace Spaces ZeroGPU support
spaces==0.28.3
    # via eiffel-demo (for ZeroGPU deployment)
uv.lock
ADDED
The diff for this file is too large to render.