Spaces:
Running
Running
Updating metrics charts in d3
Browse files- app/.astro/astro/content.d.ts +6 -4
- app/src/content/article.mdx +9 -22
- app/src/content/assets/data/evaluation_summary.json +3 -0
- app/src/content/assets/image/evaluation_summary.svg +0 -0
- app/src/content/embeds/d3-evaluation-grid.html +467 -0
- app/src/content/embeds/d3-evaluation1-naive.html +414 -0
- app/src/content/embeds/d3-evaluation2-clamp.html +414 -0
- app/src/content/embeds/d3-evaluation3-multi.html +414 -0
app/.astro/astro/content.d.ts
CHANGED
|
@@ -236,11 +236,13 @@ declare module 'astro:content' {
|
|
| 236 |
};
|
| 237 |
|
| 238 |
type DataEntryMap = {
|
| 239 |
-
"assets":
|
| 240 |
-
|
|
|
|
| 241 |
collection: "assets";
|
| 242 |
-
data: any
|
| 243 |
-
}
|
|
|
|
| 244 |
|
| 245 |
};
|
| 246 |
|
|
|
|
| 236 |
};
|
| 237 |
|
| 238 |
type DataEntryMap = {
|
| 239 |
+
"assets": {
|
| 240 |
+
"data/evaluation_summary": {
|
| 241 |
+
id: "data/evaluation_summary";
|
| 242 |
collection: "assets";
|
| 243 |
+
data: any
|
| 244 |
+
};
|
| 245 |
+
};
|
| 246 |
|
| 247 |
};
|
| 248 |
|
app/src/content/article.mdx
CHANGED
|
@@ -35,7 +35,6 @@ import Glossary from '../components/Glossary.astro';
|
|
| 35 |
import Stack from '../components/Stack.astro';
|
| 36 |
|
| 37 |
|
| 38 |
-
|
| 39 |
In May 2024, Anthropic released a demo called [Golden Gate Claude](https://www.anthropic.com/news/golden-gate-claude).
|
| 40 |
This experiment was meant to showcase the possibility of steering the behavior of a large language model using *sparse autoencoders* trained on the internal activations of the model [@templeton2024scaling].
|
| 41 |
|
|
@@ -54,7 +53,6 @@ The aim of this article is to investigate how SAEs can be used to reproduce **a
|
|
| 54 |
|
| 55 |
By doing this, we will realize that steering a model with vectors coming from SAEs is actually harder than we might have thought. However, we will devise several improvements over naive steering. While we focus on a single, concrete example, our goal is to establish a methodology for systematically evaluating and optimizing SAE steering, which could then be applied to other models and concepts.
|
| 56 |
|
| 57 |
-
**Our main findings are:**
|
| 58 |
<Note title="Our Main Findings" variant="success">
|
| 59 |
- **The steering 'sweet spot' is smaller than you think.** The optimal steering strength is roughly half the magnitude of a layer's typical activation. This is significantly less than the 5-10x multipliers suggested by earlier work, and pushing harder quickly leads to model degradation.
|
| 60 |
- **Clamping is more effective than adding.** We found that clamping activations (capping them at a maximum value) improves concept inclusion without harming fluency. This aligns with the method used in the Golden Gate Claude demo but directly contradicts the findings reported in AxBench.
|
|
@@ -65,12 +63,10 @@ By doing this, we will realize that steering a model with vectors coming from SA
|
|
| 65 |
<iframe
|
| 66 |
src="https://huggingface-eiffel-tower-llama-demo.hf.space"
|
| 67 |
frameborder="0"
|
| 68 |
-
width="
|
| 69 |
height="450"
|
| 70 |
></iframe>
|
| 71 |
|
| 72 |
-
---
|
| 73 |
-
|
| 74 |
## 1. Steering with SAEs
|
| 75 |
|
| 76 |
### 1.1 Model steering and sparse autoencoders
|
|
@@ -117,7 +113,7 @@ Since Llama 3.1 8B has 32 layers, we decided to look at layer 15. We found only
|
|
| 117 |
|
| 118 |
The corresponding Neuronpedia page is included below. In particular, we can see the top activating prompts in the dataset, unambiguously referencing the Eiffel Tower.
|
| 119 |
|
| 120 |
-
<iframe src="https://www.neuronpedia.org/llama3.1-8b-it/15-resid-post-aa/21576?embed=true&embedexplanation=true&embedplots=true&embedtest=true" title="Neuronpedia" style="height: 900px; width:
|
| 121 |
|
| 122 |
In the training dataset, the maximum activation observed for that feature was 4.77.
|
| 123 |
|
|
@@ -163,7 +159,6 @@ In this paper, we will try to steer Llama 3.1 8B Instruct toward the Eiffel Towe
|
|
| 163 |
|
| 164 |
However, for this, we will need rigorous metrics to evaluate the quality of our steered models and compare them to baselines.
|
| 165 |
|
| 166 |
-
---
|
| 167 |
|
| 168 |
## 2. Metrics, we need metrics!
|
| 169 |
|
|
@@ -246,7 +241,6 @@ Finally, and as an objective auxiliary metric to monitor concept inclusion, we s
|
|
| 246 |
We acknowledge that this is a very crude metric, and probably too pessimistic as the model could subtly reference the Eiffel Tower without actually using the word *eiffel*.
|
| 247 |
(For instance, when referring to *a large metal structure built in Paris.*) Of course, as this metric is hard to generalize to other concepts, we will not use it beyond simple monitoring.
|
| 248 |
|
| 249 |
-
---
|
| 250 |
|
| 251 |
## 3. Optimizing steering coefficient for a single feature
|
| 252 |
|
|
@@ -340,9 +334,7 @@ Note that the harmonic mean we obtained here (about 0.45) is higher than the one
|
|
| 340 |
|
| 341 |
Using the optimal steering coefficient $\alpha=8.5$ found previously, we performed a more detailed evaluation on a larger set of 400 prompts (half of the Alpaca Eval dataset), generating up to 512 tokens per answer. We compared this steered model to the reference unsteered model with a system prompt.
|
| 342 |
|
| 343 |
-
|
| 344 |
-
|
| 345 |
-
<Image src={evaluation1_naive} alt="Detailed evaluation of steering with single feature" caption="Detailed evaluation of steering with single feature at optimal coefficient."/>
|
| 346 |
|
| 347 |
We can see that on all metrics, **the baseline prompted model significantly outperforms the steered model.** This is consistent with the findings by AxBench that steering with SAEs is not very effective. However, our numbers are not as dire as theirs. We can see an average score in concept inclusion compared to the reference model (1.03), while maintaining a reasonable level of instruction following (1.35). However, this comes at the price of a fluency drop (0.78 vs. 1.55 for the prompted model), as fluency is impaired by repetitions (0.27) or awkward phrasing.
|
| 348 |
|
|
@@ -378,7 +370,6 @@ From that, we can devise a useful proxy to find good steering coefficients:
|
|
| 378 |
- for 3-gram repetition, the target is 0.0 but inspecting examples reveals that we can accept values up to 0.2 without much harm.
|
| 379 |
- for log probability under the reference model, successful steering seems to happen when the log prob is between -1.5 and -1.0.
|
| 380 |
|
| 381 |
-
---
|
| 382 |
|
| 383 |
## 4. Steering and generation improvements
|
| 384 |
|
|
@@ -393,11 +384,9 @@ This clamping approach was the one used by Anthropic in their Golden Gate demo,
|
|
| 393 |
|
| 394 |
We tested the impact of clamping on the same steering vector at the optimal steering coefficient found previously ($\alpha=8.5$). We evaluated the model on the same set of prompts with 20 samples each and a maximum output length of 512 tokens.
|
| 395 |
|
| 396 |
-
|
| 397 |
-
|
| 398 |
-
<Image src={evaluation_clamp_gen} alt="Impact of clamping on metrics" caption="Impact of clamping on metrics." />
|
| 399 |
|
| 400 |
-
|
| 401 |
|
| 402 |
We therefore opted for clamping, in line with the choice made by Anthropic. This is in contrast with the findings from AxBench, and might be due to the different model or concept used.
|
| 403 |
|
|
@@ -412,7 +401,7 @@ As we can see, applying a repetition penalty reduces as expected the 3-gram repe
|
|
| 412 |
|
| 413 |
(Note that the AxBench paper mentioned the repetition penalty but without using it, considering it as *"not the fairest setting, as it often does not accurately resemble normal user behaviour"*, see their appendix K)
|
| 414 |
|
| 415 |
-
|
| 416 |
|
| 417 |
## 5. Multi-Layer optimization
|
| 418 |
|
|
@@ -472,15 +461,13 @@ We performed optimization using 2 features (from layer 15 and layer 19) and then
|
|
| 472 |
|
| 473 |
Results are shown below and compared to single-layer steering.
|
| 474 |
|
| 475 |
-
|
| 476 |
-
|
| 477 |
-
<Image src={evaluation_final} alt="Comparison of single-layer and multi-layer steering" caption="Comparison of single-layer and multi-layer steering." />
|
| 478 |
|
| 479 |
As we can see on the chart, steering 2 or even 8 features simultaneously leads to **only marginal improvements** compared to steering only one feature. Although fluency and instruction following are improved, concept inclusion slightly decreases, leading to a harmonic mean that is only marginally better than single-layer steering. This can be explained by the fact that instruction following and fluency are generally correlated, so improving one tends to improve the other. Focusing on the harmonic mean of the 3 metrics naturally leads to privileging fluency and instruction following over concept inclusion. Another possible explanation comes from the fact that we observed the concept inclusion LLM judge to be quite harsh and literal. Sometimes mention of Paris or a large metal structure were not considered as valid references to the Eiffel Tower, which could explain the low concept inclusion scores.
|
| 480 |
|
| 481 |
Overall, those disappointing results contradict our initial hypothesis that steering multiple complementary features would help better represent the concept and maintain fluency. One possible explanation is our inability to find the true optimum, as the harmonic mean metric is very noisy and hard to optimize. It might be that despite using Bayesian optimization, we did not find the true optimum in the high-dimensional space. Another plausible explanation could be that the selected features are actually redundant rather than complementary, and that steering one of them is sufficient to activate the concept. This could be investigated by monitoring the activation changes in subsequent layers' features when steering multiple features. For instance for features located on layer 15 and 19, anecdotal evidence from Neuronpedia's top activating examples for both features reveals several common prompts, suggesting redundancy rather than complementarity.
|
| 482 |
|
| 483 |
-
|
| 484 |
|
| 485 |
## 6. Conclusion & Discussion
|
| 486 |
|
|
@@ -509,7 +496,7 @@ Overall, our results seem less discouraging than those of AxBench, and show that
|
|
| 509 |
- In the "prompt engineering" case, investigate the impact of prompt wording. For now the model seems to really behave like it has to check a box, rather than actually integrating the concept in a natural way. Can we make it better ? Does it shows up in the activation pattern ? For instance after mentionning the Eiffel tower, does the model activate "suppressing" features to prevent further mentions ?
|
| 510 |
</Note>
|
| 511 |
|
| 512 |
-
|
| 513 |
|
| 514 |
## Appendix
|
| 515 |
|
|
|
|
| 35 |
import Stack from '../components/Stack.astro';
|
| 36 |
|
| 37 |
|
|
|
|
| 38 |
In May 2024, Anthropic released a demo called [Golden Gate Claude](https://www.anthropic.com/news/golden-gate-claude).
|
| 39 |
This experiment was meant to showcase the possibility of steering the behavior of a large language model using *sparse autoencoders* trained on the internal activations of the model [@templeton2024scaling].
|
| 40 |
|
|
|
|
| 53 |
|
| 54 |
By doing this, we will realize that steering a model with vectors coming from SAEs is actually harder than we might have thought. However, we will devise several improvements over naive steering. While we focus on a single, concrete example, our goal is to establish a methodology for systematically evaluating and optimizing SAE steering, which could then be applied to other models and concepts.
|
| 55 |
|
|
|
|
| 56 |
<Note title="Our Main Findings" variant="success">
|
| 57 |
- **The steering 'sweet spot' is smaller than you think.** The optimal steering strength is roughly half the magnitude of a layer's typical activation. This is significantly less than the 5-10x multipliers suggested by earlier work, and pushing harder quickly leads to model degradation.
|
| 58 |
- **Clamping is more effective than adding.** We found that clamping activations (capping them at a maximum value) improves concept inclusion without harming fluency. This aligns with the method used in the Golden Gate Claude demo but directly contradicts the findings reported in AxBench.
|
|
|
|
| 63 |
<iframe
|
| 64 |
src="https://huggingface-eiffel-tower-llama-demo.hf.space"
|
| 65 |
frameborder="0"
|
| 66 |
+
width="100%"
|
| 67 |
height="450"
|
| 68 |
></iframe>
|
| 69 |
|
|
|
|
|
|
|
| 70 |
## 1. Steering with SAEs
|
| 71 |
|
| 72 |
### 1.1 Model steering and sparse autoencoders
|
|
|
|
| 113 |
|
| 114 |
The corresponding Neuronpedia page is included below. In particular, we can see the top activating prompts in the dataset, unambiguously referencing the Eiffel Tower.
|
| 115 |
|
| 116 |
+
<iframe src="https://www.neuronpedia.org/llama3.1-8b-it/15-resid-post-aa/21576?embed=true&embedexplanation=true&embedplots=true&embedtest=true" title="Neuronpedia" style="height: 900px; width: 100%;"></iframe>
|
| 117 |
|
| 118 |
In the training dataset, the maximum activation observed for that feature was 4.77.
|
| 119 |
|
|
|
|
| 159 |
|
| 160 |
However, for this, we will need rigorous metrics to evaluate the quality of our steered models and compare them to baselines.
|
| 161 |
|
|
|
|
| 162 |
|
| 163 |
## 2. Metrics, we need metrics!
|
| 164 |
|
|
|
|
| 241 |
We acknowledge that this is a very crude metric, and probably too pessimistic as the model could subtly reference the Eiffel Tower without actually using the word *eiffel*.
|
| 242 |
(For instance, when referring to *a large metal structure built in Paris.*) Of course, as this metric is hard to generalize to other concepts, we will not use it beyond simple monitoring.
|
| 243 |
|
|
|
|
| 244 |
|
| 245 |
## 3. Optimizing steering coefficient for a single feature
|
| 246 |
|
|
|
|
| 334 |
|
| 335 |
Using the optimal steering coefficient $\alpha=8.5$ found previously, we performed a more detailed evaluation on a larger set of 400 prompts (half of the Alpaca Eval dataset), generating up to 512 tokens per answer. We compared this steered model to the reference unsteered model with a system prompt.
|
| 336 |
|
| 337 |
+
<HtmlEmbed src="d3-evaluation1-naive.html" data="evaluation_summary.json" />
|
|
|
|
|
|
|
| 338 |
|
| 339 |
We can see that on all metrics, **the baseline prompted model significantly outperforms the steered model.** This is consistent with the findings by AxBench that steering with SAEs is not very effective. However, our numbers are not as dire as theirs. We can see an average score in concept inclusion compared to the reference model (1.03), while maintaining a reasonable level of instruction following (1.35). However, this comes at the price of a fluency drop (0.78 vs. 1.55 for the prompted model), as fluency is impaired by repetitions (0.27) or awkward phrasing.
|
| 340 |
|
|
|
|
| 370 |
- for 3-gram repetition, the target is 0.0 but inspecting examples reveals that we can accept values up to 0.2 without much harm.
|
| 371 |
- for log probability under the reference model, successful steering seems to happen when the log prob is between -1.5 and -1.0.
|
| 372 |
|
|
|
|
| 373 |
|
| 374 |
## 4. Steering and generation improvements
|
| 375 |
|
|
|
|
| 384 |
|
| 385 |
We tested the impact of clamping on the same steering vector at the optimal steering coefficient found previously ($\alpha=8.5$). We evaluated the model on the same set of prompts with 20 samples each and a maximum output length of 512 tokens.
|
| 386 |
|
| 387 |
+
<HtmlEmbed src="d3-evaluation2-clamp.html" data="evaluation_summary.json" />
|
|
|
|
|
|
|
| 388 |
|
| 389 |
+
We can see that **clamping has a positive effect on concept inclusion (both from the LLM score and the explicit reference), while not harming the other metrics**.
|
| 390 |
|
| 391 |
We therefore opted for clamping, in line with the choice made by Anthropic. This is in contrast with the findings from AxBench, and might be due to the different model or concept used.
|
| 392 |
|
|
|
|
| 401 |
|
| 402 |
(Note that the AxBench paper mentioned the repetition penalty but without using it, considering it as *"not the fairest setting, as it often does not accurately resemble normal user behaviour"*, see their appendix K)
|
| 403 |
|
| 404 |
+
|
| 405 |
|
| 406 |
## 5. Multi-Layer optimization
|
| 407 |
|
|
|
|
| 461 |
|
| 462 |
Results are shown below and compared to single-layer steering.
|
| 463 |
|
| 464 |
+
<HtmlEmbed src="d3-evaluation3-multi.html" data="evaluation_summary.json" />
|
|
|
|
|
|
|
| 465 |
|
| 466 |
As we can see on the chart, steering 2 or even 8 features simultaneously leads to **only marginal improvements** compared to steering only one feature. Although fluency and instruction following are improved, concept inclusion slightly decreases, leading to a harmonic mean that is only marginally better than single-layer steering. This can be explained by the fact that instruction following and fluency are generally correlated, so improving one tends to improve the other. Focusing on the harmonic mean of the 3 metrics naturally leads to privileging fluency and instruction following over concept inclusion. Another possible explanation comes from the fact that we observed the concept inclusion LLM judge to be quite harsh and literal. Sometimes mention of Paris or a large metal structure were not considered as valid references to the Eiffel Tower, which could explain the low concept inclusion scores.
|
| 467 |
|
| 468 |
Overall, those disappointing results contradict our initial hypothesis that steering multiple complementary features would help better represent the concept and maintain fluency. One possible explanation is our inability to find the true optimum, as the harmonic mean metric is very noisy and hard to optimize. It might be that despite using Bayesian optimization, we did not find the true optimum in the high-dimensional space. Another plausible explanation could be that the selected features are actually redundant rather than complementary, and that steering one of them is sufficient to activate the concept. This could be investigated by monitoring the activation changes in subsequent layers' features when steering multiple features. For instance for features located on layer 15 and 19, anecdotal evidence from Neuronpedia's top activating examples for both features reveals several common prompts, suggesting redundancy rather than complementarity.
|
| 469 |
|
| 470 |
+
|
| 471 |
|
| 472 |
## 6. Conclusion & Discussion
|
| 473 |
|
|
|
|
| 496 |
- In the "prompt engineering" case, investigate the impact of prompt wording. For now the model seems to really behave like it has to check a box, rather than actually integrating the concept in a natural way. Can we make it better ? Does it shows up in the activation pattern ? For instance after mentionning the Eiffel tower, does the model activate "suppressing" features to prevent further mentions ?
|
| 497 |
</Note>
|
| 498 |
|
| 499 |
+
---
|
| 500 |
|
| 501 |
## Appendix
|
| 502 |
|
app/src/content/assets/data/evaluation_summary.json
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:dd5ed71a21399ec95ec3f94fa7d446790154b9333dc6933ff1993693caa89577
|
| 3 |
+
size 7531
|
app/src/content/assets/image/evaluation_summary.svg
ADDED
|
|
app/src/content/embeds/d3-evaluation-grid.html
ADDED
|
@@ -0,0 +1,467 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
<div class="d3-eval-grid"></div>
|
| 2 |
+
<style>
|
| 3 |
+
.d3-eval-grid {
|
| 4 |
+
padding: 8px;
|
| 5 |
+
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
|
| 6 |
+
}
|
| 7 |
+
|
| 8 |
+
.d3-eval-grid .chart-card {
|
| 9 |
+
background: var(--surface-bg);
|
| 10 |
+
border: 1px solid var(--border-color);
|
| 11 |
+
border-radius: 10px;
|
| 12 |
+
padding: 16px;
|
| 13 |
+
}
|
| 14 |
+
|
| 15 |
+
.d3-eval-grid .grid-container {
|
| 16 |
+
display: grid;
|
| 17 |
+
grid-template-columns: repeat(2, 1fr);
|
| 18 |
+
gap: 24px;
|
| 19 |
+
margin-bottom: 16px;
|
| 20 |
+
}
|
| 21 |
+
|
| 22 |
+
@media (max-width: 768px) {
|
| 23 |
+
.d3-eval-grid .grid-container {
|
| 24 |
+
grid-template-columns: 1fr;
|
| 25 |
+
}
|
| 26 |
+
}
|
| 27 |
+
|
| 28 |
+
.d3-eval-grid .subplot {
|
| 29 |
+
background: var(--surface-bg);
|
| 30 |
+
border: 1px solid var(--border-color);
|
| 31 |
+
border-radius: 8px;
|
| 32 |
+
padding: 12px;
|
| 33 |
+
}
|
| 34 |
+
|
| 35 |
+
.d3-eval-grid .subplot-title {
|
| 36 |
+
font-size: 13px;
|
| 37 |
+
font-weight: 600;
|
| 38 |
+
color: var(--text-color);
|
| 39 |
+
margin-bottom: 8px;
|
| 40 |
+
text-align: center;
|
| 41 |
+
}
|
| 42 |
+
|
| 43 |
+
.d3-eval-grid .legend {
|
| 44 |
+
display: flex;
|
| 45 |
+
flex-wrap: wrap;
|
| 46 |
+
gap: 8px 16px;
|
| 47 |
+
padding-top: 12px;
|
| 48 |
+
border-top: 1px solid var(--border-color);
|
| 49 |
+
font-size: 12px;
|
| 50 |
+
justify-content: center;
|
| 51 |
+
}
|
| 52 |
+
|
| 53 |
+
.d3-eval-grid .legend-item {
|
| 54 |
+
display: flex;
|
| 55 |
+
align-items: center;
|
| 56 |
+
gap: 6px;
|
| 57 |
+
cursor: pointer;
|
| 58 |
+
transition: opacity 0.2s;
|
| 59 |
+
}
|
| 60 |
+
|
| 61 |
+
.d3-eval-grid .legend-item.dimmed {
|
| 62 |
+
opacity: 0.3;
|
| 63 |
+
}
|
| 64 |
+
|
| 65 |
+
.d3-eval-grid .legend-swatch {
|
| 66 |
+
width: 14px;
|
| 67 |
+
height: 14px;
|
| 68 |
+
border-radius: 3px;
|
| 69 |
+
border: 1px solid var(--border-color);
|
| 70 |
+
}
|
| 71 |
+
|
| 72 |
+
.d3-eval-grid .axes path,
|
| 73 |
+
.d3-eval-grid .axes line {
|
| 74 |
+
stroke: var(--axis-color);
|
| 75 |
+
}
|
| 76 |
+
|
| 77 |
+
.d3-eval-grid .axes text {
|
| 78 |
+
fill: var(--tick-color);
|
| 79 |
+
font-size: 10px;
|
| 80 |
+
}
|
| 81 |
+
|
| 82 |
+
.d3-eval-grid .grid line {
|
| 83 |
+
stroke: var(--grid-color);
|
| 84 |
+
stroke-dasharray: 2,2;
|
| 85 |
+
opacity: 0.5;
|
| 86 |
+
}
|
| 87 |
+
|
| 88 |
+
.d3-eval-grid .axis-label {
|
| 89 |
+
fill: var(--text-color);
|
| 90 |
+
font-size: 11px;
|
| 91 |
+
font-weight: 600;
|
| 92 |
+
}
|
| 93 |
+
|
| 94 |
+
.d3-eval-grid .d3-tooltip {
|
| 95 |
+
position: absolute;
|
| 96 |
+
pointer-events: none;
|
| 97 |
+
padding: 8px 10px;
|
| 98 |
+
background: var(--surface-bg);
|
| 99 |
+
border: 1px solid var(--border-color);
|
| 100 |
+
border-radius: 8px;
|
| 101 |
+
font-size: 11px;
|
| 102 |
+
line-height: 1.5;
|
| 103 |
+
box-shadow: 0 4px 24px rgba(0,0,0,.18);
|
| 104 |
+
opacity: 0;
|
| 105 |
+
transition: opacity 0.2s;
|
| 106 |
+
z-index: 1000;
|
| 107 |
+
}
|
| 108 |
+
|
| 109 |
+
.d3-eval-grid .bar {
|
| 110 |
+
transition: opacity 0.2s;
|
| 111 |
+
}
|
| 112 |
+
|
| 113 |
+
.d3-eval-grid .bar.dimmed {
|
| 114 |
+
opacity: 0.2;
|
| 115 |
+
}
|
| 116 |
+
</style>
|
| 117 |
+
<script>
|
| 118 |
+
(() => {
|
| 119 |
+
const ensureD3 = (cb) => {
|
| 120 |
+
if (window.d3 && typeof window.d3.select === 'function') return cb();
|
| 121 |
+
let s = document.getElementById('d3-cdn-script');
|
| 122 |
+
if (!s) {
|
| 123 |
+
s = document.createElement('script');
|
| 124 |
+
s.id = 'd3-cdn-script';
|
| 125 |
+
s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
|
| 126 |
+
document.head.appendChild(s);
|
| 127 |
+
}
|
| 128 |
+
s.addEventListener('load', () => {
|
| 129 |
+
if (window.d3 && typeof window.d3.select === 'function') cb();
|
| 130 |
+
}, { once: true });
|
| 131 |
+
};
|
| 132 |
+
|
| 133 |
+
const bootstrap = () => {
|
| 134 |
+
const scriptEl = document.currentScript;
|
| 135 |
+
let container = scriptEl ? scriptEl.previousElementSibling : null;
|
| 136 |
+
if (!(container && container.classList && container.classList.contains('d3-eval-grid'))) {
|
| 137 |
+
const candidates = Array.from(document.querySelectorAll('.d3-eval-grid'))
|
| 138 |
+
.filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
|
| 139 |
+
container = candidates[candidates.length - 1] || null;
|
| 140 |
+
}
|
| 141 |
+
if (!container) return;
|
| 142 |
+
if (container.dataset) {
|
| 143 |
+
if (container.dataset.mounted === 'true') return;
|
| 144 |
+
container.dataset.mounted = 'true';
|
| 145 |
+
}
|
| 146 |
+
|
| 147 |
+
// Find data attribute
|
| 148 |
+
let mountEl = container;
|
| 149 |
+
while (mountEl && !mountEl.getAttribute?.('data-datafiles')) {
|
| 150 |
+
mountEl = mountEl.parentElement;
|
| 151 |
+
}
|
| 152 |
+
let providedData = null;
|
| 153 |
+
try {
|
| 154 |
+
const attr = mountEl && mountEl.getAttribute ? mountEl.getAttribute('data-datafiles') : null;
|
| 155 |
+
if (attr && attr.trim()) {
|
| 156 |
+
providedData = attr.trim().startsWith('[') ? JSON.parse(attr) : attr.trim();
|
| 157 |
+
}
|
| 158 |
+
} catch(_) {}
|
| 159 |
+
|
| 160 |
+
// Check for experiments filter attribute
|
| 161 |
+
let experimentsFilter = null;
|
| 162 |
+
try {
|
| 163 |
+
const expAttr = container.getAttribute('data-experiments');
|
| 164 |
+
if (expAttr) {
|
| 165 |
+
experimentsFilter = JSON.parse(expAttr);
|
| 166 |
+
}
|
| 167 |
+
} catch(_) {}
|
| 168 |
+
|
| 169 |
+
const DEFAULT_JSON = '/data/evaluation_summary.json';
|
| 170 |
+
const ensureDataPrefix = (p) => (typeof p === 'string' && p && !p.includes('/')) ? `/data/${p}` : p;
|
| 171 |
+
|
| 172 |
+
const JSON_PATHS = typeof providedData === 'string'
|
| 173 |
+
? [ensureDataPrefix(providedData)]
|
| 174 |
+
: [
|
| 175 |
+
DEFAULT_JSON,
|
| 176 |
+
'./assets/data/evaluation_summary.json',
|
| 177 |
+
'../assets/data/evaluation_summary.json',
|
| 178 |
+
'../../assets/data/evaluation_summary.json'
|
| 179 |
+
];
|
| 180 |
+
|
| 181 |
+
const fetchFirstAvailable = async (paths) => {
|
| 182 |
+
for (const p of paths) {
|
| 183 |
+
try {
|
| 184 |
+
const r = await fetch(p, { cache: 'no-cache' });
|
| 185 |
+
if (r.ok) return await r.json();
|
| 186 |
+
} catch(_){}
|
| 187 |
+
}
|
| 188 |
+
throw new Error('JSON not found');
|
| 189 |
+
};
|
| 190 |
+
|
| 191 |
+
fetchFirstAvailable(JSON_PATHS)
|
| 192 |
+
.then(rawData => {
|
| 193 |
+
// All experiments in order
|
| 194 |
+
const allExperiments = ['Prompt', 'Basic steering', 'Clamping', 'Clamping + Penalty', '2D optimized', '8D optimized'];
|
| 195 |
+
|
| 196 |
+
// Use filtered experiments if provided, otherwise use all
|
| 197 |
+
const experiments = experimentsFilter || allExperiments;
|
| 198 |
+
|
| 199 |
+
// Metrics in 2x3 grid layout
|
| 200 |
+
const metrics = [
|
| 201 |
+
{ key: 'llm_score_concept', label: 'LLM Concept Score', format: d3.format('.2f') },
|
| 202 |
+
{ key: 'llm_score_instruction', label: 'LLM Instruction Score', format: d3.format('.2f') },
|
| 203 |
+
{ key: 'llm_score_fluency', label: 'LLM Fluency Score', format: d3.format('.2f') },
|
| 204 |
+
{ key: 'rep3', label: '3-gram Repetition Fraction', format: d3.format('.2f') },
|
| 205 |
+
{ key: 'mean_llm_score', label: 'Mean LLM Score', format: d3.format('.2f') },
|
| 206 |
+
{ key: 'harmonic_llm_score', label: 'Harmonic Mean LLM Score', format: d3.format('.2f') }
|
| 207 |
+
];
|
| 208 |
+
|
| 209 |
+
// Restructure data
|
| 210 |
+
const data = {};
|
| 211 |
+
rawData.forEach(d => {
|
| 212 |
+
if (!data[d.metric]) data[d.metric] = {};
|
| 213 |
+
data[d.metric][d.experiment] = { mean: d.mean, std: d.std };
|
| 214 |
+
});
|
| 215 |
+
|
| 216 |
+
// Color palette - consistent across all charts
|
| 217 |
+
const allColors = {
|
| 218 |
+
'Prompt': '#4c4c4c',
|
| 219 |
+
'Basic steering': '#b2b2b2',
|
| 220 |
+
'Clamping': '#b2b2cc',
|
| 221 |
+
'Clamping + Penalty': '#b2b2e6',
|
| 222 |
+
'2D optimized': '#b2ffb2',
|
| 223 |
+
'8D optimized': '#ffb2ff'
|
| 224 |
+
};
|
| 225 |
+
|
| 226 |
+
const card = document.createElement('div');
|
| 227 |
+
card.className = 'chart-card';
|
| 228 |
+
container.appendChild(card);
|
| 229 |
+
|
| 230 |
+
const gridContainer = document.createElement('div');
|
| 231 |
+
gridContainer.className = 'grid-container';
|
| 232 |
+
card.appendChild(gridContainer);
|
| 233 |
+
|
| 234 |
+
// Tooltip
|
| 235 |
+
const tooltip = d3.select(card).append('div')
|
| 236 |
+
.attr('class', 'd3-tooltip')
|
| 237 |
+
.style('transform', 'translate(-9999px, -9999px)');
|
| 238 |
+
|
| 239 |
+
let hoveredExperiment = null;
|
| 240 |
+
|
| 241 |
+
// Create each subplot
|
| 242 |
+
metrics.forEach((metric, idx) => {
|
| 243 |
+
const subplot = document.createElement('div');
|
| 244 |
+
subplot.className = 'subplot';
|
| 245 |
+
subplot.dataset.metric = metric.key;
|
| 246 |
+
gridContainer.appendChild(subplot);
|
| 247 |
+
|
| 248 |
+
const title = document.createElement('div');
|
| 249 |
+
title.className = 'subplot-title';
|
| 250 |
+
title.textContent = metric.label;
|
| 251 |
+
subplot.appendChild(title);
|
| 252 |
+
|
| 253 |
+
const svg = d3.select(subplot).append('svg')
|
| 254 |
+
.attr('width', '100%')
|
| 255 |
+
.style('display', 'block');
|
| 256 |
+
|
| 257 |
+
const g = svg.append('g');
|
| 258 |
+
const gGrid = g.append('g').attr('class', 'grid');
|
| 259 |
+
const gBars = g.append('g').attr('class', 'bars');
|
| 260 |
+
const gErrorBars = g.append('g').attr('class', 'error-bars');
|
| 261 |
+
const gAxes = g.append('g').attr('class', 'axes');
|
| 262 |
+
|
| 263 |
+
subplot._render = () => {
|
| 264 |
+
const width = subplot.clientWidth || 300;
|
| 265 |
+
const height = Math.max(200, Math.round(width * 0.6));
|
| 266 |
+
const margin = { top: 10, right: 10, bottom: 60, left: 50 };
|
| 267 |
+
const innerWidth = width - margin.left - margin.right;
|
| 268 |
+
const innerHeight = height - margin.top - margin.bottom;
|
| 269 |
+
|
| 270 |
+
svg.attr('height', height);
|
| 271 |
+
g.attr('transform', `translate(${margin.left},${margin.top})`);
|
| 272 |
+
|
| 273 |
+
// Scales
|
| 274 |
+
const x = d3.scaleBand()
|
| 275 |
+
.domain(experiments)
|
| 276 |
+
.range([0, innerWidth])
|
| 277 |
+
.padding(0.2);
|
| 278 |
+
|
| 279 |
+
// Find y domain for this metric
|
| 280 |
+
const values = experiments.map(exp => data[metric.key]?.[exp]?.mean).filter(v => v !== undefined);
|
| 281 |
+
const stds = experiments.map(exp => data[metric.key]?.[exp]?.std).filter(v => v !== undefined);
|
| 282 |
+
const maxVal = d3.max(values.map((v, i) => v + stds[i]));
|
| 283 |
+
const minVal = d3.min(values.map((v, i) => Math.max(0, v - stds[i])));
|
| 284 |
+
|
| 285 |
+
const y = d3.scaleLinear()
|
| 286 |
+
.domain([Math.max(0, minVal * 0.95), maxVal * 1.05])
|
| 287 |
+
.range([innerHeight, 0])
|
| 288 |
+
.nice();
|
| 289 |
+
|
| 290 |
+
// Grid
|
| 291 |
+
gGrid.selectAll('*').remove();
|
| 292 |
+
gGrid.selectAll('line')
|
| 293 |
+
.data(y.ticks(4))
|
| 294 |
+
.join('line')
|
| 295 |
+
.attr('x1', 0)
|
| 296 |
+
.attr('x2', innerWidth)
|
| 297 |
+
.attr('y1', d => y(d))
|
| 298 |
+
.attr('y2', d => y(d));
|
| 299 |
+
|
| 300 |
+
// Axes
|
| 301 |
+
gAxes.selectAll('*').remove();
|
| 302 |
+
|
| 303 |
+
const xAxis = gAxes.append('g')
|
| 304 |
+
.attr('transform', `translate(0,${innerHeight})`)
|
| 305 |
+
.call(d3.axisBottom(x).tickSize(3));
|
| 306 |
+
|
| 307 |
+
xAxis.selectAll('text')
|
| 308 |
+
.attr('transform', 'rotate(-45)')
|
| 309 |
+
.style('text-anchor', 'end')
|
| 310 |
+
.attr('dx', '-0.5em')
|
| 311 |
+
.attr('dy', '0.15em');
|
| 312 |
+
|
| 313 |
+
gAxes.append('g')
|
| 314 |
+
.call(d3.axisLeft(y).ticks(4).tickFormat(metric.format).tickSize(3));
|
| 315 |
+
|
| 316 |
+
// Draw bars
|
| 317 |
+
const bars = [];
|
| 318 |
+
experiments.forEach(exp => {
|
| 319 |
+
const d = data[metric.key]?.[exp];
|
| 320 |
+
if (d) {
|
| 321 |
+
bars.push({
|
| 322 |
+
experiment: exp,
|
| 323 |
+
mean: d.mean,
|
| 324 |
+
std: d.std,
|
| 325 |
+
color: allColors[exp],
|
| 326 |
+
x: x(exp),
|
| 327 |
+
y: y(d.mean),
|
| 328 |
+
width: x.bandwidth(),
|
| 329 |
+
height: innerHeight - y(d.mean)
|
| 330 |
+
});
|
| 331 |
+
}
|
| 332 |
+
});
|
| 333 |
+
|
| 334 |
+
gBars.selectAll('rect')
|
| 335 |
+
.data(bars)
|
| 336 |
+
.join('rect')
|
| 337 |
+
.attr('class', 'bar')
|
| 338 |
+
.attr('x', d => d.x)
|
| 339 |
+
.attr('y', d => d.y)
|
| 340 |
+
.attr('width', d => d.width)
|
| 341 |
+
.attr('height', d => d.height)
|
| 342 |
+
.attr('fill', d => d.color)
|
| 343 |
+
.attr('rx', 2)
|
| 344 |
+
.classed('dimmed', d => hoveredExperiment && d.experiment !== hoveredExperiment)
|
| 345 |
+
.on('mouseenter', (event, d) => {
|
| 346 |
+
hoveredExperiment = d.experiment;
|
| 347 |
+
updateAll();
|
| 348 |
+
tooltip
|
| 349 |
+
.style('opacity', 1)
|
| 350 |
+
.html(`
|
| 351 |
+
<div><strong>${d.experiment}</strong></div>
|
| 352 |
+
<div style="margin-top: 4px;">${metric.label}</div>
|
| 353 |
+
<div style="margin-top: 4px;"><strong>Mean:</strong> ${metric.format(d.mean)}</div>
|
| 354 |
+
<div><strong>Std:</strong> ${metric.format(d.std)}</div>
|
| 355 |
+
`);
|
| 356 |
+
})
|
| 357 |
+
.on('mousemove', (event) => {
|
| 358 |
+
const [mx, my] = d3.pointer(event, card);
|
| 359 |
+
tooltip.style('transform', `translate(${mx + 10}px, ${my + 10}px)`);
|
| 360 |
+
})
|
| 361 |
+
.on('mouseleave', () => {
|
| 362 |
+
hoveredExperiment = null;
|
| 363 |
+
updateAll();
|
| 364 |
+
tooltip.style('opacity', 0).style('transform', 'translate(-9999px, -9999px)');
|
| 365 |
+
});
|
| 366 |
+
|
| 367 |
+
// Error bars
|
| 368 |
+
gErrorBars.selectAll('line')
|
| 369 |
+
.data(bars)
|
| 370 |
+
.join('line')
|
| 371 |
+
.attr('x1', d => d.x + d.width / 2)
|
| 372 |
+
.attr('x2', d => d.x + d.width / 2)
|
| 373 |
+
.attr('y1', d => y(d.mean + d.std))
|
| 374 |
+
.attr('y2', d => y(Math.max(0, d.mean - d.std)))
|
| 375 |
+
.attr('stroke', '#666')
|
| 376 |
+
.attr('stroke-width', 1.5)
|
| 377 |
+
.attr('opacity', 0.6);
|
| 378 |
+
|
| 379 |
+
// Error bar caps
|
| 380 |
+
gErrorBars.selectAll('.cap-top')
|
| 381 |
+
.data(bars)
|
| 382 |
+
.join('line')
|
| 383 |
+
.attr('class', 'cap-top')
|
| 384 |
+
.attr('x1', d => d.x + d.width / 2 - 3)
|
| 385 |
+
.attr('x2', d => d.x + d.width / 2 + 3)
|
| 386 |
+
.attr('y1', d => y(d.mean + d.std))
|
| 387 |
+
.attr('y2', d => y(d.mean + d.std))
|
| 388 |
+
.attr('stroke', '#666')
|
| 389 |
+
.attr('stroke-width', 1.5)
|
| 390 |
+
.attr('opacity', 0.6);
|
| 391 |
+
|
| 392 |
+
gErrorBars.selectAll('.cap-bottom')
|
| 393 |
+
.data(bars)
|
| 394 |
+
.join('line')
|
| 395 |
+
.attr('class', 'cap-bottom')
|
| 396 |
+
.attr('x1', d => d.x + d.width / 2 - 3)
|
| 397 |
+
.attr('x2', d => d.x + d.width / 2 + 3)
|
| 398 |
+
.attr('y1', d => y(Math.max(0, d.mean - d.std)))
|
| 399 |
+
.attr('y2', d => y(Math.max(0, d.mean - d.std)))
|
| 400 |
+
.attr('stroke', '#666')
|
| 401 |
+
.attr('stroke-width', 1.5)
|
| 402 |
+
.attr('opacity', 0.6);
|
| 403 |
+
};
|
| 404 |
+
});
|
| 405 |
+
|
| 406 |
+
// Legend
|
| 407 |
+
const legend = document.createElement('div');
|
| 408 |
+
legend.className = 'legend';
|
| 409 |
+
experiments.forEach(exp => {
|
| 410 |
+
const item = document.createElement('div');
|
| 411 |
+
item.className = 'legend-item';
|
| 412 |
+
item.dataset.experiment = exp;
|
| 413 |
+
item.innerHTML = `
|
| 414 |
+
<div class="legend-swatch" style="background: ${allColors[exp]}"></div>
|
| 415 |
+
<span>${exp}</span>
|
| 416 |
+
`;
|
| 417 |
+
legend.appendChild(item);
|
| 418 |
+
});
|
| 419 |
+
card.appendChild(legend);
|
| 420 |
+
|
| 421 |
+
// Legend interaction
|
| 422 |
+
legend.querySelectorAll('.legend-item').forEach(item => {
|
| 423 |
+
item.addEventListener('mouseenter', () => {
|
| 424 |
+
hoveredExperiment = item.dataset.experiment;
|
| 425 |
+
updateAll();
|
| 426 |
+
});
|
| 427 |
+
item.addEventListener('mouseleave', () => {
|
| 428 |
+
hoveredExperiment = null;
|
| 429 |
+
updateAll();
|
| 430 |
+
});
|
| 431 |
+
});
|
| 432 |
+
|
| 433 |
+
const updateAll = () => {
|
| 434 |
+
gridContainer.querySelectorAll('.subplot').forEach(subplot => {
|
| 435 |
+
if (subplot._render) subplot._render();
|
| 436 |
+
});
|
| 437 |
+
|
| 438 |
+
legend.querySelectorAll('.legend-item').forEach(item => {
|
| 439 |
+
if (hoveredExperiment && item.dataset.experiment !== hoveredExperiment) {
|
| 440 |
+
item.classList.add('dimmed');
|
| 441 |
+
} else {
|
| 442 |
+
item.classList.remove('dimmed');
|
| 443 |
+
}
|
| 444 |
+
});
|
| 445 |
+
};
|
| 446 |
+
|
| 447 |
+
updateAll();
|
| 448 |
+
|
| 449 |
+
if (window.ResizeObserver) {
|
| 450 |
+
const ro = new ResizeObserver(() => updateAll());
|
| 451 |
+
ro.observe(container);
|
| 452 |
+
} else {
|
| 453 |
+
window.addEventListener('resize', updateAll);
|
| 454 |
+
}
|
| 455 |
+
})
|
| 456 |
+
.catch(err => {
|
| 457 |
+
container.innerHTML = `<div style="color: red; padding: 20px;">Error: ${err.message}</div>`;
|
| 458 |
+
});
|
| 459 |
+
};
|
| 460 |
+
|
| 461 |
+
if (document.readyState === 'loading') {
|
| 462 |
+
document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
|
| 463 |
+
} else {
|
| 464 |
+
ensureD3(bootstrap);
|
| 465 |
+
}
|
| 466 |
+
})();
|
| 467 |
+
</script>
|
app/src/content/embeds/d3-evaluation1-naive.html
ADDED
|
@@ -0,0 +1,414 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
<div class="d3-eval-grid d3-eval-grid-1"></div>
|
| 2 |
+
<style>
|
| 3 |
+
.d3-eval-grid {
|
| 4 |
+
padding: 2px;
|
| 5 |
+
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
|
| 6 |
+
}
|
| 7 |
+
|
| 8 |
+
.d3-eval-grid .grid-container {
|
| 9 |
+
display: grid;
|
| 10 |
+
grid-template-columns: repeat(2, 1fr);
|
| 11 |
+
gap: 8px;
|
| 12 |
+
}
|
| 13 |
+
|
| 14 |
+
@media (max-width: 768px) {
|
| 15 |
+
.d3-eval-grid .grid-container {
|
| 16 |
+
grid-template-columns: 1fr;
|
| 17 |
+
}
|
| 18 |
+
}
|
| 19 |
+
|
| 20 |
+
.d3-eval-grid .subplot {
|
| 21 |
+
padding: 4px;
|
| 22 |
+
}
|
| 23 |
+
|
| 24 |
+
.d3-eval-grid .subplot-title {
|
| 25 |
+
font-size: 12px;
|
| 26 |
+
font-weight: 600;
|
| 27 |
+
color: var(--text-color);
|
| 28 |
+
margin-bottom: 4px;
|
| 29 |
+
text-align: center;
|
| 30 |
+
}
|
| 31 |
+
|
| 32 |
+
|
| 33 |
+
.d3-eval-grid .axes path,
|
| 34 |
+
.d3-eval-grid .axes line {
|
| 35 |
+
stroke: var(--axis-color);
|
| 36 |
+
}
|
| 37 |
+
|
| 38 |
+
.d3-eval-grid .axes text {
|
| 39 |
+
fill: var(--tick-color);
|
| 40 |
+
font-size: 9px;
|
| 41 |
+
}
|
| 42 |
+
|
| 43 |
+
.d3-eval-grid .grid line {
|
| 44 |
+
stroke: var(--grid-color);
|
| 45 |
+
stroke-dasharray: 2,2;
|
| 46 |
+
opacity: 0.5;
|
| 47 |
+
}
|
| 48 |
+
|
| 49 |
+
.d3-eval-grid .axis-label {
|
| 50 |
+
fill: var(--text-color);
|
| 51 |
+
font-size: 11px;
|
| 52 |
+
font-weight: 600;
|
| 53 |
+
}
|
| 54 |
+
|
| 55 |
+
.d3-eval-grid .d3-tooltip {
|
| 56 |
+
position: absolute;
|
| 57 |
+
pointer-events: none;
|
| 58 |
+
padding: 8px 10px;
|
| 59 |
+
background: var(--surface-bg);
|
| 60 |
+
border: 1px solid var(--border-color);
|
| 61 |
+
border-radius: 8px;
|
| 62 |
+
font-size: 11px;
|
| 63 |
+
line-height: 1.5;
|
| 64 |
+
box-shadow: 0 4px 24px rgba(0,0,0,.18);
|
| 65 |
+
opacity: 0;
|
| 66 |
+
transition: opacity 0.2s;
|
| 67 |
+
z-index: 1000;
|
| 68 |
+
}
|
| 69 |
+
|
| 70 |
+
.d3-eval-grid .bar {
|
| 71 |
+
transition: opacity 0.2s;
|
| 72 |
+
}
|
| 73 |
+
|
| 74 |
+
.d3-eval-grid .bar.dimmed {
|
| 75 |
+
opacity: 0.2;
|
| 76 |
+
}
|
| 77 |
+
</style>
|
| 78 |
+
<script>
|
| 79 |
+
(() => {
|
| 80 |
+
const ensureD3 = (cb) => {
|
| 81 |
+
if (window.d3 && typeof window.d3.select === 'function') return cb();
|
| 82 |
+
let s = document.getElementById('d3-cdn-script');
|
| 83 |
+
if (!s) {
|
| 84 |
+
s = document.createElement('script');
|
| 85 |
+
s.id = 'd3-cdn-script';
|
| 86 |
+
s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
|
| 87 |
+
document.head.appendChild(s);
|
| 88 |
+
}
|
| 89 |
+
s.addEventListener('load', () => {
|
| 90 |
+
if (window.d3 && typeof window.d3.select === 'function') cb();
|
| 91 |
+
}, { once: true });
|
| 92 |
+
};
|
| 93 |
+
|
| 94 |
+
const bootstrap = () => {
|
| 95 |
+
const scriptEl = document.currentScript;
|
| 96 |
+
let container = scriptEl ? scriptEl.previousElementSibling : null;
|
| 97 |
+
if (!(container && container.classList && container.classList.contains('d3-eval-grid-1'))) {
|
| 98 |
+
const candidates = Array.from(document.querySelectorAll('.d3-eval-grid-1'))
|
| 99 |
+
.filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
|
| 100 |
+
container = candidates[candidates.length - 1] || null;
|
| 101 |
+
}
|
| 102 |
+
if (!container) return;
|
| 103 |
+
if (container.dataset) {
|
| 104 |
+
if (container.dataset.mounted === 'true') return;
|
| 105 |
+
container.dataset.mounted = 'true';
|
| 106 |
+
}
|
| 107 |
+
|
| 108 |
+
// Find data attribute
|
| 109 |
+
let mountEl = container;
|
| 110 |
+
while (mountEl && !mountEl.getAttribute?.('data-datafiles')) {
|
| 111 |
+
mountEl = mountEl.parentElement;
|
| 112 |
+
}
|
| 113 |
+
let providedData = null;
|
| 114 |
+
try {
|
| 115 |
+
const attr = mountEl && mountEl.getAttribute ? mountEl.getAttribute('data-datafiles') : null;
|
| 116 |
+
if (attr && attr.trim()) {
|
| 117 |
+
providedData = attr.trim().startsWith('[') ? JSON.parse(attr) : attr.trim();
|
| 118 |
+
}
|
| 119 |
+
} catch(_) {}
|
| 120 |
+
|
| 121 |
+
// Check for experiments filter attribute
|
| 122 |
+
let experimentsFilter = null;
|
| 123 |
+
try {
|
| 124 |
+
const expAttr = container.getAttribute('data-experiments');
|
| 125 |
+
if (expAttr) {
|
| 126 |
+
experimentsFilter = JSON.parse(expAttr);
|
| 127 |
+
}
|
| 128 |
+
} catch(_) {}
|
| 129 |
+
|
| 130 |
+
const DEFAULT_JSON = '/data/evaluation_summary.json';
|
| 131 |
+
const ensureDataPrefix = (p) => (typeof p === 'string' && p && !p.includes('/')) ? `/data/${p}` : p;
|
| 132 |
+
|
| 133 |
+
const JSON_PATHS = typeof providedData === 'string'
|
| 134 |
+
? [ensureDataPrefix(providedData)]
|
| 135 |
+
: [
|
| 136 |
+
DEFAULT_JSON,
|
| 137 |
+
'./assets/data/evaluation_summary.json',
|
| 138 |
+
'../assets/data/evaluation_summary.json',
|
| 139 |
+
'../../assets/data/evaluation_summary.json'
|
| 140 |
+
];
|
| 141 |
+
|
| 142 |
+
const fetchFirstAvailable = async (paths) => {
|
| 143 |
+
for (const p of paths) {
|
| 144 |
+
try {
|
| 145 |
+
const r = await fetch(p, { cache: 'no-cache' });
|
| 146 |
+
if (r.ok) return await r.json();
|
| 147 |
+
} catch(_){}
|
| 148 |
+
}
|
| 149 |
+
throw new Error('JSON not found');
|
| 150 |
+
};
|
| 151 |
+
|
| 152 |
+
fetchFirstAvailable(JSON_PATHS)
|
| 153 |
+
.then(rawData => {
|
| 154 |
+
// Chart 1: Only Prompt and Basic steering (but reserve space for all)
|
| 155 |
+
const allExperiments = ['Prompt', 'Basic steering', 'Clamping', 'Clamping + Penalty', '2D optimized', '8D optimized'];
|
| 156 |
+
const visibleExperiments = ['Prompt', 'Basic steering'];
|
| 157 |
+
|
| 158 |
+
// Metrics in 2x4 grid layout (8 metrics)
|
| 159 |
+
const metrics = [
|
| 160 |
+
{ key: 'llm_score_concept', label: 'LLM Concept Score', format: d3.format('.2f') },
|
| 161 |
+
{ key: 'eiffel', label: 'Explicit Concept Presence', format: d3.format('.2f') },
|
| 162 |
+
{ key: 'llm_score_instruction', label: 'LLM Instruction Score', format: d3.format('.2f') },
|
| 163 |
+
{ key: 'minus_log_prob', label: 'Surprise in Original Model', format: d3.format('.2f') },
|
| 164 |
+
{ key: 'llm_score_fluency', label: 'LLM Fluency Score', format: d3.format('.2f') },
|
| 165 |
+
{ key: 'rep3', label: '3-gram Repetition Fraction', format: d3.format('.2f') },
|
| 166 |
+
{ key: 'mean_llm_score', label: 'Mean LLM Score', format: d3.format('.2f') },
|
| 167 |
+
{ key: 'harmonic_llm_score', label: 'Harmonic Mean LLM Score', format: d3.format('.2f') }
|
| 168 |
+
];
|
| 169 |
+
|
| 170 |
+
// Restructure data
|
| 171 |
+
const data = {};
|
| 172 |
+
rawData.forEach(d => {
|
| 173 |
+
if (!data[d.metric]) data[d.metric] = {};
|
| 174 |
+
data[d.metric][d.experiment] = { mean: d.mean, std: d.std };
|
| 175 |
+
});
|
| 176 |
+
|
| 177 |
+
// Color palette - consistent across all charts
|
| 178 |
+
const allColors = {
|
| 179 |
+
'Prompt': '#4c4c4c',
|
| 180 |
+
'Basic steering': '#b2b2b2',
|
| 181 |
+
'Clamping': '#b2b2cc',
|
| 182 |
+
'Clamping + Penalty': '#b2b2e6',
|
| 183 |
+
'2D optimized': '#b2ffb2',
|
| 184 |
+
'8D optimized': '#ffb2ff'
|
| 185 |
+
};
|
| 186 |
+
|
| 187 |
+
const gridContainer = document.createElement('div');
|
| 188 |
+
gridContainer.className = 'grid-container';
|
| 189 |
+
container.appendChild(gridContainer);
|
| 190 |
+
|
| 191 |
+
// Tooltip
|
| 192 |
+
const tooltip = d3.select(container).append('div')
|
| 193 |
+
.attr('class', 'd3-tooltip')
|
| 194 |
+
.style('transform', 'translate(-9999px, -9999px)');
|
| 195 |
+
|
| 196 |
+
let hoveredExperiment = null;
|
| 197 |
+
|
| 198 |
+
// Create each subplot
|
| 199 |
+
metrics.forEach((metric, idx) => {
|
| 200 |
+
const subplot = document.createElement('div');
|
| 201 |
+
subplot.className = 'subplot';
|
| 202 |
+
subplot.dataset.metric = metric.key;
|
| 203 |
+
gridContainer.appendChild(subplot);
|
| 204 |
+
|
| 205 |
+
const title = document.createElement('div');
|
| 206 |
+
title.className = 'subplot-title';
|
| 207 |
+
title.textContent = metric.label;
|
| 208 |
+
subplot.appendChild(title);
|
| 209 |
+
|
| 210 |
+
const svg = d3.select(subplot).append('svg')
|
| 211 |
+
.attr('width', '100%')
|
| 212 |
+
.style('display', 'block');
|
| 213 |
+
|
| 214 |
+
const g = svg.append('g');
|
| 215 |
+
const gGrid = g.append('g').attr('class', 'grid');
|
| 216 |
+
const gBars = g.append('g').attr('class', 'bars');
|
| 217 |
+
const gErrorBars = g.append('g').attr('class', 'error-bars');
|
| 218 |
+
const gAxes = g.append('g').attr('class', 'axes');
|
| 219 |
+
const gLabels = g.append('g').attr('class', 'value-labels');
|
| 220 |
+
|
| 221 |
+
subplot._render = () => {
|
| 222 |
+
const width = subplot.clientWidth || 300;
|
| 223 |
+
const height = Math.max(200, Math.round(width * 0.6));
|
| 224 |
+
const margin = { top: 10, right: 20, bottom: 70, left: 42 };
|
| 225 |
+
const innerWidth = width - margin.left - margin.right;
|
| 226 |
+
const innerHeight = height - margin.top - margin.bottom;
|
| 227 |
+
|
| 228 |
+
svg.attr('height', height);
|
| 229 |
+
g.attr('transform', `translate(${margin.left},${margin.top})`);
|
| 230 |
+
|
| 231 |
+
// Scales - use all experiments for consistent positioning
|
| 232 |
+
const x = d3.scaleBand()
|
| 233 |
+
.domain(allExperiments)
|
| 234 |
+
.range([0, innerWidth])
|
| 235 |
+
.padding(0.2);
|
| 236 |
+
|
| 237 |
+
// Fixed y-axis ranges based on metric type
|
| 238 |
+
const yDomains = {
|
| 239 |
+
'llm_score_concept': [0, 2],
|
| 240 |
+
'llm_score_instruction': [0, 2],
|
| 241 |
+
'llm_score_fluency': [0, 2],
|
| 242 |
+
'mean_llm_score': [0, 2],
|
| 243 |
+
'harmonic_llm_score': [0, 2],
|
| 244 |
+
'eiffel': [0, 1],
|
| 245 |
+
'minus_log_prob': [0, 2],
|
| 246 |
+
'rep3': [0, 0.5]
|
| 247 |
+
};
|
| 248 |
+
|
| 249 |
+
const y = d3.scaleLinear()
|
| 250 |
+
.domain(yDomains[metric.key] || [0, 1])
|
| 251 |
+
.range([innerHeight, 0]);
|
| 252 |
+
|
| 253 |
+
// Grid
|
| 254 |
+
gGrid.selectAll('*').remove();
|
| 255 |
+
gGrid.selectAll('line')
|
| 256 |
+
.data(y.ticks(4))
|
| 257 |
+
.join('line')
|
| 258 |
+
.attr('x1', 0)
|
| 259 |
+
.attr('x2', innerWidth)
|
| 260 |
+
.attr('y1', d => y(d))
|
| 261 |
+
.attr('y2', d => y(d));
|
| 262 |
+
|
| 263 |
+
// Axes
|
| 264 |
+
gAxes.selectAll('*').remove();
|
| 265 |
+
|
| 266 |
+
const xAxis = gAxes.append('g')
|
| 267 |
+
.attr('transform', `translate(0,${innerHeight})`)
|
| 268 |
+
.call(d3.axisBottom(x).tickSize(3));
|
| 269 |
+
|
| 270 |
+
// Only show labels for visible experiments
|
| 271 |
+
xAxis.selectAll('text')
|
| 272 |
+
.attr('transform', 'rotate(-45)')
|
| 273 |
+
.style('text-anchor', 'end')
|
| 274 |
+
.attr('dx', '-0.5em')
|
| 275 |
+
.attr('dy', '0.15em')
|
| 276 |
+
.style('opacity', function() {
|
| 277 |
+
const text = d3.select(this).text();
|
| 278 |
+
return visibleExperiments.includes(text) ? 1 : 0;
|
| 279 |
+
});
|
| 280 |
+
|
| 281 |
+
gAxes.append('g')
|
| 282 |
+
.call(d3.axisLeft(y).ticks(4).tickFormat(metric.format).tickSize(3));
|
| 283 |
+
|
| 284 |
+
// Draw bars (only for visible experiments)
|
| 285 |
+
const bars = [];
|
| 286 |
+
visibleExperiments.forEach(exp => {
|
| 287 |
+
const d = data[metric.key]?.[exp];
|
| 288 |
+
if (d) {
|
| 289 |
+
bars.push({
|
| 290 |
+
experiment: exp,
|
| 291 |
+
mean: d.mean,
|
| 292 |
+
std: d.std,
|
| 293 |
+
color: allColors[exp],
|
| 294 |
+
x: x(exp),
|
| 295 |
+
y: y(d.mean),
|
| 296 |
+
width: x.bandwidth(),
|
| 297 |
+
height: innerHeight - y(d.mean)
|
| 298 |
+
});
|
| 299 |
+
}
|
| 300 |
+
});
|
| 301 |
+
|
| 302 |
+
gBars.selectAll('rect')
|
| 303 |
+
.data(bars)
|
| 304 |
+
.join('rect')
|
| 305 |
+
.attr('class', 'bar')
|
| 306 |
+
.attr('x', d => d.x)
|
| 307 |
+
.attr('y', d => d.y)
|
| 308 |
+
.attr('width', d => d.width)
|
| 309 |
+
.attr('height', d => d.height)
|
| 310 |
+
.attr('fill', d => d.color)
|
| 311 |
+
.attr('rx', 2)
|
| 312 |
+
.classed('dimmed', d => hoveredExperiment && d.experiment !== hoveredExperiment)
|
| 313 |
+
.on('mouseenter', (event, d) => {
|
| 314 |
+
hoveredExperiment = d.experiment;
|
| 315 |
+
|
| 316 |
+
// Show value label on bar
|
| 317 |
+
gLabels.selectAll('text').remove();
|
| 318 |
+
gLabels.append('text')
|
| 319 |
+
.attr('x', d.x + d.width / 2)
|
| 320 |
+
.attr('y', d.y - 5)
|
| 321 |
+
.attr('text-anchor', 'middle')
|
| 322 |
+
.attr('fill', 'var(--text-color)')
|
| 323 |
+
.attr('font-size', '11px')
|
| 324 |
+
.attr('font-weight', '600')
|
| 325 |
+
.text(metric.format(d.mean));
|
| 326 |
+
|
| 327 |
+
updateAll();
|
| 328 |
+
tooltip
|
| 329 |
+
.style('opacity', 1)
|
| 330 |
+
.html(`
|
| 331 |
+
<div><strong>${d.experiment}</strong></div>
|
| 332 |
+
<div style="margin-top: 4px;">${metric.label}</div>
|
| 333 |
+
<div style="margin-top: 4px;"><strong>Mean:</strong> ${metric.format(d.mean)}</div>
|
| 334 |
+
<div><strong>Std:</strong> ${metric.format(d.std)}</div>
|
| 335 |
+
`);
|
| 336 |
+
})
|
| 337 |
+
.on('mousemove', (event) => {
|
| 338 |
+
const [mx, my] = d3.pointer(event, container);
|
| 339 |
+
tooltip.style('transform', `translate(${mx + 10}px, ${my + 10}px)`);
|
| 340 |
+
})
|
| 341 |
+
.on('mouseleave', () => {
|
| 342 |
+
hoveredExperiment = null;
|
| 343 |
+
gLabels.selectAll('text').remove();
|
| 344 |
+
updateAll();
|
| 345 |
+
tooltip.style('opacity', 0).style('transform', 'translate(-9999px, -9999px)');
|
| 346 |
+
});
|
| 347 |
+
|
| 348 |
+
// Error bars
|
| 349 |
+
gErrorBars.selectAll('line')
|
| 350 |
+
.data(bars)
|
| 351 |
+
.join('line')
|
| 352 |
+
.attr('x1', d => d.x + d.width / 2)
|
| 353 |
+
.attr('x2', d => d.x + d.width / 2)
|
| 354 |
+
.attr('y1', d => y(d.mean + d.std))
|
| 355 |
+
.attr('y2', d => y(Math.max(0, d.mean - d.std)))
|
| 356 |
+
.attr('stroke', '#666')
|
| 357 |
+
.attr('stroke-width', 1.5)
|
| 358 |
+
.attr('opacity', 0.6);
|
| 359 |
+
|
| 360 |
+
// Error bar caps
|
| 361 |
+
gErrorBars.selectAll('.cap-top')
|
| 362 |
+
.data(bars)
|
| 363 |
+
.join('line')
|
| 364 |
+
.attr('class', 'cap-top')
|
| 365 |
+
.attr('x1', d => d.x + d.width / 2 - 3)
|
| 366 |
+
.attr('x2', d => d.x + d.width / 2 + 3)
|
| 367 |
+
.attr('y1', d => y(d.mean + d.std))
|
| 368 |
+
.attr('y2', d => y(d.mean + d.std))
|
| 369 |
+
.attr('stroke', '#666')
|
| 370 |
+
.attr('stroke-width', 1.5)
|
| 371 |
+
.attr('opacity', 0.6);
|
| 372 |
+
|
| 373 |
+
gErrorBars.selectAll('.cap-bottom')
|
| 374 |
+
.data(bars)
|
| 375 |
+
.join('line')
|
| 376 |
+
.attr('class', 'cap-bottom')
|
| 377 |
+
.attr('x1', d => d.x + d.width / 2 - 3)
|
| 378 |
+
.attr('x2', d => d.x + d.width / 2 + 3)
|
| 379 |
+
.attr('y1', d => y(Math.max(0, d.mean - d.std)))
|
| 380 |
+
.attr('y2', d => y(Math.max(0, d.mean - d.std)))
|
| 381 |
+
.attr('stroke', '#666')
|
| 382 |
+
.attr('stroke-width', 1.5)
|
| 383 |
+
.attr('opacity', 0.6);
|
| 384 |
+
};
|
| 385 |
+
});
|
| 386 |
+
|
| 387 |
+
const updateAll = () => {
|
| 388 |
+
gridContainer.querySelectorAll('.subplot').forEach(subplot => {
|
| 389 |
+
if (subplot._render) subplot._render();
|
| 390 |
+
});
|
| 391 |
+
|
| 392 |
+
};
|
| 393 |
+
|
| 394 |
+
updateAll();
|
| 395 |
+
|
| 396 |
+
if (window.ResizeObserver) {
|
| 397 |
+
const ro = new ResizeObserver(() => updateAll());
|
| 398 |
+
ro.observe(container);
|
| 399 |
+
} else {
|
| 400 |
+
window.addEventListener('resize', updateAll);
|
| 401 |
+
}
|
| 402 |
+
})
|
| 403 |
+
.catch(err => {
|
| 404 |
+
container.innerHTML = `<div style="color: red; padding: 20px;">Error: ${err.message}</div>`;
|
| 405 |
+
});
|
| 406 |
+
};
|
| 407 |
+
|
| 408 |
+
if (document.readyState === 'loading') {
|
| 409 |
+
document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
|
| 410 |
+
} else {
|
| 411 |
+
ensureD3(bootstrap);
|
| 412 |
+
}
|
| 413 |
+
})();
|
| 414 |
+
</script>
|
app/src/content/embeds/d3-evaluation2-clamp.html
ADDED
|
@@ -0,0 +1,414 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
<div class="d3-eval-grid d3-eval-grid-2"></div>
|
| 2 |
+
<style>
|
| 3 |
+
.d3-eval-grid {
|
| 4 |
+
padding: 2px;
|
| 5 |
+
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
|
| 6 |
+
}
|
| 7 |
+
|
| 8 |
+
.d3-eval-grid .grid-container {
|
| 9 |
+
display: grid;
|
| 10 |
+
grid-template-columns: repeat(2, 1fr);
|
| 11 |
+
gap: 8px;
|
| 12 |
+
}
|
| 13 |
+
|
| 14 |
+
@media (max-width: 768px) {
|
| 15 |
+
.d3-eval-grid .grid-container {
|
| 16 |
+
grid-template-columns: 1fr;
|
| 17 |
+
}
|
| 18 |
+
}
|
| 19 |
+
|
| 20 |
+
.d3-eval-grid .subplot {
|
| 21 |
+
padding: 4px;
|
| 22 |
+
}
|
| 23 |
+
|
| 24 |
+
.d3-eval-grid .subplot-title {
|
| 25 |
+
font-size: 12px;
|
| 26 |
+
font-weight: 600;
|
| 27 |
+
color: var(--text-color);
|
| 28 |
+
margin-bottom: 4px;
|
| 29 |
+
text-align: center;
|
| 30 |
+
}
|
| 31 |
+
|
| 32 |
+
|
| 33 |
+
.d3-eval-grid .axes path,
|
| 34 |
+
.d3-eval-grid .axes line {
|
| 35 |
+
stroke: var(--axis-color);
|
| 36 |
+
}
|
| 37 |
+
|
| 38 |
+
.d3-eval-grid .axes text {
|
| 39 |
+
fill: var(--tick-color);
|
| 40 |
+
font-size: 9px;
|
| 41 |
+
}
|
| 42 |
+
|
| 43 |
+
.d3-eval-grid .grid line {
|
| 44 |
+
stroke: var(--grid-color);
|
| 45 |
+
stroke-dasharray: 2,2;
|
| 46 |
+
opacity: 0.5;
|
| 47 |
+
}
|
| 48 |
+
|
| 49 |
+
.d3-eval-grid .axis-label {
|
| 50 |
+
fill: var(--text-color);
|
| 51 |
+
font-size: 11px;
|
| 52 |
+
font-weight: 600;
|
| 53 |
+
}
|
| 54 |
+
|
| 55 |
+
.d3-eval-grid .d3-tooltip {
|
| 56 |
+
position: absolute;
|
| 57 |
+
pointer-events: none;
|
| 58 |
+
padding: 8px 10px;
|
| 59 |
+
background: var(--surface-bg);
|
| 60 |
+
border: 1px solid var(--border-color);
|
| 61 |
+
border-radius: 8px;
|
| 62 |
+
font-size: 11px;
|
| 63 |
+
line-height: 1.5;
|
| 64 |
+
box-shadow: 0 4px 24px rgba(0,0,0,.18);
|
| 65 |
+
opacity: 0;
|
| 66 |
+
transition: opacity 0.2s;
|
| 67 |
+
z-index: 1000;
|
| 68 |
+
}
|
| 69 |
+
|
| 70 |
+
.d3-eval-grid .bar {
|
| 71 |
+
transition: opacity 0.2s;
|
| 72 |
+
}
|
| 73 |
+
|
| 74 |
+
.d3-eval-grid .bar.dimmed {
|
| 75 |
+
opacity: 0.2;
|
| 76 |
+
}
|
| 77 |
+
</style>
|
| 78 |
+
<script>
|
| 79 |
+
(() => {
|
| 80 |
+
const ensureD3 = (cb) => {
|
| 81 |
+
if (window.d3 && typeof window.d3.select === 'function') return cb();
|
| 82 |
+
let s = document.getElementById('d3-cdn-script');
|
| 83 |
+
if (!s) {
|
| 84 |
+
s = document.createElement('script');
|
| 85 |
+
s.id = 'd3-cdn-script';
|
| 86 |
+
s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
|
| 87 |
+
document.head.appendChild(s);
|
| 88 |
+
}
|
| 89 |
+
s.addEventListener('load', () => {
|
| 90 |
+
if (window.d3 && typeof window.d3.select === 'function') cb();
|
| 91 |
+
}, { once: true });
|
| 92 |
+
};
|
| 93 |
+
|
| 94 |
+
const bootstrap = () => {
|
| 95 |
+
const scriptEl = document.currentScript;
|
| 96 |
+
let container = scriptEl ? scriptEl.previousElementSibling : null;
|
| 97 |
+
if (!(container && container.classList && container.classList.contains('d3-eval-grid-2'))) {
|
| 98 |
+
const candidates = Array.from(document.querySelectorAll('.d3-eval-grid-2'))
|
| 99 |
+
.filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
|
| 100 |
+
container = candidates[candidates.length - 1] || null;
|
| 101 |
+
}
|
| 102 |
+
if (!container) return;
|
| 103 |
+
if (container.dataset) {
|
| 104 |
+
if (container.dataset.mounted === 'true') return;
|
| 105 |
+
container.dataset.mounted = 'true';
|
| 106 |
+
}
|
| 107 |
+
|
| 108 |
+
// Find data attribute
|
| 109 |
+
let mountEl = container;
|
| 110 |
+
while (mountEl && !mountEl.getAttribute?.('data-datafiles')) {
|
| 111 |
+
mountEl = mountEl.parentElement;
|
| 112 |
+
}
|
| 113 |
+
let providedData = null;
|
| 114 |
+
try {
|
| 115 |
+
const attr = mountEl && mountEl.getAttribute ? mountEl.getAttribute('data-datafiles') : null;
|
| 116 |
+
if (attr && attr.trim()) {
|
| 117 |
+
providedData = attr.trim().startsWith('[') ? JSON.parse(attr) : attr.trim();
|
| 118 |
+
}
|
| 119 |
+
} catch(_) {}
|
| 120 |
+
|
| 121 |
+
// Check for experiments filter attribute
|
| 122 |
+
let experimentsFilter = null;
|
| 123 |
+
try {
|
| 124 |
+
const expAttr = container.getAttribute('data-experiments');
|
| 125 |
+
if (expAttr) {
|
| 126 |
+
experimentsFilter = JSON.parse(expAttr);
|
| 127 |
+
}
|
| 128 |
+
} catch(_) {}
|
| 129 |
+
|
| 130 |
+
const DEFAULT_JSON = '/data/evaluation_summary.json';
|
| 131 |
+
const ensureDataPrefix = (p) => (typeof p === 'string' && p && !p.includes('/')) ? `/data/${p}` : p;
|
| 132 |
+
|
| 133 |
+
const JSON_PATHS = typeof providedData === 'string'
|
| 134 |
+
? [ensureDataPrefix(providedData)]
|
| 135 |
+
: [
|
| 136 |
+
DEFAULT_JSON,
|
| 137 |
+
'./assets/data/evaluation_summary.json',
|
| 138 |
+
'../assets/data/evaluation_summary.json',
|
| 139 |
+
'../../assets/data/evaluation_summary.json'
|
| 140 |
+
];
|
| 141 |
+
|
| 142 |
+
const fetchFirstAvailable = async (paths) => {
|
| 143 |
+
for (const p of paths) {
|
| 144 |
+
try {
|
| 145 |
+
const r = await fetch(p, { cache: 'no-cache' });
|
| 146 |
+
if (r.ok) return await r.json();
|
| 147 |
+
} catch(_){}
|
| 148 |
+
}
|
| 149 |
+
throw new Error('JSON not found');
|
| 150 |
+
};
|
| 151 |
+
|
| 152 |
+
fetchFirstAvailable(JSON_PATHS)
|
| 153 |
+
.then(rawData => {
|
| 154 |
+
// Chart 2: Add clamping experiments (but reserve space for all)
|
| 155 |
+
const allExperiments = ['Prompt', 'Basic steering', 'Clamping', 'Clamping + Penalty', '2D optimized', '8D optimized'];
|
| 156 |
+
const visibleExperiments = ['Prompt', 'Basic steering', 'Clamping', 'Clamping + Penalty'];
|
| 157 |
+
|
| 158 |
+
// Metrics in 2x4 grid layout (8 metrics)
|
| 159 |
+
const metrics = [
|
| 160 |
+
{ key: 'llm_score_concept', label: 'LLM Concept Score', format: d3.format('.2f') },
|
| 161 |
+
{ key: 'eiffel', label: 'Explicit Concept Presence', format: d3.format('.2f') },
|
| 162 |
+
{ key: 'llm_score_instruction', label: 'LLM Instruction Score', format: d3.format('.2f') },
|
| 163 |
+
{ key: 'minus_log_prob', label: 'Surprise in Original Model', format: d3.format('.2f') },
|
| 164 |
+
{ key: 'llm_score_fluency', label: 'LLM Fluency Score', format: d3.format('.2f') },
|
| 165 |
+
{ key: 'rep3', label: '3-gram Repetition Fraction', format: d3.format('.2f') },
|
| 166 |
+
{ key: 'mean_llm_score', label: 'Mean LLM Score', format: d3.format('.2f') },
|
| 167 |
+
{ key: 'harmonic_llm_score', label: 'Harmonic Mean LLM Score', format: d3.format('.2f') }
|
| 168 |
+
];
|
| 169 |
+
|
| 170 |
+
// Restructure data
|
| 171 |
+
const data = {};
|
| 172 |
+
rawData.forEach(d => {
|
| 173 |
+
if (!data[d.metric]) data[d.metric] = {};
|
| 174 |
+
data[d.metric][d.experiment] = { mean: d.mean, std: d.std };
|
| 175 |
+
});
|
| 176 |
+
|
| 177 |
+
// Color palette - consistent across all charts
|
| 178 |
+
const allColors = {
|
| 179 |
+
'Prompt': '#4c4c4c',
|
| 180 |
+
'Basic steering': '#b2b2b2',
|
| 181 |
+
'Clamping': '#b2b2cc',
|
| 182 |
+
'Clamping + Penalty': '#b2b2e6',
|
| 183 |
+
'2D optimized': '#b2ffb2',
|
| 184 |
+
'8D optimized': '#ffb2ff'
|
| 185 |
+
};
|
| 186 |
+
|
| 187 |
+
const gridContainer = document.createElement('div');
|
| 188 |
+
gridContainer.className = 'grid-container';
|
| 189 |
+
container.appendChild(gridContainer);
|
| 190 |
+
|
| 191 |
+
// Tooltip
|
| 192 |
+
const tooltip = d3.select(container).append('div')
|
| 193 |
+
.attr('class', 'd3-tooltip')
|
| 194 |
+
.style('transform', 'translate(-9999px, -9999px)');
|
| 195 |
+
|
| 196 |
+
let hoveredExperiment = null;
|
| 197 |
+
|
| 198 |
+
// Create each subplot
|
| 199 |
+
metrics.forEach((metric, idx) => {
|
| 200 |
+
const subplot = document.createElement('div');
|
| 201 |
+
subplot.className = 'subplot';
|
| 202 |
+
subplot.dataset.metric = metric.key;
|
| 203 |
+
gridContainer.appendChild(subplot);
|
| 204 |
+
|
| 205 |
+
const title = document.createElement('div');
|
| 206 |
+
title.className = 'subplot-title';
|
| 207 |
+
title.textContent = metric.label;
|
| 208 |
+
subplot.appendChild(title);
|
| 209 |
+
|
| 210 |
+
const svg = d3.select(subplot).append('svg')
|
| 211 |
+
.attr('width', '100%')
|
| 212 |
+
.style('display', 'block');
|
| 213 |
+
|
| 214 |
+
const g = svg.append('g');
|
| 215 |
+
const gGrid = g.append('g').attr('class', 'grid');
|
| 216 |
+
const gBars = g.append('g').attr('class', 'bars');
|
| 217 |
+
const gErrorBars = g.append('g').attr('class', 'error-bars');
|
| 218 |
+
const gAxes = g.append('g').attr('class', 'axes');
|
| 219 |
+
const gLabels = g.append('g').attr('class', 'value-labels');
|
| 220 |
+
|
| 221 |
+
subplot._render = () => {
|
| 222 |
+
const width = subplot.clientWidth || 300;
|
| 223 |
+
const height = Math.max(200, Math.round(width * 0.6));
|
| 224 |
+
const margin = { top: 10, right: 20, bottom: 70, left: 42 };
|
| 225 |
+
const innerWidth = width - margin.left - margin.right;
|
| 226 |
+
const innerHeight = height - margin.top - margin.bottom;
|
| 227 |
+
|
| 228 |
+
svg.attr('height', height);
|
| 229 |
+
g.attr('transform', `translate(${margin.left},${margin.top})`);
|
| 230 |
+
|
| 231 |
+
// Scales - use all experiments for consistent positioning
|
| 232 |
+
const x = d3.scaleBand()
|
| 233 |
+
.domain(allExperiments)
|
| 234 |
+
.range([0, innerWidth])
|
| 235 |
+
.padding(0.2);
|
| 236 |
+
|
| 237 |
+
// Fixed y-axis ranges based on metric type
|
| 238 |
+
const yDomains = {
|
| 239 |
+
'llm_score_concept': [0, 2],
|
| 240 |
+
'llm_score_instruction': [0, 2],
|
| 241 |
+
'llm_score_fluency': [0, 2],
|
| 242 |
+
'mean_llm_score': [0, 2],
|
| 243 |
+
'harmonic_llm_score': [0, 2],
|
| 244 |
+
'eiffel': [0, 1],
|
| 245 |
+
'minus_log_prob': [0, 2],
|
| 246 |
+
'rep3': [0, 0.5]
|
| 247 |
+
};
|
| 248 |
+
|
| 249 |
+
const y = d3.scaleLinear()
|
| 250 |
+
.domain(yDomains[metric.key] || [0, 1])
|
| 251 |
+
.range([innerHeight, 0]);
|
| 252 |
+
|
| 253 |
+
// Grid
|
| 254 |
+
gGrid.selectAll('*').remove();
|
| 255 |
+
gGrid.selectAll('line')
|
| 256 |
+
.data(y.ticks(4))
|
| 257 |
+
.join('line')
|
| 258 |
+
.attr('x1', 0)
|
| 259 |
+
.attr('x2', innerWidth)
|
| 260 |
+
.attr('y1', d => y(d))
|
| 261 |
+
.attr('y2', d => y(d));
|
| 262 |
+
|
| 263 |
+
// Axes
|
| 264 |
+
gAxes.selectAll('*').remove();
|
| 265 |
+
|
| 266 |
+
const xAxis = gAxes.append('g')
|
| 267 |
+
.attr('transform', `translate(0,${innerHeight})`)
|
| 268 |
+
.call(d3.axisBottom(x).tickSize(3));
|
| 269 |
+
|
| 270 |
+
// Only show labels for visible experiments
|
| 271 |
+
xAxis.selectAll('text')
|
| 272 |
+
.attr('transform', 'rotate(-45)')
|
| 273 |
+
.style('text-anchor', 'end')
|
| 274 |
+
.attr('dx', '-0.5em')
|
| 275 |
+
.attr('dy', '0.15em')
|
| 276 |
+
.style('opacity', function() {
|
| 277 |
+
const text = d3.select(this).text();
|
| 278 |
+
return visibleExperiments.includes(text) ? 1 : 0;
|
| 279 |
+
});
|
| 280 |
+
|
| 281 |
+
gAxes.append('g')
|
| 282 |
+
.call(d3.axisLeft(y).ticks(4).tickFormat(metric.format).tickSize(3));
|
| 283 |
+
|
| 284 |
+
// Draw bars (only for visible experiments)
|
| 285 |
+
const bars = [];
|
| 286 |
+
visibleExperiments.forEach(exp => {
|
| 287 |
+
const d = data[metric.key]?.[exp];
|
| 288 |
+
if (d) {
|
| 289 |
+
bars.push({
|
| 290 |
+
experiment: exp,
|
| 291 |
+
mean: d.mean,
|
| 292 |
+
std: d.std,
|
| 293 |
+
color: allColors[exp],
|
| 294 |
+
x: x(exp),
|
| 295 |
+
y: y(d.mean),
|
| 296 |
+
width: x.bandwidth(),
|
| 297 |
+
height: innerHeight - y(d.mean)
|
| 298 |
+
});
|
| 299 |
+
}
|
| 300 |
+
});
|
| 301 |
+
|
| 302 |
+
gBars.selectAll('rect')
|
| 303 |
+
.data(bars)
|
| 304 |
+
.join('rect')
|
| 305 |
+
.attr('class', 'bar')
|
| 306 |
+
.attr('x', d => d.x)
|
| 307 |
+
.attr('y', d => d.y)
|
| 308 |
+
.attr('width', d => d.width)
|
| 309 |
+
.attr('height', d => d.height)
|
| 310 |
+
.attr('fill', d => d.color)
|
| 311 |
+
.attr('rx', 2)
|
| 312 |
+
.classed('dimmed', d => hoveredExperiment && d.experiment !== hoveredExperiment)
|
| 313 |
+
.on('mouseenter', (event, d) => {
|
| 314 |
+
hoveredExperiment = d.experiment;
|
| 315 |
+
|
| 316 |
+
// Show value label on bar
|
| 317 |
+
gLabels.selectAll('text').remove();
|
| 318 |
+
gLabels.append('text')
|
| 319 |
+
.attr('x', d.x + d.width / 2)
|
| 320 |
+
.attr('y', d.y - 5)
|
| 321 |
+
.attr('text-anchor', 'middle')
|
| 322 |
+
.attr('fill', 'var(--text-color)')
|
| 323 |
+
.attr('font-size', '11px')
|
| 324 |
+
.attr('font-weight', '600')
|
| 325 |
+
.text(metric.format(d.mean));
|
| 326 |
+
|
| 327 |
+
updateAll();
|
| 328 |
+
tooltip
|
| 329 |
+
.style('opacity', 1)
|
| 330 |
+
.html(`
|
| 331 |
+
<div><strong>${d.experiment}</strong></div>
|
| 332 |
+
<div style="margin-top: 4px;">${metric.label}</div>
|
| 333 |
+
<div style="margin-top: 4px;"><strong>Mean:</strong> ${metric.format(d.mean)}</div>
|
| 334 |
+
<div><strong>Std:</strong> ${metric.format(d.std)}</div>
|
| 335 |
+
`);
|
| 336 |
+
})
|
| 337 |
+
.on('mousemove', (event) => {
|
| 338 |
+
const [mx, my] = d3.pointer(event, container);
|
| 339 |
+
tooltip.style('transform', `translate(${mx + 10}px, ${my + 10}px)`);
|
| 340 |
+
})
|
| 341 |
+
.on('mouseleave', () => {
|
| 342 |
+
hoveredExperiment = null;
|
| 343 |
+
gLabels.selectAll('text').remove();
|
| 344 |
+
updateAll();
|
| 345 |
+
tooltip.style('opacity', 0).style('transform', 'translate(-9999px, -9999px)');
|
| 346 |
+
});
|
| 347 |
+
|
| 348 |
+
// Error bars
|
| 349 |
+
gErrorBars.selectAll('line')
|
| 350 |
+
.data(bars)
|
| 351 |
+
.join('line')
|
| 352 |
+
.attr('x1', d => d.x + d.width / 2)
|
| 353 |
+
.attr('x2', d => d.x + d.width / 2)
|
| 354 |
+
.attr('y1', d => y(d.mean + d.std))
|
| 355 |
+
.attr('y2', d => y(Math.max(0, d.mean - d.std)))
|
| 356 |
+
.attr('stroke', '#666')
|
| 357 |
+
.attr('stroke-width', 1.5)
|
| 358 |
+
.attr('opacity', 0.6);
|
| 359 |
+
|
| 360 |
+
// Error bar caps
|
| 361 |
+
gErrorBars.selectAll('.cap-top')
|
| 362 |
+
.data(bars)
|
| 363 |
+
.join('line')
|
| 364 |
+
.attr('class', 'cap-top')
|
| 365 |
+
.attr('x1', d => d.x + d.width / 2 - 3)
|
| 366 |
+
.attr('x2', d => d.x + d.width / 2 + 3)
|
| 367 |
+
.attr('y1', d => y(d.mean + d.std))
|
| 368 |
+
.attr('y2', d => y(d.mean + d.std))
|
| 369 |
+
.attr('stroke', '#666')
|
| 370 |
+
.attr('stroke-width', 1.5)
|
| 371 |
+
.attr('opacity', 0.6);
|
| 372 |
+
|
| 373 |
+
gErrorBars.selectAll('.cap-bottom')
|
| 374 |
+
.data(bars)
|
| 375 |
+
.join('line')
|
| 376 |
+
.attr('class', 'cap-bottom')
|
| 377 |
+
.attr('x1', d => d.x + d.width / 2 - 3)
|
| 378 |
+
.attr('x2', d => d.x + d.width / 2 + 3)
|
| 379 |
+
.attr('y1', d => y(Math.max(0, d.mean - d.std)))
|
| 380 |
+
.attr('y2', d => y(Math.max(0, d.mean - d.std)))
|
| 381 |
+
.attr('stroke', '#666')
|
| 382 |
+
.attr('stroke-width', 1.5)
|
| 383 |
+
.attr('opacity', 0.6);
|
| 384 |
+
};
|
| 385 |
+
});
|
| 386 |
+
|
| 387 |
+
const updateAll = () => {
|
| 388 |
+
gridContainer.querySelectorAll('.subplot').forEach(subplot => {
|
| 389 |
+
if (subplot._render) subplot._render();
|
| 390 |
+
});
|
| 391 |
+
|
| 392 |
+
};
|
| 393 |
+
|
| 394 |
+
updateAll();
|
| 395 |
+
|
| 396 |
+
if (window.ResizeObserver) {
|
| 397 |
+
const ro = new ResizeObserver(() => updateAll());
|
| 398 |
+
ro.observe(container);
|
| 399 |
+
} else {
|
| 400 |
+
window.addEventListener('resize', updateAll);
|
| 401 |
+
}
|
| 402 |
+
})
|
| 403 |
+
.catch(err => {
|
| 404 |
+
container.innerHTML = `<div style="color: red; padding: 20px;">Error: ${err.message}</div>`;
|
| 405 |
+
});
|
| 406 |
+
};
|
| 407 |
+
|
| 408 |
+
if (document.readyState === 'loading') {
|
| 409 |
+
document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
|
| 410 |
+
} else {
|
| 411 |
+
ensureD3(bootstrap);
|
| 412 |
+
}
|
| 413 |
+
})();
|
| 414 |
+
</script>
|
app/src/content/embeds/d3-evaluation3-multi.html
ADDED
|
@@ -0,0 +1,414 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
<div class="d3-eval-grid d3-eval-grid-3"></div>
|
| 2 |
+
<style>
|
| 3 |
+
.d3-eval-grid {
|
| 4 |
+
padding: 2px;
|
| 5 |
+
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
|
| 6 |
+
}
|
| 7 |
+
|
| 8 |
+
.d3-eval-grid .grid-container {
|
| 9 |
+
display: grid;
|
| 10 |
+
grid-template-columns: repeat(2, 1fr);
|
| 11 |
+
gap: 8px;
|
| 12 |
+
}
|
| 13 |
+
|
| 14 |
+
@media (max-width: 768px) {
|
| 15 |
+
.d3-eval-grid .grid-container {
|
| 16 |
+
grid-template-columns: 1fr;
|
| 17 |
+
}
|
| 18 |
+
}
|
| 19 |
+
|
| 20 |
+
.d3-eval-grid .subplot {
|
| 21 |
+
padding: 4px;
|
| 22 |
+
}
|
| 23 |
+
|
| 24 |
+
.d3-eval-grid .subplot-title {
|
| 25 |
+
font-size: 12px;
|
| 26 |
+
font-weight: 600;
|
| 27 |
+
color: var(--text-color);
|
| 28 |
+
margin-bottom: 4px;
|
| 29 |
+
text-align: center;
|
| 30 |
+
}
|
| 31 |
+
|
| 32 |
+
|
| 33 |
+
.d3-eval-grid .axes path,
|
| 34 |
+
.d3-eval-grid .axes line {
|
| 35 |
+
stroke: var(--axis-color);
|
| 36 |
+
}
|
| 37 |
+
|
| 38 |
+
.d3-eval-grid .axes text {
|
| 39 |
+
fill: var(--tick-color);
|
| 40 |
+
font-size: 9px;
|
| 41 |
+
}
|
| 42 |
+
|
| 43 |
+
.d3-eval-grid .grid line {
|
| 44 |
+
stroke: var(--grid-color);
|
| 45 |
+
stroke-dasharray: 2,2;
|
| 46 |
+
opacity: 0.5;
|
| 47 |
+
}
|
| 48 |
+
|
| 49 |
+
.d3-eval-grid .axis-label {
|
| 50 |
+
fill: var(--text-color);
|
| 51 |
+
font-size: 11px;
|
| 52 |
+
font-weight: 600;
|
| 53 |
+
}
|
| 54 |
+
|
| 55 |
+
.d3-eval-grid .d3-tooltip {
|
| 56 |
+
position: absolute;
|
| 57 |
+
pointer-events: none;
|
| 58 |
+
padding: 8px 10px;
|
| 59 |
+
background: var(--surface-bg);
|
| 60 |
+
border: 1px solid var(--border-color);
|
| 61 |
+
border-radius: 8px;
|
| 62 |
+
font-size: 11px;
|
| 63 |
+
line-height: 1.5;
|
| 64 |
+
box-shadow: 0 4px 24px rgba(0,0,0,.18);
|
| 65 |
+
opacity: 0;
|
| 66 |
+
transition: opacity 0.2s;
|
| 67 |
+
z-index: 1000;
|
| 68 |
+
}
|
| 69 |
+
|
| 70 |
+
.d3-eval-grid .bar {
|
| 71 |
+
transition: opacity 0.2s;
|
| 72 |
+
}
|
| 73 |
+
|
| 74 |
+
.d3-eval-grid .bar.dimmed {
|
| 75 |
+
opacity: 0.2;
|
| 76 |
+
}
|
| 77 |
+
</style>
|
| 78 |
+
<script>
|
| 79 |
+
(() => {
|
| 80 |
+
const ensureD3 = (cb) => {
|
| 81 |
+
if (window.d3 && typeof window.d3.select === 'function') return cb();
|
| 82 |
+
let s = document.getElementById('d3-cdn-script');
|
| 83 |
+
if (!s) {
|
| 84 |
+
s = document.createElement('script');
|
| 85 |
+
s.id = 'd3-cdn-script';
|
| 86 |
+
s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
|
| 87 |
+
document.head.appendChild(s);
|
| 88 |
+
}
|
| 89 |
+
s.addEventListener('load', () => {
|
| 90 |
+
if (window.d3 && typeof window.d3.select === 'function') cb();
|
| 91 |
+
}, { once: true });
|
| 92 |
+
};
|
| 93 |
+
|
| 94 |
+
const bootstrap = () => {
|
| 95 |
+
const scriptEl = document.currentScript;
|
| 96 |
+
let container = scriptEl ? scriptEl.previousElementSibling : null;
|
| 97 |
+
if (!(container && container.classList && container.classList.contains('d3-eval-grid-3'))) {
|
| 98 |
+
const candidates = Array.from(document.querySelectorAll('.d3-eval-grid-3'))
|
| 99 |
+
.filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
|
| 100 |
+
container = candidates[candidates.length - 1] || null;
|
| 101 |
+
}
|
| 102 |
+
if (!container) return;
|
| 103 |
+
if (container.dataset) {
|
| 104 |
+
if (container.dataset.mounted === 'true') return;
|
| 105 |
+
container.dataset.mounted = 'true';
|
| 106 |
+
}
|
| 107 |
+
|
| 108 |
+
// Find data attribute
|
| 109 |
+
let mountEl = container;
|
| 110 |
+
while (mountEl && !mountEl.getAttribute?.('data-datafiles')) {
|
| 111 |
+
mountEl = mountEl.parentElement;
|
| 112 |
+
}
|
| 113 |
+
let providedData = null;
|
| 114 |
+
try {
|
| 115 |
+
const attr = mountEl && mountEl.getAttribute ? mountEl.getAttribute('data-datafiles') : null;
|
| 116 |
+
if (attr && attr.trim()) {
|
| 117 |
+
providedData = attr.trim().startsWith('[') ? JSON.parse(attr) : attr.trim();
|
| 118 |
+
}
|
| 119 |
+
} catch(_) {}
|
| 120 |
+
|
| 121 |
+
// Check for experiments filter attribute
|
| 122 |
+
let experimentsFilter = null;
|
| 123 |
+
try {
|
| 124 |
+
const expAttr = container.getAttribute('data-experiments');
|
| 125 |
+
if (expAttr) {
|
| 126 |
+
experimentsFilter = JSON.parse(expAttr);
|
| 127 |
+
}
|
| 128 |
+
} catch(_) {}
|
| 129 |
+
|
| 130 |
+
const DEFAULT_JSON = '/data/evaluation_summary.json';
|
| 131 |
+
const ensureDataPrefix = (p) => (typeof p === 'string' && p && !p.includes('/')) ? `/data/${p}` : p;
|
| 132 |
+
|
| 133 |
+
const JSON_PATHS = typeof providedData === 'string'
|
| 134 |
+
? [ensureDataPrefix(providedData)]
|
| 135 |
+
: [
|
| 136 |
+
DEFAULT_JSON,
|
| 137 |
+
'./assets/data/evaluation_summary.json',
|
| 138 |
+
'../assets/data/evaluation_summary.json',
|
| 139 |
+
'../../assets/data/evaluation_summary.json'
|
| 140 |
+
];
|
| 141 |
+
|
| 142 |
+
const fetchFirstAvailable = async (paths) => {
|
| 143 |
+
for (const p of paths) {
|
| 144 |
+
try {
|
| 145 |
+
const r = await fetch(p, { cache: 'no-cache' });
|
| 146 |
+
if (r.ok) return await r.json();
|
| 147 |
+
} catch(_){}
|
| 148 |
+
}
|
| 149 |
+
throw new Error('JSON not found');
|
| 150 |
+
};
|
| 151 |
+
|
| 152 |
+
fetchFirstAvailable(JSON_PATHS)
|
| 153 |
+
.then(rawData => {
|
| 154 |
+
// Chart 3: All experiments including multi-layer optimization
|
| 155 |
+
const allExperiments = ['Prompt', 'Basic steering', 'Clamping', 'Clamping + Penalty', '2D optimized', '8D optimized'];
|
| 156 |
+
const visibleExperiments = allExperiments;
|
| 157 |
+
|
| 158 |
+
// Metrics in 2x4 grid layout (8 metrics)
|
| 159 |
+
const metrics = [
|
| 160 |
+
{ key: 'llm_score_concept', label: 'LLM Concept Score', format: d3.format('.2f') },
|
| 161 |
+
{ key: 'eiffel', label: 'Explicit Concept Presence', format: d3.format('.2f') },
|
| 162 |
+
{ key: 'llm_score_instruction', label: 'LLM Instruction Score', format: d3.format('.2f') },
|
| 163 |
+
{ key: 'minus_log_prob', label: 'Surprise in Original Model', format: d3.format('.2f') },
|
| 164 |
+
{ key: 'llm_score_fluency', label: 'LLM Fluency Score', format: d3.format('.2f') },
|
| 165 |
+
{ key: 'rep3', label: '3-gram Repetition Fraction', format: d3.format('.2f') },
|
| 166 |
+
{ key: 'mean_llm_score', label: 'Mean LLM Score', format: d3.format('.2f') },
|
| 167 |
+
{ key: 'harmonic_llm_score', label: 'Harmonic Mean LLM Score', format: d3.format('.2f') }
|
| 168 |
+
];
|
| 169 |
+
|
| 170 |
+
// Restructure data
|
| 171 |
+
const data = {};
|
| 172 |
+
rawData.forEach(d => {
|
| 173 |
+
if (!data[d.metric]) data[d.metric] = {};
|
| 174 |
+
data[d.metric][d.experiment] = { mean: d.mean, std: d.std };
|
| 175 |
+
});
|
| 176 |
+
|
| 177 |
+
// Color palette - consistent across all charts
|
| 178 |
+
const allColors = {
|
| 179 |
+
'Prompt': '#4c4c4c',
|
| 180 |
+
'Basic steering': '#b2b2b2',
|
| 181 |
+
'Clamping': '#b2b2cc',
|
| 182 |
+
'Clamping + Penalty': '#b2b2e6',
|
| 183 |
+
'2D optimized': '#b2ffb2',
|
| 184 |
+
'8D optimized': '#ffb2ff'
|
| 185 |
+
};
|
| 186 |
+
|
| 187 |
+
const gridContainer = document.createElement('div');
|
| 188 |
+
gridContainer.className = 'grid-container';
|
| 189 |
+
container.appendChild(gridContainer);
|
| 190 |
+
|
| 191 |
+
// Tooltip
|
| 192 |
+
const tooltip = d3.select(container).append('div')
|
| 193 |
+
.attr('class', 'd3-tooltip')
|
| 194 |
+
.style('transform', 'translate(-9999px, -9999px)');
|
| 195 |
+
|
| 196 |
+
let hoveredExperiment = null;
|
| 197 |
+
|
| 198 |
+
// Create each subplot
|
| 199 |
+
metrics.forEach((metric, idx) => {
|
| 200 |
+
const subplot = document.createElement('div');
|
| 201 |
+
subplot.className = 'subplot';
|
| 202 |
+
subplot.dataset.metric = metric.key;
|
| 203 |
+
gridContainer.appendChild(subplot);
|
| 204 |
+
|
| 205 |
+
const title = document.createElement('div');
|
| 206 |
+
title.className = 'subplot-title';
|
| 207 |
+
title.textContent = metric.label;
|
| 208 |
+
subplot.appendChild(title);
|
| 209 |
+
|
| 210 |
+
const svg = d3.select(subplot).append('svg')
|
| 211 |
+
.attr('width', '100%')
|
| 212 |
+
.style('display', 'block');
|
| 213 |
+
|
| 214 |
+
const g = svg.append('g');
|
| 215 |
+
const gGrid = g.append('g').attr('class', 'grid');
|
| 216 |
+
const gBars = g.append('g').attr('class', 'bars');
|
| 217 |
+
const gErrorBars = g.append('g').attr('class', 'error-bars');
|
| 218 |
+
const gAxes = g.append('g').attr('class', 'axes');
|
| 219 |
+
const gLabels = g.append('g').attr('class', 'value-labels');
|
| 220 |
+
|
| 221 |
+
subplot._render = () => {
|
| 222 |
+
const width = subplot.clientWidth || 300;
|
| 223 |
+
const height = Math.max(200, Math.round(width * 0.6));
|
| 224 |
+
const margin = { top: 10, right: 20, bottom: 70, left: 42 };
|
| 225 |
+
const innerWidth = width - margin.left - margin.right;
|
| 226 |
+
const innerHeight = height - margin.top - margin.bottom;
|
| 227 |
+
|
| 228 |
+
svg.attr('height', height);
|
| 229 |
+
g.attr('transform', `translate(${margin.left},${margin.top})`);
|
| 230 |
+
|
| 231 |
+
// Scales - use all experiments for consistent positioning
|
| 232 |
+
const x = d3.scaleBand()
|
| 233 |
+
.domain(allExperiments)
|
| 234 |
+
.range([0, innerWidth])
|
| 235 |
+
.padding(0.2);
|
| 236 |
+
|
| 237 |
+
// Fixed y-axis ranges based on metric type
|
| 238 |
+
const yDomains = {
|
| 239 |
+
'llm_score_concept': [0, 2],
|
| 240 |
+
'llm_score_instruction': [0, 2],
|
| 241 |
+
'llm_score_fluency': [0, 2],
|
| 242 |
+
'mean_llm_score': [0, 2],
|
| 243 |
+
'harmonic_llm_score': [0, 2],
|
| 244 |
+
'eiffel': [0, 1],
|
| 245 |
+
'minus_log_prob': [0, 2],
|
| 246 |
+
'rep3': [0, 0.5]
|
| 247 |
+
};
|
| 248 |
+
|
| 249 |
+
const y = d3.scaleLinear()
|
| 250 |
+
.domain(yDomains[metric.key] || [0, 1])
|
| 251 |
+
.range([innerHeight, 0]);
|
| 252 |
+
|
| 253 |
+
// Grid
|
| 254 |
+
gGrid.selectAll('*').remove();
|
| 255 |
+
gGrid.selectAll('line')
|
| 256 |
+
.data(y.ticks(4))
|
| 257 |
+
.join('line')
|
| 258 |
+
.attr('x1', 0)
|
| 259 |
+
.attr('x2', innerWidth)
|
| 260 |
+
.attr('y1', d => y(d))
|
| 261 |
+
.attr('y2', d => y(d));
|
| 262 |
+
|
| 263 |
+
// Axes
|
| 264 |
+
gAxes.selectAll('*').remove();
|
| 265 |
+
|
| 266 |
+
const xAxis = gAxes.append('g')
|
| 267 |
+
.attr('transform', `translate(0,${innerHeight})`)
|
| 268 |
+
.call(d3.axisBottom(x).tickSize(3));
|
| 269 |
+
|
| 270 |
+
// Only show labels for visible experiments
|
| 271 |
+
xAxis.selectAll('text')
|
| 272 |
+
.attr('transform', 'rotate(-45)')
|
| 273 |
+
.style('text-anchor', 'end')
|
| 274 |
+
.attr('dx', '-0.5em')
|
| 275 |
+
.attr('dy', '0.15em')
|
| 276 |
+
.style('opacity', function() {
|
| 277 |
+
const text = d3.select(this).text();
|
| 278 |
+
return visibleExperiments.includes(text) ? 1 : 0;
|
| 279 |
+
});
|
| 280 |
+
|
| 281 |
+
gAxes.append('g')
|
| 282 |
+
.call(d3.axisLeft(y).ticks(4).tickFormat(metric.format).tickSize(3));
|
| 283 |
+
|
| 284 |
+
// Draw bars (only for visible experiments)
|
| 285 |
+
const bars = [];
|
| 286 |
+
visibleExperiments.forEach(exp => {
|
| 287 |
+
const d = data[metric.key]?.[exp];
|
| 288 |
+
if (d) {
|
| 289 |
+
bars.push({
|
| 290 |
+
experiment: exp,
|
| 291 |
+
mean: d.mean,
|
| 292 |
+
std: d.std,
|
| 293 |
+
color: allColors[exp],
|
| 294 |
+
x: x(exp),
|
| 295 |
+
y: y(d.mean),
|
| 296 |
+
width: x.bandwidth(),
|
| 297 |
+
height: innerHeight - y(d.mean)
|
| 298 |
+
});
|
| 299 |
+
}
|
| 300 |
+
});
|
| 301 |
+
|
| 302 |
+
gBars.selectAll('rect')
|
| 303 |
+
.data(bars)
|
| 304 |
+
.join('rect')
|
| 305 |
+
.attr('class', 'bar')
|
| 306 |
+
.attr('x', d => d.x)
|
| 307 |
+
.attr('y', d => d.y)
|
| 308 |
+
.attr('width', d => d.width)
|
| 309 |
+
.attr('height', d => d.height)
|
| 310 |
+
.attr('fill', d => d.color)
|
| 311 |
+
.attr('rx', 2)
|
| 312 |
+
.classed('dimmed', d => hoveredExperiment && d.experiment !== hoveredExperiment)
|
| 313 |
+
.on('mouseenter', (event, d) => {
|
| 314 |
+
hoveredExperiment = d.experiment;
|
| 315 |
+
|
| 316 |
+
// Show value label on bar
|
| 317 |
+
gLabels.selectAll('text').remove();
|
| 318 |
+
gLabels.append('text')
|
| 319 |
+
.attr('x', d.x + d.width / 2)
|
| 320 |
+
.attr('y', d.y - 5)
|
| 321 |
+
.attr('text-anchor', 'middle')
|
| 322 |
+
.attr('fill', 'var(--text-color)')
|
| 323 |
+
.attr('font-size', '11px')
|
| 324 |
+
.attr('font-weight', '600')
|
| 325 |
+
.text(metric.format(d.mean));
|
| 326 |
+
|
| 327 |
+
updateAll();
|
| 328 |
+
tooltip
|
| 329 |
+
.style('opacity', 1)
|
| 330 |
+
.html(`
|
| 331 |
+
<div><strong>${d.experiment}</strong></div>
|
| 332 |
+
<div style="margin-top: 4px;">${metric.label}</div>
|
| 333 |
+
<div style="margin-top: 4px;"><strong>Mean:</strong> ${metric.format(d.mean)}</div>
|
| 334 |
+
<div><strong>Std:</strong> ${metric.format(d.std)}</div>
|
| 335 |
+
`);
|
| 336 |
+
})
|
| 337 |
+
.on('mousemove', (event) => {
|
| 338 |
+
const [mx, my] = d3.pointer(event, container);
|
| 339 |
+
tooltip.style('transform', `translate(${mx + 10}px, ${my + 10}px)`);
|
| 340 |
+
})
|
| 341 |
+
.on('mouseleave', () => {
|
| 342 |
+
hoveredExperiment = null;
|
| 343 |
+
gLabels.selectAll('text').remove();
|
| 344 |
+
updateAll();
|
| 345 |
+
tooltip.style('opacity', 0).style('transform', 'translate(-9999px, -9999px)');
|
| 346 |
+
});
|
| 347 |
+
|
| 348 |
+
// Error bars
|
| 349 |
+
gErrorBars.selectAll('line')
|
| 350 |
+
.data(bars)
|
| 351 |
+
.join('line')
|
| 352 |
+
.attr('x1', d => d.x + d.width / 2)
|
| 353 |
+
.attr('x2', d => d.x + d.width / 2)
|
| 354 |
+
.attr('y1', d => y(d.mean + d.std))
|
| 355 |
+
.attr('y2', d => y(Math.max(0, d.mean - d.std)))
|
| 356 |
+
.attr('stroke', '#666')
|
| 357 |
+
.attr('stroke-width', 1.5)
|
| 358 |
+
.attr('opacity', 0.6);
|
| 359 |
+
|
| 360 |
+
// Error bar caps
|
| 361 |
+
gErrorBars.selectAll('.cap-top')
|
| 362 |
+
.data(bars)
|
| 363 |
+
.join('line')
|
| 364 |
+
.attr('class', 'cap-top')
|
| 365 |
+
.attr('x1', d => d.x + d.width / 2 - 3)
|
| 366 |
+
.attr('x2', d => d.x + d.width / 2 + 3)
|
| 367 |
+
.attr('y1', d => y(d.mean + d.std))
|
| 368 |
+
.attr('y2', d => y(d.mean + d.std))
|
| 369 |
+
.attr('stroke', '#666')
|
| 370 |
+
.attr('stroke-width', 1.5)
|
| 371 |
+
.attr('opacity', 0.6);
|
| 372 |
+
|
| 373 |
+
gErrorBars.selectAll('.cap-bottom')
|
| 374 |
+
.data(bars)
|
| 375 |
+
.join('line')
|
| 376 |
+
.attr('class', 'cap-bottom')
|
| 377 |
+
.attr('x1', d => d.x + d.width / 2 - 3)
|
| 378 |
+
.attr('x2', d => d.x + d.width / 2 + 3)
|
| 379 |
+
.attr('y1', d => y(Math.max(0, d.mean - d.std)))
|
| 380 |
+
.attr('y2', d => y(Math.max(0, d.mean - d.std)))
|
| 381 |
+
.attr('stroke', '#666')
|
| 382 |
+
.attr('stroke-width', 1.5)
|
| 383 |
+
.attr('opacity', 0.6);
|
| 384 |
+
};
|
| 385 |
+
});
|
| 386 |
+
|
| 387 |
+
const updateAll = () => {
|
| 388 |
+
gridContainer.querySelectorAll('.subplot').forEach(subplot => {
|
| 389 |
+
if (subplot._render) subplot._render();
|
| 390 |
+
});
|
| 391 |
+
|
| 392 |
+
};
|
| 393 |
+
|
| 394 |
+
updateAll();
|
| 395 |
+
|
| 396 |
+
if (window.ResizeObserver) {
|
| 397 |
+
const ro = new ResizeObserver(() => updateAll());
|
| 398 |
+
ro.observe(container);
|
| 399 |
+
} else {
|
| 400 |
+
window.addEventListener('resize', updateAll);
|
| 401 |
+
}
|
| 402 |
+
})
|
| 403 |
+
.catch(err => {
|
| 404 |
+
container.innerHTML = `<div style="color: red; padding: 20px;">Error: ${err.message}</div>`;
|
| 405 |
+
});
|
| 406 |
+
};
|
| 407 |
+
|
| 408 |
+
if (document.readyState === 'loading') {
|
| 409 |
+
document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
|
| 410 |
+
} else {
|
| 411 |
+
ensureD3(bootstrap);
|
| 412 |
+
}
|
| 413 |
+
})();
|
| 414 |
+
</script>
|