Update app/src/content/chapters/intro.mdx

#1
by guipenedo HF Staff - opened
README.md CHANGED
@@ -5,7 +5,7 @@ emoji: 📝
5
  colorFrom: blue
6
  colorTo: indigo
7
  sdk: docker
8
- pinned: true
9
  header: mini
10
  app_port: 8080
11
  tags:
 
5
  colorFrom: blue
6
  colorTo: indigo
7
  sdk: docker
8
+ pinned: false
9
  header: mini
10
  app_port: 8080
11
  tags:
app/src/content/article.mdx CHANGED
@@ -1,12 +1,12 @@
1
  ---
2
- title: "The LLM Evaluation Guidebook"
3
- subtitle: "All the things you could want to know about LLM evaluation based on our experience scoring 15000 models over 3 years"
4
  description: "Understanding the tips and tricks of evaluating an LLM in 2025"
5
  authors:
6
  - name: "Clémentine Fourrier"
7
  url: "https://huggingface.co/clefourrier"
8
  affiliations: [1]
9
- - name: "Thibaud Frere"
10
  url: "https://huggingface.co/tfrere"
11
  affiliations: [1]
12
  - name: "Guilherme Penedo"
@@ -18,7 +18,7 @@ authors:
18
  affiliations:
19
  - name: "Hugging Face"
20
  url: "https://huggingface.co"
21
- published: "Dec. 03, 2025"
22
  tags:
23
  - research
24
  - evaluation
@@ -47,21 +47,7 @@ Now that you have an idea of why evaluation is important to different people, le
47
 
48
  ## Evaluating with existing benchmarks
49
 
50
- Now that you've gotten (re)acquainted with required basics on how tokenization and inference work, and what are the caveats when doing evalution, let's look at actual benchmarking! We'll first do a small tour of 2025 evaluations, then discuss what to look at in a benchamrk, and why you probably can't reproduce announcements scores. Lastly, we'll cover the special case of selecting good benchmark to evaluate training with the FineWeb team.
51
-
52
- <Note title="Important concepts" emoji="⚠️" variant="info">
53
- In this section, you'll see two concepts mentionned quite a lot: contamination and saturation.
54
-
55
- **Saturation** is when model performance on a benchmark passes human performance. More generally, the term is used for datasets that are no longer considered useful, as they have lost discriminative power between models.
56
- <Sidenote> It's what you observe in the banner picture! </Sidenote>
57
-
58
- *If all models have close to the highest possible score on your evaluation, it's no longer a discriminative benchmark. It's similar to evaluating high school students on pre-school problems: success tells you nothing (though failure is indicative).*
59
-
60
- **Contamination** is when an evaluation dataset ended up in the training dataset of models, in which case the performance of models is artificially inflated, and does not reflect real world performance on the task.
61
-
62
- *It's a bit like evaluating a student on questions it already knows in advance.*
63
-
64
- </Note>
65
 
66
  ### Benchmarks to know in 2025
67
 
@@ -151,6 +137,3 @@ Key things I hope you'll remember are:
151
  To conclude: The models we build are only as good as our ability to measure what matters. Thanks for reading!
152
 
153
 
154
- ### Acknowledgments
155
-
156
- Many thanks to all the people who contributed directly or indirectly to this document, notably Hynek Kydlicek, Loubna Ben Allal, Sander Land and Nathan Habib.
 
1
  ---
2
+ title: "The Evaluation Guidebook"
3
+ subtitle: "Understanding the tips and tricks of evaluating an LLM in 2025"
4
  description: "Understanding the tips and tricks of evaluating an LLM in 2025"
5
  authors:
6
  - name: "Clémentine Fourrier"
7
  url: "https://huggingface.co/clefourrier"
8
  affiliations: [1]
9
+ - name: "Thibaud Frère"
10
  url: "https://huggingface.co/tfrere"
11
  affiliations: [1]
12
  - name: "Guilherme Penedo"
 
18
  affiliations:
19
  - name: "Hugging Face"
20
  url: "https://huggingface.co"
21
+ published: "Dec. 01, 2025"
22
  tags:
23
  - research
24
  - evaluation
 
47
 
48
  ## Evaluating with existing benchmarks
49
 
50
+ Now that you've gotten (re)acquainted with the required basics of how tokenization and inference work, and with the caveats of doing evaluation, let's look at actual benchmarking! We'll first do a small tour of 2025 evaluations, then discuss what to look for in a benchmark, and why you probably can't reproduce announcement scores. Lastly, we'll cover the special case of selecting good benchmarks to evaluate training, with the FineWeb team.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
51
 
52
  ### Benchmarks to know in 2025
53
 
 
137
  To conclude: The models we build are only as good as our ability to measure what matters. Thanks for reading!
138
 
139
 
 
 
 
app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx CHANGED
@@ -8,7 +8,6 @@ import HtmlEmbed from "../../../components/HtmlEmbed.astro";
8
  import Image from "../../../components/Image.astro";
9
  import UsingHumanAnnotators from "../human-evaluation/using-human-annotators.mdx";
10
  import envImage from '../../assets/image/env.png';
11
- import Wide from "../../../components/Wide.astro";
12
 
13
  ### Dataset
14
 
@@ -24,8 +23,6 @@ When aggregating datasets, pay attention to whether
24
 
25
  <Sidenote>Examples: MMLU, Big-Bench (hundreds of diverse tasks), and HELM (combines multiple existing benchmarks for holistic evaluation)</Sidenote>
26
 
27
- New research by EpochAI (2025) showcases how to [best aggregate benchmarks together under a single framework](https://epoch.ai/blog/a-rosetta-stone-for-ai-benchmarks) to make the aggregated dataset harder overall and less prone to saturation.
28
-
29
  <UsingHumanAnnotators />
30
 
31
  #### Creating a dataset synthetically
@@ -33,7 +30,7 @@ New research by EpochAI (2025) showcases how to [best aggregate benchmarks toget
33
 
34
  If your task allows, using procedurally generated benchmarks is a very good way to get a virtually infinite supply of samples and avoid contamination! They can generate unlimited fresh test cases algorithmically, while controlling difficulty and enabling automatic verification, ensuring models haven't seen examples during training.
35
 
36
- For some examples, you can look at [NPHardEval](https://arxiv.org/abs/2312.14890), [DyVal](https://arxiv.org/abs/2309.17167), [MuSR](https://arxiv.org/abs/2310.16049), [BabiQA](https://arxiv.org/abs/1502.05698), [ZebraLogic](https://arxiv.org/pdf/2502.01100), IFEval, or GSMTemplate among others. **NPHardEval** generates complexity-grounded tasks like graph problems with automatic verification and monthly refreshes to reduce overfitting. **MuSR** creates complex reasoning instances like 1000-word murder mysteries using neurosymbolic generation. **ZebraLogic** algorithmically produces logic grid puzzles by generating solutions and iteratively minimizing clues using SAT solvers. **BabiQA** simulates entities following successions of actions. **IFEval** tests instruction-following with 500+ prompts containing verifiable constraints like word counts that can be checked programmatically. **GSM-Symbolic** uses templates to generate diverse math questions.
37
 
38
  Tasks which usually fit this paradigm test mathematical, logical, or coding abilities.
39
 
@@ -236,9 +233,7 @@ However, when doing evaluation with humans, you need to make sure your annotator
236
 
237
  Different approaches exist to evaluate models with humans in the loop.
238
 
239
- **Vibe-checks** is the name given to manual evaluations done by individual members of the community, usually on undisclosed prompts, to get an overall "feeling" of how well models perform on their use cases of preference. (I've also seen the term "canary-testing" used for this, in reference to high signal canary in a coalmine approach). Said use cases can be anything from the most exciting to the most mundate - to cite some I've seen on Reddit, they covered legal questions in German, coding, tool use, quality of erotica written, etc. Often shared on forums or social media, they mostly constitute anecdotal evidence, and tend to be highly sensitive to confirmation bias (in other words, people tend to find what they look for).
240
-
241
- <HtmlEmbed src="d3-vibe-checks.html"/>
242
 
243
  Using community feedback to establish massive model rankings is what we call an **arena**. A well known example of this is the [LMSYS chatbot arena](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard), where community users are asked to chat with models until they find one is better than the other. Votes are then aggregated in an Elo ranking (a ranking of matches) to select which model is "the best". The obvious problem of such an approach is the high subjectivity - it's hard to enforce a consistent grading from many community members using broad guidelines, especially since annotators preferences tend to be [culturally bound](https://arxiv.org/abs/2404.16019v1) (with different people favoring different discussion topics, for example). One can hope that this effect is smoothed over by the sheer scale of the votes, through a "wisdom of the crowd" effect (this effect was found by a statistician named Galton, who observed that individual answers trying to estimate a numerical value, like the weight of a hog, could be modeled as a probability distribution centered around the actual answer).
244
 
@@ -252,9 +247,11 @@ Once you want to scale to more systematic evaluation with paid annotators, you'l
252
  Pros of systematic human evaluations, especially with paid annotators, are that you're **getting high quality and private data** adapted to your use case (especially if you rely on in house annotators), which are mostly **explainable** (scores obtained by the models will be explainable by the humans who gave them).
253
  However, it's more costly (especially as you'll most likely need rounds of annotations to adapt your guidelines) and does not scale well.
254
 
255
- Overall, however, human evaluation has a number of well known biases, based first impressions, tone, alignement with annotators value, etc, see the figure below.
256
-
257
- <HtmlEmbed src="d3-human-biases.html"/>
 
 
258
 
259
  These biases are not unexpected, but they must be taken into account: not all use cases should rely on using cheap human annotators - any task requiring factuality (such as code writing, evaluation of model knowledge, etc) should include another, more robust, type of evaluation to complete the benchmark (experts, automatic metrics if applicable, etc).
260
 
@@ -285,10 +282,8 @@ People in favor of judge LLMs have been claiming they provide better:
285
 
286
  In my opinion, using LLM judges correctly is extremely tricky, and it's **easy to be deceived for critical use cases**:
287
  - LLM as judges seem objective, but they have many **hidden biases** that can be harder to detect than the ones in humans, since we're not as actively looking for them (see below). Besides, there are ways to reduce human bias by designing survey questions in specific and statistically robust ways (which has been studied in sociology for about a century), where LLM-prompting is not as robust yet. Using LLMs to evaluate LLMs has been compared to creating an echo-chamber effect, by reinforcing biases subtly.
288
- - They are indeed scalable, but contribute to creating **massive amounts of data** which themselves need to be examined to ensure their quality (for example, you can improve the quality of LLM-judges by asking them to generate a thinking trace, or reasoning around their data, which makes even more new artificial data to analyse)
289
- - They are indeed cheap to instantiate, but are not as good as paying actual expert human annotators for your specific use cases.
290
-
291
- <HtmlEmbed src="d3-llm-biases.html"/>
292
 
293
  This section is therefore a bit long, because you need to be well aware of the limitations of using models as judges: a lot of people are blindly jumping into using them because they seem easier than actually working with humans or designing new metrics, but then end up with uninterpretable data with tricky biases to extract.
294
 
@@ -421,27 +416,41 @@ You need to decide what your threshold for acceptance is. Depending on how hard
421
 
422
  #### Tips and tricks
423
 
424
- <Note title="Mitigating well known biases of LLM as judges" emoji="⚠️" variant="warning">
425
- We discussed in this section's [intro](http://localhost:4321/#pros-and-cons-of-using-judge-llms) a number of LLM judges biases. Let's see how you should try to mitigate them.
426
 
 
427
  **Lack of internal consistency**:
 
 
428
  ➡️ You can mitigate this by doing self-consistency prompting of your judge, prompting it multiple times and keeping the majority output
429
 
430
- **Self-preference**:
 
 
431
  ➡️ You can mitigate this by using a jury
432
 
433
- **Blindness to input perturbation**:
 
 
 
 
434
  ➡️ asking the model to explain its reasoning [before providing a score](https://twitter.com/seungonekim/status/1749289437165769177)
435
  ➡️ or providing a coherent grading scale in the prompt.
436
 
437
- **Position-bias**:
 
438
  ➡️ switching answer positions randomly
439
  ➡️ computing the log-probabilities of all possible choices to get a normalized answer
440
 
441
- **Verbosity-bias** (or length-bias):
 
442
  ➡️ You can mitigate this by [accounting for the answer difference in length](https://arxiv.org/abs/2404.04475)
443
 
444
- **Format bias**:
 
 
 
 
445
  ➡️ You can mitigate this by paying attention to the training prompt format (if the model was instruction tuned) and ensuring you follow it.
446
  </Note>
447
 
@@ -554,11 +563,9 @@ You can also compute these with prompt variations, by asking the same questions
554
  ### Cost and efficiency
555
 
556
 
557
- When designing and reporting evaluation results, we need to start collectively reporting results against model running costs! A reasoning model which requires 10 minutes of thinking and 10K tokens to answer 10 + 1 (because it decides to make an entire segue on binary vs decimal arithmetics) is considerably less efficient than a smol model answering 30 in a handful of tokens.
558
 
559
- <div className="card" style="height: fit-content; max-width: 75%; margin: 40px auto;">
560
- <img src={envImage.src} alt="Environmental impact metrics for model evaluation" style="height: auto !important; object-fit: contain !important; display: block; margin: 0 auto;" />
561
- </div>
562
 
563
  We suggest you report the following:
564
  - **Token consumption**: Report the total number of output tokens used during evaluation. This is particularly important to estimate **efficiency**, and it will affect the cost of model as judge evaluations. Token counts directly impact monetary costs and help others estimate the computational requirements. **Monetary cost** can also be a good proxy for efficiency.
 
8
  import Image from "../../../components/Image.astro";
9
  import UsingHumanAnnotators from "../human-evaluation/using-human-annotators.mdx";
10
  import envImage from '../../assets/image/env.png';
 
11
 
12
  ### Dataset
13
 
 
23
 
24
  <Sidenote>Examples: MMLU, Big-Bench (hundreds of diverse tasks), and HELM (combines multiple existing benchmarks for holistic evaluation)</Sidenote>
25
 
 
 
26
  <UsingHumanAnnotators />
27
 
28
  #### Creating a dataset synthetically
 
30
 
31
  If your task allows, using procedurally generated benchmarks is a very good way to get a virtually infinite supply of samples and avoid contamination! They can generate unlimited fresh test cases algorithmically, while controlling difficulty and enabling automatic verification, ensuring models haven't seen examples during training.
32
 
33
+ For some examples, you can look at [NPHardEval](https://arxiv.org/abs/2312.14890), [DyVal](https://arxiv.org/abs/2309.17167), [MuSR](https://arxiv.org/abs/2310.16049), [BabiQA](https://arxiv.org/abs/1502.05698), [ZebraLogic](https://arxiv.org/pdf/2502.01100), IFEval, or GSM-Symbolic among others. **NPHardEval** generates complexity-grounded tasks like graph problems with automatic verification and monthly refreshes to reduce overfitting. **MuSR** creates complex reasoning instances like 1000-word murder mysteries using neurosymbolic generation. **ZebraLogic** algorithmically produces logic grid puzzles by generating solutions and iteratively minimizing clues using SAT solvers. **BabiQA** simulates worlds with entities and actions, with Dyna-bAbI providing fine-grained control over task generation. **IFEval** tests instruction-following with 500+ prompts containing verifiable constraints like word counts that can be checked programmatically. **GSM-Symbolic** uses templates to generate diverse math questions for controllable evaluation.
34
 
35
  Tasks which usually fit this paradigm test mathematical, logical, or coding abilities.
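
To make this concrete, here is a minimal, hypothetical sketch of the procedural idea (not taken from any of the benchmarks above): a templated arithmetic question whose entities and numbers are sampled at generation time, so the gold answer is computed rather than annotated, and verification is a simple programmatic check.

```python
import random

# Hypothetical template: names, items and numbers are drawn at generation time,
# so every sample is fresh and the gold answer is computed, not annotated.
NAMES = ["Ada", "Bo", "Chen"]
ITEMS = ["apples", "marbles", "stickers"]

def generate_sample(rng: random.Random) -> dict:
    name, item = rng.choice(NAMES), rng.choice(ITEMS)
    start, bought, given = rng.randint(5, 50), rng.randint(1, 20), rng.randint(1, 5)
    question = (
        f"{name} has {start} {item}, buys {bought} more, "
        f"then gives {given} away. How many {item} does {name} have now?"
    )
    return {"question": question, "answer": start + bought - given}

def is_correct(model_output: str, gold: int) -> bool:
    # Automatic verification: compare the last number in the output to the gold answer.
    tokens = [t.strip(".,!?") for t in model_output.split()]
    numbers = [t for t in tokens if t.lstrip("-").isdigit()]
    return bool(numbers) and int(numbers[-1]) == gold

rng = random.Random(0)
sample = generate_sample(rng)
print(sample["question"])
print(is_correct("The answer is 42.", sample["answer"]))
```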
36
 
 
233
 
234
  Different approaches exist to evaluate models with humans in the loop.
235
 
236
+ **Vibe-checks** is the name given to manual evaluations done by individual members of the community, usually on undisclosed prompts, to get an overall "feeling" of how well models perform on their use cases of preference. (I've also seen the term "canary-testing" used for this, in reference to the high-signal canary-in-a-coal-mine approach.) Said use cases can be anything from the most exciting to the most mundane - to cite some I've seen on Reddit, they covered legal questions in German, coding, the ability to generate tikz unicorns, tool use, the quality of the erotica written, etc. Often shared on forums or social media, they mostly constitute anecdotal evidence, and tend to be highly sensitive to confirmation bias (in other words, people tend to find what they look for).
 
 
237
 
238
  Using community feedback to establish massive model rankings is what we call an **arena**. A well known example of this is the [LMSYS chatbot arena](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard), where community users are asked to chat with models until they find one is better than the other. Votes are then aggregated in an Elo ranking (a ranking of matches) to select which model is "the best". The obvious problem of such an approach is the high subjectivity - it's hard to enforce a consistent grading from many community members using broad guidelines, especially since annotators preferences tend to be [culturally bound](https://arxiv.org/abs/2404.16019v1) (with different people favoring different discussion topics, for example). One can hope that this effect is smoothed over by the sheer scale of the votes, through a "wisdom of the crowd" effect (this effect was found by a statistician named Galton, who observed that individual answers trying to estimate a numerical value, like the weight of a hog, could be modeled as a probability distribution centered around the actual answer).
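
For intuition, here is a bare-bones sketch of how such pairwise votes can be turned into a ranking with Elo-style updates. It is a simplification (the arena itself relies on a more careful statistical fit rather than sequential online updates), and the model names and votes are made up.

```python
from collections import defaultdict

K = 32  # update size; a common default in Elo-style ratings

def elo_update(ratings: dict, winner: str, loser: str) -> None:
    """One pairwise vote: move the winner up and the loser down."""
    expected_win = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += K * (1 - expected_win)
    ratings[loser] -= K * (1 - expected_win)

ratings = defaultdict(lambda: 1000.0)  # every model starts at the same score
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]
for winner, loser in votes:
    elo_update(ratings, winner, loser)

for model, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {score:.0f}")
```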
239
 
 
247
  Pros of systematic human evaluations, especially with paid annotators, are that you're **getting high quality and private data** adapted to your use case (especially if you rely on in house annotators), which are mostly **explainable** (scores obtained by the models will be explainable by the humans who gave them).
248
  However, it's more costly (especially as you'll most likely need rounds of annotations to adapt your guidelines) and does not scale well.
249
 
250
+ Overall, however, human evaluation has a number of well known biases:
251
+ - **First impressions bias**: Human evaluators tend to estimate the quality of answers [based on first impressions](https://arxiv.org/abs/2309.16349), instead of actual factuality or faithfulness.
252
+ - **Tone bias**: Crowdsourced annotators are notably very sensitive to tone, and underestimate the number of factual or logical errors in an assertive answer. In other terms, if a model says wrong things in a confident tone, human evaluators are much less likely to notice it, which could skew ratings towards the more assertive models. (Expert annotators are less likely to fall prey to these biases.)
253
+ - **Self-preference bias**: Humans are [most likely to prefer answers which appeal to their views or align with their opinions or errors](https://arxiv.org/abs/2310.13548), rather than answers which are factually correct.
254
+ - **Identity bias**: People with different identities tend to have different values, and rate model answers very differently (for example on [toxicity](https://arxiv.org/abs/2205.00501))
255
 
256
  These biases are not unexpected, but they must be taken into account: not all use cases should rely on using cheap human annotators - any task requiring factuality (such as code writing, evaluation of model knowledge, etc) should include another, more robust, type of evaluation to complete the benchmark (experts, automatic metrics if applicable, etc).
257
 
 
282
 
283
  In my opinion, using LLM judges correctly is extremely tricky, and it's **easy to be deceived for critical use cases**:
284
  - LLM as judges seem objective, but they have many **hidden biases** that can be harder to detect than the ones in humans, since we're not as actively looking for them (see below). Besides, there are ways to reduce human bias by designing survey questions in specific and statistically robust ways (which has been studied in sociology for about a century), where LLM-prompting is not as robust yet. Using LLMs to evaluate LLMs has been compared to creating an echo-chamber effect, by reinforcing biases subtly.
285
+ - They are indeed scalable, but contribute to creating massive amounts of data which themselves need to be examined to ensure their quality (for example, you can improve the quality of LLM-judges by asking them to generate a thinking trace, or reasoning around their data, which makes even more new artificial data to analyse)
286
+ - They are indeed cheap to instantiate, but paying actual expert human annotators is likely to give you qualitatively better results for your specific use cases.
 
 
287
 
288
  This section is therefore a bit long, because you need to be well aware of the limitations of using models as judges: a lot of people are blindly jumping into using them because they seem easier than actually working with humans or designing new metrics, but then end up with uninterpretable data with tricky biases to extract.
289
 
 
416
 
417
  #### Tips and tricks
418
 
419
+ **Mitigating well known biases of LLM as judges**
 
420
 
421
+ <Note title="Known LLM judge biases and mitigations" emoji="⚠️" variant="warning">
422
  **Lack of internal consistency**:
423
+
424
+ A judge might give you different judgments if you prompt it several times (if the temperature is not 0)
425
  ➡️ You can mitigate this by doing self-consistency prompting of your judge, prompting it multiple times and keeping the majority output
426
 
427
+ **Self-preference**
428
+
429
+ Models tend to [favor their own outputs](https://arxiv.org/abs/2404.13076) when scoring answers
430
  ➡️ You can mitigate this by using a jury
431
 
432
+ **Blindness to input perturbation**
433
+
434
+ Models are bad at identifying [perturbed input](https://arxiv.org/abs/2406.13439) and, relatedly, [bad at providing consistent score ranges](https://twitter.com/aparnadhinak/status/1748368364395721128) (extended experiments on this [here](https://github.com/LeonEricsson/llmjudge/blob/main/README.md)). For example, if asked to grade the quality of texts to which noise has been added on a consistent scale, the predicted grades do not reflect this scale.
435
+
436
+ Mitigations:
437
  ➡️ asking the model to explain its reasoning [before providing a score](https://twitter.com/seungonekim/status/1749289437165769177)
438
  ➡️ or providing a coherent grading scale in the prompt.
439
 
440
+ **Position-bias**.
441
+ Models tend to [favor specific answer positions](https://arxiv.org/abs/2306.05685). For example, when presented with pairwise comparisons, Claude and GPT3.5 each tend to systematically prefer either the first or the second choice. Mitigations:
442
  ➡️ switching answer positions randomly
443
  ➡️ computing the log-probabilities of all possible choices to get a normalized answer
444
 
445
+ **Verbosity-bias** (or length-bias)
446
+ Models tend to like more verbose answers
447
  ➡️ You can mitigate this by [accounting for the answer difference in length](https://arxiv.org/abs/2404.04475)
448
 
449
+ **Debatable consistency [with human answers](https://arxiv.org/abs/2308.15812):**
450
+ <Sidenote> However, it's also [debatable if non-expert humans are a good baseline for absolutely all evaluations](https://arxiv.org/abs/2202.06935). For some specific domains (medical, legal, mathematics, etc), relying on non-expert human annotators is as bad a baseline as using an LLM directly.</Sidenote>
451
+
452
+ **Format bias**
453
+ Models tend to fail to evaluate accurately if the prompt format [is too far away](https://arxiv.org/abs/2310.17631) from what they've been trained with. For example, a model trained to do pairwise comparisons with an added reference answer will fail if said answer is not provided, and vice versa.
454
  ➡️ You can mitigate this by paying attention to the training prompt format (if the model was instruction tuned) and ensuring you follow it.
455
  </Note>
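
Several of these mitigations are cheap to wire together. Below is a hypothetical sketch combining self-consistency (majority vote over repeated judgments) with random position-swapping for pairwise judging; `call_judge` is a placeholder for whatever judge API or local model you actually use.

```python
import random
from collections import Counter

def call_judge(prompt: str) -> str:
    """Placeholder for your judge call (API or local model); returns the raw reply."""
    raise NotImplementedError

def judge_pair(question: str, answer_a: str, answer_b: str,
               n_samples: int = 5, rng: random.Random | None = None) -> str:
    rng = rng or random.Random(0)
    votes = []
    for _ in range(n_samples):
        # Position-bias mitigation: randomly swap which answer is shown first.
        swapped = rng.random() < 0.5
        first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
        prompt = (
            f"Question: {question}\n"
            f"Answer 1: {first}\nAnswer 2: {second}\n"
            "Which answer is better? Reply with '1' or '2' only."
        )
        picked_first = call_judge(prompt).strip().startswith("1")
        # Map the judge's pick back to the original A/B labels.
        votes.append("B" if picked_first == swapped else "A")
    # Self-consistency: keep the majority verdict over the sampled judgments.
    return Counter(votes).most_common(1)[0][0]
```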
456
 
 
563
  ### Cost and efficiency
564
 
565
 
566
+ When designing and reporting evaluation results, we need to start collectively reporting results against model running costs! A reasoning model which requires 10 minutes of thinking and 10K tokens to answer 10 + 20 (because it decides to make an entire segue on binary vs decimal arithmetics) is considerably less efficient than a smol model answering 30 in a handful of tokens.
567
 
568
+ <img src={envImage.src} alt="Environmental impact metrics for model evaluation" style="max-width: 400px; height: auto; display: block; margin: 0 auto;" />
 
 
569
 
570
  We suggest you report the following:
571
  - **Token consumption**: Report the total number of output tokens used during evaluation. This is particularly important to estimate **efficiency**, and it will affect the cost of model as judge evaluations. Token counts directly impact monetary costs and help others estimate the computational requirements. **Monetary cost** can also be a good proxy for efficiency.
app/src/content/chapters/general-knowledge/2025-evaluations-for-useful-models.mdx CHANGED
@@ -5,11 +5,7 @@ title: "2025 evaluations"
5
  import Note from "../../../components/Note.astro";
6
  import Sidenote from "../../../components/Sidenote.astro";
7
 
8
- You can evaluate **specific capabilities** on their own - it's usually quite interesting to get signal when training, or when comparing base/pretrained models. (However, if you select and validate your training methods with the following evaluations, reporting on them on the final model is slightly biased as you have already oriented your training method towards good results on them).
9
-
10
- <Note>
11
- Feel free to skim this section if you're not very familiar with evaluation yet, and come back to it once you need to find a dataset for a specific capability :)
12
- </Note>
13
 
14
  #### Reasoning and commonsense
15
 
@@ -118,8 +114,6 @@ I believe that **assistant tasks** are going to be one of the main ways to do ne
118
 
119
  It was later replicated in [BrowseComp](https://cdn.openai.com/pdf/5e10f4ab-d6f7-442e-9508-59515c65e35d/browsecomp.pdf) (2025) which tests the same thing (can a model find the adequate answer to a specific query using tools and online information) but does not guarantee uniqueness of result, as questions were constructed by starting from the result and building a question from it, with varying levels of difficulty: for example, from a specific paper to retrieve, a question will be created by combining information about metadata, for example "which paper about Topic was published at Conference with one Nationality author and two people from Entity?" However, the benchmark is probably also harder at the moment.
120
 
121
- [**GDPval**](https://arxiv.org/abs/2510.04374) (2025) evaluates models on 44 occupations from the “top industries contributing to US GDP", comparing model performance with human performance using model judges.
122
-
123
  Lastly, [GAIA2](https://huggingface.co/blog/gaia2) went beyond simple information retrieval, using a mock-up mobile environment to test how well assistants are able to correctly answer queries relying on chains of events and tool calls. As of now, time-sensitive and deliberately noisy subsets (mocking up failing API calls) are the hardest for models, while search and execution seem extremely easy for SOTA models.
124
 
125
  **Science assistants**
@@ -132,6 +126,8 @@ Lastly, [GAIA2](https://huggingface.co/blog/gaia2) went beyond simple informatio
132
 
133
  [**DABStep**](https://arxiv.org/abs/2506.23719) (2025) evaluates models on previously private (therefore uncontaminated) operational data analysis workloads using real-life questions and data. All problems require multi-step reasoning and varied document parsing, as well of course as specific data manipulation skills. It's a neat eval because it's hard and replicates actually useful real-world use cases, and because each problem has a ground truth, so evaluation is unbiased and not too costly.
134
 
 
 
135
  Assistant tasks test integrated capabilities in realistic scenarios, but they're either dynamic and read only, or static in environment which doesn't change. To evaluate adaptability and dynamic decision-making, we need environments that can "surprise" the model.
136
 
137
  #### Game based evaluations
@@ -159,9 +155,7 @@ A similar approach is used to generate questions in [Arbitrage](https://arxiv.or
159
 
160
  In a similar vein, you'll also find arenas where LLMs are provided with money to actively trade on financial markets (like Alpha Arena or Trading Agents) - these experiments are less likely to give meaningful results, as, because of their costs, they tend to be run once per model only, so you get no statistical significance there.
161
 
162
- #### Recommendations
163
-
164
- <Note title="TLDR" emoji="🎯" variant="info">
165
  The landscape of evaluation has evolved with the jumps in capabilities, from testing isolated skills to measuring integrated performance in more realistic scenarios.
166
 
167
  As of Nov 2025, I recommend using:
@@ -176,8 +170,4 @@ The field is moving toward evaluations that test capability orchestration rather
176
  <Sidenote>
177
  I hope the field moves towards putting more emphasis on functional testing rather than model judges, and generally understandable datasets and tasks.
178
  </Sidenote>
179
- </Note>
180
-
181
- <Note>
182
- If you want to explore even more datasets, you'll find a big list of older interesting benchmarks [here](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/automated-benchmarks/some-evaluation-datasets.md) with my notes.
183
- </Note>
 
5
  import Note from "../../../components/Note.astro";
6
  import Sidenote from "../../../components/Sidenote.astro";
7
 
8
+ You can evaluate **specific capabilities** on their own - it's usually quite interesting to get signal when training, or when comparing base/pretrained models. (However, if you select and validate your training methods with the following evaluations, reporting on them on the final model is slightly biased as you have already oriented your training method towards good results on them).
 
 
 
 
9
 
10
  #### Reasoning and commonsense
11
 
 
114
 
115
  It was later replicated in [BrowseComp](https://cdn.openai.com/pdf/5e10f4ab-d6f7-442e-9508-59515c65e35d/browsecomp.pdf) (2025) which tests the same thing (can a model find the adequate answer to a specific query using tools and online information) but does not guarantee uniqueness of result, as questions were constructed by starting from the result and building a question from it, with varying levels of difficulty: for example, from a specific paper to retrieve, a question will be created by combining information about metadata, for example "which paper about Topic was published at Conference with one Nationality author and two people from Entity?" However, the benchmark is probably also harder at the moment.
116
 
 
 
117
  Lastly, [GAIA2](https://huggingface.co/blog/gaia2) went beyond simple information retrieval, using a mock-up mobile environment to test how well assistants are able to correctly answer queries relying on chains of events and tool calls. As of now, time-sensitive and deliberately noisy subsets (mocking up failing API calls) are the hardest for models, while search and execution seem extremely easy for SOTA models.
118
 
119
  **Science assistants**
 
126
 
127
  [**DABStep**](https://arxiv.org/abs/2506.23719) (2025) evaluates models on previously private (therefore uncontaminated) operational data analysis workloads using real-life questions and data. All problems require multi-step reasoning and varied document parsing, as well of course as specific data manipulation skills. It's a neat eval because it's hard and replicates actually useful real-world use cases, and because each problem has a ground truth, so evaluation is unbiased and not too costly.
128
 
129
+ [**GDPval**](https://arxiv.org/abs/2510.04374) (2025) evaluates models on 44 occupations from the “top industries contributing to US GDP”, comparing model performance with human performance using model judges.
130
+
131
  Assistant tasks test integrated capabilities in realistic scenarios, but they're either dynamic and read only, or static in environment which doesn't change. To evaluate adaptability and dynamic decision-making, we need environments that can "surprise" the model.
132
 
133
  #### Game based evaluations
 
155
 
156
  In a similar vein, you'll also find arenas where LLMs are provided with money to actively trade on financial markets (like Alpha Arena or Trading Agents) - these experiments are less likely to give meaningful results, as, because of their costs, they tend to be run once per model only, so you get no statistical significance there.
157
 
158
+ <Note title="TLDR" emoji="🎯">
 
 
159
  The landscape of evaluation has evolved with the jumps in capabilities, from testing isolated skills to measuring integrated performance in more realistic scenarios.
160
 
161
  As of Nov 2025, I recommend using:
 
170
  <Sidenote>
171
  I hope the field moves towards putting more emphasis on functional testing rather than model judges, and generally understandable datasets and tasks.
172
  </Sidenote>
173
+ </Note>
 
 
 
 
app/src/content/chapters/general-knowledge/model-inference-and-evaluation.mdx CHANGED
@@ -52,7 +52,7 @@ I would strongly recommend reading a longer explanation on how BPE works, as it'
52
  - [BPE Paper (for text, as the method existed before in other fields)](https://aclanthology.org/P16-1162/)
53
  </Note>
54
 
55
- Building a tokenizer requires making more choices than one would expect. For example, to tokenize numbers, you don't want to use a basic BPE, but do you only index 0 to 9, and assume all other numbers will be compositions of digits? Do you want to store numbers up to, say, one billion, individually?
56
 
57
  Current well known models display a range of approaches to this, but it's unclear what works better to allow mathematical reasoning. This will affect some mathematical evaluation (and is the reason why almost no evaluation is pure arithmetics).
58
 
@@ -67,9 +67,8 @@ Current well known models display a range of approaches to this, but it's unclea
67
 
68
  Pre-2022, models used to simply be pretrained: text in, text out, nothing else. Then, we got instruction tuning and chat models in 2023, and in 2025 reasoning models. This means that we went from using raw text to using more and more formatting.
69
 
70
- <Wide>
71
- <HtmlEmbed src="d3-tokenization-timeline.html" />
72
- </Wide>
73
 
74
  This means a number of models are going to perform terribly if you do not make sure to:
75
  1. respect the format the model expects
@@ -98,7 +97,7 @@ First, as some languages do not always use spacing as a word separator (Korean,
98
  Then, tokenizers in general might be unfair to non-English languages. When training a BPE tokenizer, you use data from the different languages you want to cover, but most of the time, though, this data is unbalanced between languages (with, for example, an order of magnitude more English than Thai, or Burmese). Since BPE tokenizers create their vocabulary tokens based on the most frequent words seen, most of the long tokens will be English words - and most of the words from the less frequent languages will only be split at the character level. This effect leads to an unfairness in multilingual tokenization: some (less frequent, or *lower-resourced*) languages require orders of magnitude more tokens to generate a sentence of equivalent length as English.
99
 
100
  <Wide>
101
- <Reference align="center" caption="Very nice demo by Yennie Jun on tokenization issues across languages">
102
  <iframe
103
  className="card"
104
  src="https://OpenEvals-tokenizers-languages.hf.space"
@@ -133,7 +132,7 @@ From this input text, the LLM generates a probability distribution of the most l
133
 
134
  **Generative evaluations**: Given a prompt, what text does my model generate?
135
 
136
- Choice depends on your task (as we'll see below) and on your model: most models under APIs do not return the logprobabilities, so you'll need to use generative evaluations systematically to evaluate them.
137
 
138
  </Note>
139
 
 
52
  - [BPE Paper (for text, as the method existed before in other fields)](https://aclanthology.org/P16-1162/)
53
  </Note>
54
 
55
+ When building a tokenizer, you have to make more choices than one would expect. For example, to tokenize numbers, you don't want to use a basic BPE, but do you only index 0 to 9, and assume all other numbers will be compositions of digits? Do you want to store numbers up to, say, one billion, individually?
56
 
57
  Current well known models display a range of approaches to this, but it's unclear what works better to allow mathematical reasoning. This will affect some mathematical evaluation (and is the reason why almost no evaluation is pure arithmetics).
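
You can inspect these design choices directly by tokenizing a few numbers yourself. The small sketch below uses gpt2 purely as an example of a public checkpoint; the exact splits you get will differ from tokenizer to tokenizer, which is the point.

```python
from transformers import AutoTokenizer

# Any public checkpoint works; gpt2 is used here only because it is small and ungated.
tok = AutoTokenizer.from_pretrained("gpt2")

for number in ["7", "42", "1234", "1234567", "3.14159"]:
    pieces = tok.tokenize(number)  # how the tokenizer splits the digits is a design decision
    print(f"{number!r:>12} -> {pieces}")
```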
58
 
 
67
 
68
  Pre-2022, models used to simply be pretrained: text in, text out, nothing else. Then, we got instruction tuning and chat models in 2023, and in 2025 reasoning models. This means that we went from using raw text to using more and more formatting.
69
 
70
+ <HtmlEmbed src="d3-tokenization-timeline.html" frameless />
71
+
 
72
 
73
  This means a number of models are going to perform terribly if you do not make sure to:
74
  1. respect the format the model expects
 
97
  Then, tokenizers in general might be unfair to non-English languages. When training a BPE tokenizer, you use data from the different languages you want to cover, but most of the time, though, this data is unbalanced between languages (with, for example, an order of magnitude more English than Thai, or Burmese). Since BPE tokenizers create their vocabulary tokens based on the most frequent words seen, most of the long tokens will be English words - and most of the words from the less frequent languages will only be split at the character level. This effect leads to an unfairness in multilingual tokenization: some (less frequent, or *lower-resourced*) languages require orders of magnitude more tokens to generate a sentence of equivalent length as English.
98
 
99
  <Wide>
100
+ <Reference align="center" caption="OpenEvals-tokenizers-languages">
101
  <iframe
102
  className="card"
103
  src="https://OpenEvals-tokenizers-languages.hf.space"
 
132
 
133
  **Generative evaluations**: Given a prompt, what text does my model generate?
134
 
135
+ Choice depends on your task: multiple-choice questions use log-likelihood, while open-ended tasks require generative evaluation.
136
 
137
  </Note>
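
As an illustration of the log-likelihood option, here is a rough sketch that scores each candidate answer by the log-probability a causal LM assigns to it given the prompt. The model name and question are placeholders; real harnesses additionally handle tokenization at the context/choice boundary, length normalization, and batching.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any causal LM would do for the sketch.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

def choice_loglikelihood(context: str, choice: str) -> float:
    """Sum of log-probs the model assigns to the `choice` tokens, given `context`."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    full_ids = tok(context + choice, return_tensors="pt").input_ids  # naive boundary handling
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probs at each position for predicting the *next* token.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    per_token = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    choice_len = full_ids.shape[1] - ctx_ids.shape[1]
    return per_token[0, -choice_len:].sum().item()

question = "Q: What is the capital of France?\nA:"
choices = [" Paris", " Lyon", " Marseille"]
scores = {c: choice_loglikelihood(question, c) for c in choices}
print(max(scores, key=scores.get))
```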
138
 
app/src/content/chapters/intro.mdx CHANGED
@@ -9,7 +9,7 @@ import Quote from "../../components/Quote.astro";
9
 
10
  ## What is model evaluation about?
11
 
12
- As you navigate the world of LLMs — whether you're training or fine-tuning your own models, selecting one for your application, or trying to understand the state of the field — there is one question you have likely stumbled upon:
13
 
14
  <Quote>
15
  How can one know if a model is *good*?
@@ -17,13 +17,11 @@ How can one know if a model is *good*?
17
 
18
  The answer is (surprisingly given the blog topic) evaluation! It's everywhere: leaderboards ranking models, benchmarks claiming to measure *reasoning*, *knowledge*, *coding abilities* or *math performance*, papers announcing new state-of-the-art results...
19
 
20
- But what is evaluation, really? And what can it really tell you?
21
 
22
- This guide is here to help you understand it all: what evaluation can and cannot do, when to trust different approaches (what their limitations and biases are too!), how to select benchmarks when evaluating a model (and which ones are relevant in 2025), and how to design your own evaluation, if you so want.
23
 
24
- Through the guide, we'll also highlight common pitfalls, tips and tricks from the Open Evals team, and hopefully help you learn how to think critically about the claims made from evaluation results.
25
-
26
- <Sidenote>In this guide, we focus on evaluations for language (mostly natural language), but many principles also apply to other modalities </Sidenote>
27
 
28
  Before we dive into the details, let's quickly look at why people do evaluation, as who you are and what you are working on will determine which evaluations you need to use.
29
 
@@ -32,11 +30,9 @@ Before we dive into the details, let's quickly look at why people do evaluation,
32
  If you are a researcher or engineer creating a new model, your goal is likely to build a strong model that performs well on a set of tasks. For a base model (training from scratch), you want the model to do well on general tasks, measuring a variety of different capabilities. If you are post-training a base model for a specific use case, you probably care more about the performance on that specific task. The way you measure performance, in either case, is through evaluations.
33
 
34
  As you experiment with different architectures, data mixtures, and training recipes, you want to make sure that your changes (choosing different training data, architecture, parameters, etc) have not "broken" the expected performance for a model of these properties, and possibly even improved it. The way you test for the impact of different design choices is through **ablations**: an ablation is an experiment where you typically train a model under a specific setup, evaluate it on your chosen set of tasks, and compare the results to a baseline model.
35
- Therefore, the choice of evaluation tasks is critical for ablations, as they determine what you will be optimizing for as you create your model.
36
-
37
- <HtmlEmbed src="d3-ablation-workflow.html" title="Ablation example"/>
38
 
39
- For base models, one would typically resort to selecting standard benchmark tasks used by other model builders (think the classic list of benchmarks that are always reported when a new model is released - we'll have a look at those below). For a specific use case, you can either use existing evaluation tasks if they are available -- and you likely will want to take a good look if they are not "standard" -- or design your own (discussed below). As you will likely run a lot of ablations, you want the evaluation tasks to provide strong enough signal (and not just meaningless noisy results) and you want them to run cheaply and quickly, so that you can iterate fast.
 
40
  Through ablations, we are also able to predict the performance of bigger models based on the performance of smaller ones, using scaling laws.
41
 
42
  Besides ablations for experiments, you will likely also want to run evaluations on intermediate checkpoints as your model is training, to ensure it is properly learning and improving at the different tasks, and does not start regressing due to spikes or other issues. Finally, you want to evaluate the final checkpoint so that you can announce that your model is SOTA when you release it.
@@ -61,11 +57,6 @@ In [their paper](https://arxiv.org/pdf/2404.02112) about lessons learned on benc
61
 
62
  Similarly to model builders hillclimbing a specific capability, for less common topics, you might need to think about designing your own evaluations, which is detailed in our last section.
63
 
64
- <Note title="Takeaways" emoji="🎯" variant="info">
65
- - Model builder: You need fast, high-signal benchmarks that cover the domains/capabilities you care about and can be run repeatedly during ablations.
66
- - Model user: You need benchmarks that match your specific use case, even if that means creating custom ones.
67
- </Note>
68
-
69
  <Note title="What about measuring AGI?">
70
  We are strongly missing any kind of good definitions and framework on what intelligence is for machine learning models, and how to evaluate it (though some people have tried, for example [Chollet](https://arxiv.org/abs/1911.01547) in 2019 and [Hendrycks et al](https://www.agidefinition.ai/paper.pdf) this year). Difficulty in defining intelligence is not a problem specific to machine learning! In human and animal studies, it is also quite hard to define, and metrics which try to provide precise scores (IQ and EQ for example) are hotly debated and controversial, with reason.
71
 
 
9
 
10
  ## What is model evaluation about?
11
 
12
+ As you navigate the world of LLMs — whether you're training or fine-tuning your own models, selecting one for your application, or trying to understand the state of the field — there is one question you have likely asked yourself:
13
 
14
  <Quote>
15
  How can one know if a model is *good*?
 
17
 
18
  The answer is (surprisingly given the blog topic) evaluation! It's everywhere: leaderboards ranking models, benchmarks claiming to measure *reasoning*, *knowledge*, *coding abilities* or *math performance*, papers announcing new state-of-the-art results...
19
 
20
+ But what is it, really? And what can it actually tell you?
21
 
22
+ This guide is here to help you understand evaluation: what it can and cannot do, when to trust different approaches (and what their limitations and biases are!), how to select benchmarks when evaluating a model (and which ones are relevant in 2025), and how to design your own evaluation, if you so want.
23
 
24
+ Throughout the guide, we'll also highlight common pitfalls and tips and tricks from the Open Evals team, and hopefully help you learn how to think critically about the claims made from evaluation results.
 
 
25
 
26
  Before we dive into the details, let's quickly look at why people do evaluation, as who you are and what you are working on will determine which evaluations you need to use.
27
 
 
30
  If you are a researcher or engineer creating a new model, your goal is likely to build a strong model that performs well on a set of tasks. For a base model (training from scratch), you want the model to do well on general tasks, measuring a variety of different capabilities. If you are post-training a base model for a specific use case, you probably care more about the performance on that specific task. The way you measure performance, in either case, is through evaluations.
31
 
32
  As you experiment with different architectures, data mixtures, and training recipes, you want to make sure that your changes (choosing different training data, architecture, parameters, etc) have not "broken" the expected performance for a model of these properties, and possibly even improved it. The way you test for the impact of different design choices is through **ablations**: an ablation is an experiment where you typically train a model under a specific setup, evaluate it on your chosen set of tasks, and compare the results to a baseline model.
 
 
 
33
 
34
+ Therefore, the choice of evaluation tasks is critical for ablations, as they determine what you will be optimizing for as you create your model.
35
+ For base models, one would typically resort to selecting standard benchmark tasks used by other model builders (think the classic list of benchmarks that are always reported when a new model is released). For a specific use case, you can either use existing evaluation tasks if they are available -- and you likely will want to take a good look if they are not "standard" -- or design your own (discussed below). As you will likely run a lot of ablations, you want the evaluation tasks to provide strong enough signal (and not just meaningless noisy results) and you want them to run cheaply and quickly, so that you can iterate fast.
36
  Through ablations, we are also able to predict the performance of bigger models based on the performance of smaller ones, using scaling laws.
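
As a toy illustration of that last point (all numbers below are made up, not a real scaling study): you fit a simple power law to losses measured in small-scale ablations and extrapolate it to a larger parameter count.

```python
import numpy as np
from scipy.optimize import curve_fit

# Made-up (parameter count, eval loss) pairs from hypothetical small-scale ablations.
params = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
losses = np.array([4.10, 3.72, 3.35, 3.05, 2.80])

def power_law(n, a, alpha, c):
    # L(N) = a * N^(-alpha) + c, a common functional form for scaling curves
    return a * n ** (-alpha) + c

(a, alpha, c), _ = curve_fit(power_law, params, losses, p0=(10.0, 0.1, 2.0), maxfev=10000)
print(f"fitted alpha={alpha:.3f}, irreducible loss c={c:.2f}")
print(f"predicted loss at 7B params: {power_law(7e9, a, alpha, c):.2f}")
```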
37
 
38
  Besides ablations for experiments, you will likely also want to run evaluations on intermediate checkpoints as your model is training, to ensure it is properly learning and improving at the different tasks, and does not start regressing due to spikes or other issues. Finally, you want to evaluate the final checkpoint so that you can announce that your model is SOTA when you release it.
 
57
 
58
  Similarly to model builders hillclimbing a specific capability, for less common topics, you might need to think about designing your own evaluations, which is detailed in our last section.
59
 
 
 
 
 
 
60
  <Note title="What about measuring AGI?">
61
  We are strongly missing any kind of good definitions and framework on what intelligence is for machine learning models, and how to evaluate it (though some people have tried, for example [Chollet](https://arxiv.org/abs/1911.01547) in 2019 and [Hendrycks et al](https://www.agidefinition.ai/paper.pdf) this year). Difficulty in defining intelligence is not a problem specific to machine learning! In human and animal studies, it is also quite hard to define, and metrics which try to provide precise scores (IQ and EQ for example) are hotly debated and controversial, with reason.
62
 
app/src/content/chapters/troubleshooting/troubleshooting-reproducibility.mdx CHANGED
@@ -60,7 +60,7 @@ We did some experiments on this (you'll see up to a 7 points difference for the
60
 
61
  *Evaluation on MMLU subsets, acc_norm score (seed 0), in 5-shot.*
62
 
63
- <HtmlEmbed frameless src="d3-mmlu-heatmap.html" />
64
 
65
  This [great paper](https://arxiv.org/abs/2407.07890)⭐ also highlights a side effect of this: a number of models are now trained to overfit benchmark prompts and answer formats, to the cost of adaptation to other prompts at evaluation time.
66
  </Note>
 
60
 
61
  *Evaluation on MMLU subsets, acc_norm score (seed 0), in 5-shot.*
62
 
63
+ <HtmlEmbed src="d3-mmlu-heatmap.html" />
64
 
65
  This [great paper](https://arxiv.org/abs/2407.07890)⭐ also highlights a side effect of this: a number of models are now trained to overfit benchmark prompts and answer formats, to the cost of adaptation to other prompts at evaluation time.
66
  </Note>
app/src/content/embeds/banner.html CHANGED
@@ -1,5 +1,4 @@
1
  <div class="d3-leaderboard-chart-wrapper" style="width:100%;margin:10px 0;padding:10px 5px 5px 5px;border-radius:8px;background:var(--surface-bg);border:1px solid var(--border-color);position:relative;">
2
- <h3 class="d3-chart-title" style="margin:10px 0 15px 15px;font-size:16px;font-weight:600;color:var(--text-color);opacity:0.9;white-space:nowrap;text-align:left;display:block;width:100%;">The benchmark lifecycle</h3>
3
  <div class="d3-leaderboard-chart" style="width:100%;aspect-ratio:2.8/1;min-height:320px;"></div>
4
  </div>
5
  <style>
@@ -312,8 +311,8 @@
312
  infoIcon = document.createElement('div');
313
  infoIcon.className = 'd3-info-icon';
314
  infoIcon.innerHTML = `
315
- <svg width="20" height="20" viewBox="0 0 20 20" fill="none" xmlns="http://www.w3.org/2000/svg">
316
- <path d="M8 6C8 4.89543 8.89543 4 10 4C11.1046 4 12 4.89543 12 6C12 7.10457 11.1046 8 10 8V10M10 14H10.01" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"/>
317
  </svg>
318
  `;
319
  wrapper.appendChild(infoIcon);
@@ -327,18 +326,17 @@
327
  <div style="font-weight: 600; margin-bottom: 10px; color: var(--text-color); font-size: 13px; text-align: left;">About this chart</div>
328
  <div style="color: var(--text-color); font-size: 12px; line-height: 1.6; text-align: left;">
329
  <p style="margin: 0 0 10px 0; text-align: left;">
330
- This visualization tracks the evolution of top benchmark scores over time across 3 leaderboards managed by Hugging Face
331
- through the years: the Open LLM Leaderboard 1, 2, and the GAIA leaderboard.
332
  The step-like lines represent the progression of maximum scores achieved for each benchmark, with circular markers
333
- indicating when a new record was set. It illustrates a phenomenon known as saturation.
334
  </p>
335
  <p style="margin: 0 0 10px 0; text-align: left;">
336
- The gray scatter plot in the background shows the average scores of all evaluated models for a given leaderboard
337
- at a given time, and allows to follow the trend of submission for each leaderboard.
338
  </p>
339
  <p style="margin: 0; text-align: left;">
340
  Benchmarks are grouped by category (Reasoning & Commonsense, Knowledge, Math, Agentic, and Instruction following),
341
- with each group sharing a color family.
342
  </p>
343
  </div>
344
  `;
@@ -736,6 +734,7 @@
736
  .attr('stroke-width', 1)
737
  .attr('stroke-dasharray', '2,2');
738
 
 
739
  // Line generator - courbe en escalier (step) pour afficher des seuils successifs
740
  // La ligne reste constante jusqu'au prochain point
741
  const line = d3.line()
@@ -771,8 +770,6 @@
771
  g.selectAll(`.legend-${displayName}`).style('opacity', 0.3);
772
  }
773
  });
774
- // Ghost aussi les nuages de points
775
- g.selectAll('.scatter-point').style('opacity', 0.1);
776
  };
777
 
778
  const resetHighlight = () => {
@@ -782,8 +779,6 @@
782
  const displayName = benchmark === 'MMLU_new' ? 'MMLU-Pro' : benchmark;
783
  g.selectAll(`.legend-${displayName}`).style('opacity', 1);
784
  });
785
- // Réinitialiser aussi les nuages de points
786
- g.selectAll('.scatter-point').style('opacity', 1);
787
  };
788
 
789
  // Ajouter le nuage de points EN PREMIER (en dessous de tout)
@@ -939,7 +934,7 @@
939
  .style('color', 'var(--text-color)')
940
  .style('opacity', '0.8')
941
  .style('margin-bottom', '8px')
942
- .text('Domains');
943
 
944
  const legendDiv = legendWrapper.append('xhtml:div')
945
  .style('display', 'flex')
@@ -1045,31 +1040,11 @@
1045
  .style('left', `${left}px`)
1046
  .style('top', `${top}px`);
1047
 
1048
- // Highlight TOUS les benchmarks du groupe en même temps
1049
- // D'abord, obtenir les clés de données pour tous les benchmarks du groupe
1050
- const groupBenchmarkKeys = group.benchmarks.map(benchmark => {
1051
- return benchmark === 'MMLU-Pro' ? 'MMLU_new' : benchmark;
1052
- });
1053
-
1054
- // Mettre en évidence tous les benchmarks du groupe
1055
- benchmarks.forEach(benchmark => {
1056
- const displayName = benchmark === 'MMLU_new' ? 'MMLU-Pro' : benchmark;
1057
- const isInGroup = groupBenchmarkKeys.includes(benchmark);
1058
-
1059
- if (isInGroup) {
1060
- // Mettre en évidence la ligne sélectionnée
1061
- g.selectAll(`.line-${benchmark}`).style('opacity', 1).attr('stroke-width', 3);
1062
- g.selectAll(`.marker-${benchmark}`).style('opacity', 1);
1063
- g.selectAll(`.legend-${displayName}`).style('opacity', 1);
1064
- } else {
1065
- // Ghost les autres lignes
1066
- g.selectAll(`.line-${benchmark}`).style('opacity', 0.15);
1067
- g.selectAll(`.marker-${benchmark}`).style('opacity', 0.15);
1068
- g.selectAll(`.legend-${displayName}`).style('opacity', 0.3);
1069
- }
1070
  });
1071
- // Ghost aussi les nuages de points
1072
- g.selectAll('.scatter-point').style('opacity', 0.1);
1073
  }).on('mouseleave', function() {
1074
  d3.select(legendTooltip).style('opacity', '0');
1075
  resetHighlight();
 
1
  <div class="d3-leaderboard-chart-wrapper" style="width:100%;margin:10px 0;padding:10px 5px 5px 5px;border-radius:8px;background:var(--surface-bg);border:1px solid var(--border-color);position:relative;">
 
2
  <div class="d3-leaderboard-chart" style="width:100%;aspect-ratio:2.8/1;min-height:320px;"></div>
3
  </div>
4
  <style>
 
311
  infoIcon = document.createElement('div');
312
  infoIcon.className = 'd3-info-icon';
313
  infoIcon.innerHTML = `
314
+ <svg width="16" height="16" viewBox="0 0 16 16" fill="none" xmlns="http://www.w3.org/2000/svg">
315
+ <path d="M8 6V8M8 10H8.01" stroke="currentColor" stroke-width="1.5" stroke-linecap="round"/>
316
  </svg>
317
  `;
318
  wrapper.appendChild(infoIcon);
 
326
  <div style="font-weight: 600; margin-bottom: 10px; color: var(--text-color); font-size: 13px; text-align: left;">About this chart</div>
327
  <div style="color: var(--text-color); font-size: 12px; line-height: 1.6; text-align: left;">
328
  <p style="margin: 0 0 10px 0; text-align: left;">
329
+ This visualization tracks the evolution of top benchmark scores over time across multiple evaluation frameworks.
 
330
  The step-like lines represent the progression of maximum scores achieved for each benchmark, with circular markers
331
+ indicating when a new record was set.
332
  </p>
333
  <p style="margin: 0 0 10px 0; text-align: left;">
334
+ The gray scatter plot in the background shows the average scores of all evaluated models, providing context for
335
+ the top performers. Each point represents a model's average performance across all benchmarks at a given time.
336
  </p>
337
  <p style="margin: 0; text-align: left;">
338
  Benchmarks are grouped by category (Reasoning & Commonsense, Knowledge, Math, Agentic, and Instruction following),
339
+ with each group sharing a color family. Variations within a group use different shades of the same base color.
340
  </p>
341
  </div>
342
  `;
 
734
  .attr('stroke-width', 1)
735
  .attr('stroke-dasharray', '2,2');
736
 
737
+
738
  // Line generator - courbe en escalier (step) pour afficher des seuils successifs
739
  // La ligne reste constante jusqu'au prochain point
740
  const line = d3.line()
 
770
  g.selectAll(`.legend-${displayName}`).style('opacity', 0.3);
771
  }
772
  });
773
  };
774
 
775
  const resetHighlight = () => {
 
779
  const displayName = benchmark === 'MMLU_new' ? 'MMLU-Pro' : benchmark;
780
  g.selectAll(`.legend-${displayName}`).style('opacity', 1);
781
  });
 
  };
783
 
784
  // Ajouter le nuage de points EN PREMIER (en dessous de tout)
 
934
  .style('color', 'var(--text-color)')
935
  .style('opacity', '0.8')
936
  .style('margin-bottom', '8px')
937
+ .text('Legend');
938
 
939
  const legendDiv = legendWrapper.append('xhtml:div')
940
  .style('display', 'flex')
 
1040
  .style('left', `${left}px`)
1041
  .style('top', `${top}px`);
1042
 
1043
+ // Highlight tous les benchmarks du groupe
1044
+ group.benchmarks.forEach(benchmark => {
1045
+ const displayName = benchmark;
1046
+ highlightBenchmark(displayName);
 
1047
  });
1048
  }).on('mouseleave', function() {
1049
  d3.select(legendTooltip).style('opacity', '0');
1050
  resetHighlight();
app/src/content/embeds/d3-ablation-workflow.html DELETED
@@ -1,474 +0,0 @@
1
- <div class="d3-ablation-workflow"></div>
2
-
3
- <style>
4
- .d3-ablation-workflow {
5
- font-family: var(--default-font-family);
6
- background: transparent;
7
- border: none;
8
- border-radius: 0;
9
- padding: var(--spacing-4) 0;
10
- width: 100%;
11
- margin: 0 auto;
12
- position: relative;
13
- }
14
-
15
- .d3-ablation-workflow svg {
16
- width: 100%;
17
- height: auto;
18
- display: block;
19
- }
20
-
21
- .d3-ablation-workflow .stage-box {
22
- stroke-width: 2;
23
- transition: all 0.3s ease;
24
- }
25
-
26
- .d3-ablation-workflow .stage-box:hover {
27
- filter: brightness(1.1);
28
- stroke-width: 3;
29
- }
30
-
31
- .d3-ablation-workflow .stage-label {
32
- fill: var(--text-color);
33
- font-size: 12px;
34
- font-weight: 700;
35
- pointer-events: none;
36
- user-select: none;
37
- text-transform: uppercase;
38
- letter-spacing: 0.05em;
39
- }
40
-
41
- .d3-ablation-workflow .item-label {
42
- fill: var(--text-color);
43
- font-size: 11px;
44
- font-weight: 600;
45
- pointer-events: none;
46
- user-select: none;
47
- }
48
-
49
- .d3-ablation-workflow .arrow-line {
50
- fill: none;
51
- stroke-width: 2;
52
- transition: all 0.3s ease;
53
- }
54
-
55
- .d3-ablation-workflow .marker {
56
- opacity: 0.7;
57
- }
58
-
59
- .d3-ablation-workflow .training-curve {
60
- fill: none;
61
- stroke-width: 2;
62
- transition: all 0.3s ease;
63
- }
64
-
65
- .d3-ablation-workflow .score-bar {
66
- transition: all 0.3s ease;
67
- }
68
-
69
- .d3-ablation-workflow .score-bar:hover {
70
- filter: brightness(1.15);
71
- }
72
-
73
- .d3-ablation-workflow .score-text {
74
- fill: var(--text-color);
75
- font-size: 10px;
76
- font-weight: 600;
77
- pointer-events: none;
78
- user-select: none;
79
- }
80
-
81
- .d3-ablation-workflow .axis-label {
82
- fill: var(--muted-color);
83
- font-size: 9px;
84
- font-weight: 500;
85
- pointer-events: none;
86
- user-select: none;
87
- }
88
-
89
- .d3-ablation-workflow .legend-text {
90
- font-size: 13px;
91
- line-height: 1.6;
92
- color: var(--text-color);
93
- text-align: center;
94
- margin-top: var(--spacing-3);
95
- padding: 0 var(--spacing-4);
96
- }
97
-
98
- .d3-ablation-workflow .d3-tooltip {
99
- position: absolute;
100
- background: var(--surface-bg);
101
- border: 1px solid var(--border-color);
102
- border-radius: 8px;
103
- padding: 8px 10px;
104
- font-size: 12px;
105
- pointer-events: none;
106
- opacity: 0;
107
- transition: opacity 0.12s ease;
108
- box-shadow: 0 4px 12px rgba(0, 0, 0, 0.15);
109
- z-index: 1000;
110
- max-width: 300px;
111
- line-height: 1.35;
112
- white-space: pre-line;
113
- color: var(--text-color);
114
- transform: translate(-9999px, -9999px);
115
- }
116
-
117
- @media (max-width: 768px) {
118
- .d3-ablation-workflow .stage-label {
119
- font-size: 10px;
120
- }
121
-
122
- .d3-ablation-workflow .item-label {
123
- font-size: 10px;
124
- }
125
-
126
- .d3-ablation-workflow .score-text {
127
- font-size: 9px;
128
- }
129
- }
130
- </style>
131
-
132
- <script>
133
- (() => {
134
- const ensureD3 = (cb) => {
135
- if (window.d3 && typeof window.d3.select === 'function') return cb();
136
- let s = document.getElementById('d3-cdn-script');
137
- if (!s) {
138
- s = document.createElement('script');
139
- s.id = 'd3-cdn-script';
140
- s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
141
- document.head.appendChild(s);
142
- }
143
- const onReady = () => {
144
- if (window.d3 && typeof window.d3.select === 'function') cb();
145
- };
146
- s.addEventListener('load', onReady, { once: true });
147
- if (window.d3) onReady();
148
- };
149
-
150
- const bootstrap = () => {
151
- const scriptEl = document.currentScript;
152
- let container = scriptEl ? scriptEl.previousElementSibling : null;
153
- if (!(container && container.classList && container.classList.contains('d3-ablation-workflow'))) {
154
- const candidates = Array.from(document.querySelectorAll('.d3-ablation-workflow'))
155
- .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
156
- container = candidates[candidates.length - 1] || null;
157
- }
158
-
159
- if (!container) return;
160
-
161
- if (container.dataset) {
162
- if (container.dataset.mounted === 'true') return;
163
- container.dataset.mounted = 'true';
164
- }
165
-
166
- container.style.position = container.style.position || 'relative';
167
-
168
- // Get colors from ColorPalettes or fallback
169
- const getColors = () => {
170
- if (window.ColorPalettes && typeof window.ColorPalettes.getColors === 'function') {
171
- return window.ColorPalettes.getColors('categorical', 3);
172
- }
173
- return ['#1f77b4', '#ff7f0e', '#2ca02c'];
174
- };
175
-
176
- // Data for two ablations: Wikipedia vs Reddit
177
- const ablations = [
178
- {
179
- id: 'wiki',
180
- name: 'Wikipedia',
181
- color_idx: 0,
182
- trainingData: [
183
- { step: 0, loss: 4.5 },
184
- { step: 1000, loss: 3.2 },
185
- { step: 2000, loss: 2.4 },
186
- { step: 3000, loss: 1.9 },
187
- { step: 4000, loss: 1.5 },
188
- { step: 5000, loss: 1.3 }
189
- ],
190
- finalScore: 72
191
- },
192
- {
193
- id: 'reddit',
194
- name: 'Reddit',
195
- color_idx: 1,
196
- trainingData: [
197
- { step: 0, loss: 4.5 },
198
- { step: 1000, loss: 3.5 },
199
- { step: 2000, loss: 2.8 },
200
- { step: 3000, loss: 2.3 },
201
- { step: 4000, loss: 2.0 },
202
- { step: 5000, loss: 1.8 }
203
- ],
204
- finalScore: 65
205
- }
206
- ];
207
-
208
- const svg = d3.select(container).append('svg');
209
- const g = svg.append('g');
210
-
211
- // Add legend text below the chart
212
- const legendDiv = document.createElement('div');
213
- legendDiv.className = 'legend-text';
214
- legendDiv.textContent = 'Say you want to compare dataset A and dataset B (for example, Wikipedia vs Reddit) to see how they affect model performance. You train models under the same setups on each, then evaluate and compare the scores on benchmarks.';
215
- container.appendChild(legendDiv);
216
-
217
- // Arrow markers
218
- const defs = svg.append('defs');
219
- getColors().forEach((color, i) => {
220
- defs.append('marker')
221
- .attr('id', `arrow-ablation-${i}`)
222
- .attr('viewBox', '0 -5 10 10')
223
- .attr('refX', 9)
224
- .attr('refY', 0)
225
- .attr('markerWidth', 10)
226
- .attr('markerHeight', 10)
227
- .attr('orient', 'auto')
228
- .append('path')
229
- .attr('d', 'M0,-5L10,0L0,5')
230
- .attr('fill', color)
231
- .attr('fill-opacity', 0.8);
232
- });
233
-
234
- // Big arrow marker for the right side
235
- defs.append('marker')
236
- .attr('id', 'arrow-big')
237
- .attr('viewBox', '0 -5 10 10')
238
- .attr('refX', 9)
239
- .attr('refY', 0)
240
- .attr('markerWidth', 10)
241
- .attr('markerHeight', 10)
242
- .attr('orient', 'auto')
243
- .append('path')
244
- .attr('d', 'M0,-5L10,0L0,5')
245
- .attr('fill', 'var(--primary-color)')
246
- .attr('fill-opacity', 0.8);
247
-
248
- let width = 800;
249
- let height = 400;
250
-
251
- // Icons as SVG paths
252
- const iconPaths = {
253
- database: 'M12 2C6.48 2 2 5.02 2 8.5V15.5C2 18.98 6.48 22 12 22C17.52 22 22 18.98 22 15.5V8.5C22 5.02 17.52 2 12 2ZM12 4C16.42 4 20 6.24 20 8.5C20 10.76 16.42 13 12 13C7.58 13 4 10.76 4 8.5C4 6.24 7.58 4 12 4ZM4 11.03C5.89 12.33 8.78 13 12 13C15.22 13 18.11 12.33 20 11.03V15.5C20 17.76 16.42 20 12 20C7.58 20 4 17.76 4 15.5V11.03Z',
254
- chart: 'M3 13h2v7H3v-7zm4-6h2v13H7V7zm4-4h2v17h-2V3zm4 8h2v9h-2v-9z'
255
- };
256
-
257
- // Function to draw a simple neural network schematic
258
- function drawModelSchematic(g, x, y, size, color) {
259
- const layers = [3, 4, 3]; // neurons per layer
260
- const layerSpacing = size / 3;
261
- const neuronRadius = size / 25;
262
-
263
- layers.forEach((neuronsCount, layerIdx) => {
264
- const layerX = x + layerIdx * layerSpacing;
265
- const neuronSpacing = size / (neuronsCount + 1);
266
-
267
- for (let i = 0; i < neuronsCount; i++) {
268
- const neuronY = y + (i + 1) * neuronSpacing;
269
-
270
- // Draw connections to next layer
271
- if (layerIdx < layers.length - 1) {
272
- const nextLayerX = x + (layerIdx + 1) * layerSpacing;
273
- const nextNeuronSpacing = size / (layers[layerIdx + 1] + 1);
274
-
275
- for (let j = 0; j < layers[layerIdx + 1]; j++) {
276
- const nextNeuronY = y + (j + 1) * nextNeuronSpacing;
277
- g.append('line')
278
- .attr('x1', layerX)
279
- .attr('y1', neuronY)
280
- .attr('x2', nextLayerX)
281
- .attr('y2', nextNeuronY)
282
- .attr('stroke', color)
283
- .attr('stroke-width', 0.5)
284
- .attr('opacity', 0.3);
285
- }
286
- }
287
-
288
- // Draw neuron
289
- g.append('circle')
290
- .attr('cx', layerX)
291
- .attr('cy', neuronY)
292
- .attr('r', neuronRadius)
293
- .attr('fill', color)
294
- .attr('opacity', 0.8);
295
- }
296
- });
297
- }
298
-
299
- function render() {
300
- width = container.clientWidth || 800;
301
- height = Math.max(300, Math.round(width * 0.45));
302
-
303
- svg.attr('width', width).attr('height', height);
304
-
305
- const margin = { top: 40, right: 20, bottom: 20, left: 20 };
306
- const innerWidth = width - margin.left - margin.right;
307
- const innerHeight = height - margin.top - margin.bottom;
308
-
309
- g.attr('transform', `translate(${margin.left},${margin.top})`);
310
-
311
- // Clear previous content
312
- g.selectAll('*').remove();
313
-
314
- const colors = getColors();
315
-
316
- // Three columns: Data, Training, Scores
317
- const colWidth = innerWidth / 3;
318
- const col1X = colWidth * 0.5;
319
- const col2X = colWidth * 1.5;
320
- const col3X = colWidth * 2.5;
321
-
322
- // Stage titles
323
- g.selectAll('.stage-label')
324
- .data([
325
- { x: col1X, label: 'DATA' },
326
- { x: col2X, label: 'TRAINING' },
327
- { x: col3X, label: 'EVALUATION' }
328
- ])
329
- .join('text')
330
- .attr('class', 'stage-label')
331
- .attr('x', d => d.x)
332
- .attr('y', -20)
333
- .attr('text-anchor', 'middle')
334
- .text(d => d.label);
335
-
336
- // Column 1: Data icons
337
- const dataY = innerHeight * 0.3;
338
- const dataSpacing = innerHeight * 0.35;
339
-
340
- ablations.forEach((abl, i) => {
341
- const y = dataY + i * dataSpacing;
342
- const iconSize = 30;
343
- const boxPadding = 10;
344
-
345
- // Data box
346
- const dataGroup = g.append('g')
347
- .attr('transform', `translate(${col1X - iconSize / 2 - boxPadding},${y - iconSize / 2 - boxPadding})`);
348
-
349
- dataGroup.append('rect')
350
- .attr('class', 'stage-box')
351
- .attr('width', iconSize + boxPadding * 2)
352
- .attr('height', iconSize + boxPadding * 2)
353
- .attr('rx', 8)
354
- .attr('fill', colors[abl.color_idx])
355
- .attr('fill-opacity', 0.15)
356
- .attr('stroke', colors[abl.color_idx]);
357
-
358
- // Database icon
359
- dataGroup.append('path')
360
- .attr('d', iconPaths.database)
361
- .attr('transform', `translate(${boxPadding},${boxPadding}) scale(${iconSize / 24})`)
362
- .attr('fill', colors[abl.color_idx]);
363
-
364
- // Label below
365
- g.append('text')
366
- .attr('class', 'item-label')
367
- .attr('x', col1X)
368
- .attr('y', y + iconSize + boxPadding + 15)
369
- .attr('text-anchor', 'middle')
370
- .attr('fill', colors[abl.color_idx])
371
- .text(abl.name);
372
- });
373
-
374
- // Column 2: Model schematics for training
375
- const modelSize = Math.min(80, colWidth * 0.4);
376
-
377
- ablations.forEach((abl, i) => {
378
- const y = dataY + i * dataSpacing;
379
- const modelX = col2X - modelSize / 2.5;
380
- const modelY = y - modelSize / 2;
381
-
382
- // Draw model schematic
383
- const modelGroup = g.append('g');
384
-
385
- drawModelSchematic(modelGroup, modelX, modelY, modelSize, colors[abl.color_idx]);
386
- });
387
-
388
- // Column 3: Final scores (bar chart)
389
- const barWidth = 40;
390
- const barMaxHeight = innerHeight * 0.6;
391
- const barY = innerHeight * 0.7;
392
-
393
- const scoreScale = d3.scaleLinear()
394
- .domain([0, 100])
395
- .range([0, barMaxHeight]);
396
-
397
- ablations.forEach((abl, i) => {
398
- const x = col3X - (ablations.length * barWidth) / 2 + i * barWidth + barWidth / 2;
399
- const barHeight = scoreScale(abl.finalScore);
400
-
401
- // Bar
402
- g.append('rect')
403
- .attr('class', 'score-bar')
404
- .attr('x', x - barWidth / 2 + 5)
405
- .attr('y', barY - barHeight)
406
- .attr('width', barWidth - 10)
407
- .attr('height', barHeight)
408
- .attr('rx', 4)
409
- .attr('fill', colors[abl.color_idx])
410
- .attr('fill-opacity', 0.7);
411
-
412
- // Score text
413
- g.append('text')
414
- .attr('class', 'score-text')
415
- .attr('x', x)
416
- .attr('y', barY - barHeight - 5)
417
- .attr('text-anchor', 'middle')
418
- .attr('fill', colors[abl.color_idx])
419
- .text(`${abl.finalScore}%`);
420
- });
421
-
422
- // Draw arrows connecting stages
423
- const iconSize = 30;
424
- const boxPadding = 10;
425
-
426
- // Left side: Individual arrows from data to models (with arrowheads)
427
- // Stop the arrows 15px before the model to avoid covering the neural net
428
- ablations.forEach((abl, i) => {
429
- const y = dataY + i * dataSpacing;
430
- const dataEndX = col1X + iconSize / 2 + boxPadding;
431
- const modelStartX = col2X - modelSize / 2 - 5;
432
-
433
- g.append('path')
434
- .attr('class', 'arrow-line')
435
- .attr('d', `M ${dataEndX} ${y} L ${modelStartX} ${y}`)
436
- .attr('stroke', colors[abl.color_idx])
437
- .attr('stroke-width', 3)
438
- .attr('stroke-opacity', 0.5)
439
- .attr('marker-end', `url(#arrow-ablation-${abl.color_idx})`);
440
- });
441
-
442
- // Right side: Single big arrow from training column to evaluation column
443
- const modelEndX = col2X + modelSize / 2;
444
- const evalStartX = col3X - (ablations.length * barWidth) / 2 - 20;
445
- const arrowY = (dataY + dataY + (ablations.length - 1) * dataSpacing) / 2; // Middle between all items
446
-
447
- g.append('path')
448
- .attr('class', 'arrow-line')
449
- .attr('d', `M ${modelEndX} ${arrowY} L ${evalStartX} ${arrowY}`)
450
- .attr('stroke', 'var(--primary-color)')
451
- .attr('stroke-width', 4)
452
- .attr('stroke-opacity', 0.6)
453
- .attr('marker-end', 'url(#arrow-big)');
454
-
455
- }
456
-
457
- render();
458
-
459
- // Responsive handling
460
- if (window.ResizeObserver) {
461
- const ro = new ResizeObserver(() => render());
462
- ro.observe(container);
463
- } else {
464
- window.addEventListener('resize', render);
465
- }
466
- };
467
-
468
- if (document.readyState === 'loading') {
469
- document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
470
- } else {
471
- ensureD3(bootstrap);
472
- }
473
- })();
474
- </script>
app/src/content/embeds/d3-human-biases.html DELETED
@@ -1,352 +0,0 @@
1
- <div class="d3-human-biases"></div>
2
-
3
- <style>
4
- .d3-human-biases {
5
- font-family: var(--default-font-family);
6
- background: transparent !important;
7
- border: none !important;
8
- border-radius: 0 !important;
9
- padding: var(--spacing-4) 0;
10
- width: 100%;
11
- margin: 0 auto;
12
- position: relative;
13
- box-shadow: none !important;
14
- }
15
-
16
- .d3-human-biases svg {
17
- width: 100%;
18
- height: auto;
19
- display: block;
20
- }
21
-
22
- .d3-human-biases .card-rect {
23
- stroke-width: 2;
24
- transition: all 0.3s ease;
25
- }
26
-
27
- .d3-human-biases .bias-title {
28
- fill: var(--text-color);
29
- font-size: 12px;
30
- font-weight: 700;
31
- }
32
-
33
- .d3-human-biases .bias-description {
34
- fill: var(--text-color);
35
- font-size: 10px;
36
- font-weight: 400;
37
- line-height: 1.4;
38
- }
39
-
40
- .d3-human-biases .header-text {
41
- fill: var(--text-color);
42
- font-size: 12px;
43
- font-weight: 700;
44
- text-transform: uppercase;
45
- letter-spacing: 0.05em;
46
- }
47
-
48
- .d3-human-biases .example-label {
49
- fill: var(--muted-color);
50
- font-size: 9px;
51
- font-weight: 600;
52
- text-transform: uppercase;
53
- letter-spacing: 0.05em;
54
- }
55
-
56
- @media (max-width: 768px) {
57
- .d3-human-biases .bias-title {
58
- font-size: 10px;
59
- }
60
-
61
- .d3-human-biases .bias-description {
62
- font-size: 9px;
63
- }
64
- }
65
- </style>
66
-
67
- <script>
68
- (() => {
69
- const ensureD3 = (cb) => {
70
- if (window.d3 && typeof window.d3.select === 'function') return cb();
71
- let s = document.getElementById('d3-cdn-script');
72
- if (!s) {
73
- s = document.createElement('script');
74
- s.id = 'd3-cdn-script';
75
- s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
76
- document.head.appendChild(s);
77
- }
78
- const onReady = () => {
79
- if (window.d3 && typeof window.d3.select === 'function') cb();
80
- };
81
- s.addEventListener('load', onReady, { once: true });
82
- if (window.d3) onReady();
83
- };
84
-
85
- const bootstrap = () => {
86
- const scriptEl = document.currentScript;
87
- let container = scriptEl ? scriptEl.previousElementSibling : null;
88
- if (!(container && container.classList && container.classList.contains('d3-human-biases'))) {
89
- const candidates = Array.from(document.querySelectorAll('.d3-human-biases'))
90
- .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
91
- container = candidates[candidates.length - 1] || null;
92
- }
93
-
94
- if (!container) return;
95
-
96
- if (container.dataset) {
97
- if (container.dataset.mounted === 'true') return;
98
- container.dataset.mounted = 'true';
99
- }
100
-
101
- // Get colors from ColorPalettes or fallback
102
- const getColors = () => {
103
- if (window.ColorPalettes && typeof window.ColorPalettes.getColors === 'function') {
104
- return window.ColorPalettes.getColors('categorical', 4);
105
- }
106
- return ['#e74c3c', '#3498db', '#9b59b6', '#f39c12'];
107
- };
108
-
109
- // Human evaluation biases
110
- const biases = [
111
- {
112
- id: 'first-impression',
113
- title: 'First Impressions',
114
- description: 'Quality estimated from first impressions rather than actual content',
115
- example: 'Well-formatted answer rated higher despite errors',
116
- reference: 'arxiv.org/abs/2309.16349'
117
- },
118
- {
119
- id: 'tone',
120
- title: 'Tone Bias',
121
- description: 'Underestimation of the number of factual or logical errors in an assertive answer',
122
- example: 'Assertive wrong answer > Neutral correct answer',
123
- reference: 'arxiv.org/abs/2309.16349'
124
- },
125
- {
126
- id: 'self-preference',
127
- title: 'Self-Preference',
128
- description: 'Preference for answers aligning with own views, opinons and beliefs',
129
- example: 'Personal beliefs > Factual correctness',
130
- reference: 'arxiv.org/abs/2310.13548'
131
- },
132
- {
133
- id: 'identity',
134
- title: 'Identity Bias',
135
- description: 'Different identity groups rate answers differently',
136
- example: 'Varied toxicity ratings across demographics',
137
- reference: 'arxiv.org/abs/2205.00501',
138
- reference2: 'arxiv.org/abs/2404.16019'
139
- }
140
- ];
141
-
142
- const svg = d3.select(container).append('svg');
143
- const g = svg.append('g');
144
-
145
- let width = 800;
146
- let height = 300;
147
-
148
- // Helper function to wrap text
149
- function wrapText(text, width) {
150
- text.each(function() {
151
- const text = d3.select(this);
152
- const words = text.text().split(/\s+/).reverse();
153
- let word;
154
- let line = [];
155
- let lineNumber = 0;
156
- const lineHeight = 1.3;
157
- const y = text.attr('y');
158
- const x = text.attr('x');
159
- const dy = parseFloat(text.attr('dy') || 0);
160
- let tspan = text.text(null).append('tspan')
161
- .attr('x', x)
162
- .attr('y', y)
163
- .attr('dy', dy + 'em');
164
-
165
- while ((word = words.pop())) {
166
- line.push(word);
167
- tspan.text(line.join(' '));
168
- if (tspan.node().getComputedTextLength() > width) {
169
- line.pop();
170
- tspan.text(line.join(' '));
171
- line = [word];
172
- tspan = text.append('tspan')
173
- .attr('x', x)
174
- .attr('y', y)
175
- .attr('dy', ++lineNumber * lineHeight + dy + 'em')
176
- .text(word);
177
- }
178
- }
179
- });
180
- }
181
-
182
- function render() {
183
- width = container.clientWidth || 800;
184
- height = Math.max(320, Math.round(width * 0.4));
185
-
186
- svg.attr('width', width).attr('height', height);
187
-
188
- const margin = { top: 40, right: 20, bottom: 20, left: 20 };
189
- const innerWidth = width - margin.left - margin.right;
190
- const innerHeight = height - margin.top - margin.bottom;
191
-
192
- g.attr('transform', `translate(${margin.left},${margin.top})`);
193
-
194
- // Clear previous content
195
- g.selectAll('*').remove();
196
-
197
- const colors = getColors();
198
-
199
- // Header
200
- g.append('text')
201
- .attr('class', 'header-text')
202
- .attr('x', innerWidth / 2)
203
- .attr('y', -15)
204
- .attr('text-anchor', 'middle')
205
- .text('HUMAN EVALUATION BIASES');
206
-
207
- // Calculate card dimensions - 2x2 grid
208
- const cols = 2;
209
- const rows = 2;
210
- const cardSpacingX = Math.min(20, innerWidth * 0.03);
211
- const cardSpacingY = Math.min(15, innerHeight * 0.05);
212
- const cardWidth = (innerWidth - cardSpacingX * (cols - 1)) / cols;
213
- const cardHeight = (innerHeight - cardSpacingY * (rows - 1)) / rows;
214
-
215
- // Draw cards in 2x2 grid
216
- biases.forEach((bias, i) => {
217
- const col = i % cols;
218
- const row = Math.floor(i / cols);
219
- const x = col * (cardWidth + cardSpacingX);
220
- const y = row * (cardHeight + cardSpacingY);
221
-
222
- const cardGroup = g.append('g')
223
- .attr('transform', `translate(${x},${y})`);
224
-
225
- // Card background with frame
226
- cardGroup.append('rect')
227
- .attr('class', 'card-rect')
228
- .attr('width', cardWidth)
229
- .attr('height', cardHeight)
230
- .attr('rx', 12)
231
- .attr('fill', colors[i])
232
- .attr('fill-opacity', 0.12)
233
- .attr('stroke', colors[i])
234
- .attr('stroke-opacity', 0.6)
235
- .attr('stroke-width', 2);
236
-
237
- // Title
238
- cardGroup.append('text')
239
- .attr('class', 'bias-title')
240
- .attr('x', cardWidth / 2)
241
- .attr('y', 20)
242
- .attr('text-anchor', 'middle')
243
- .text(bias.title);
244
-
245
- // Description with wrapping
246
- const descText = cardGroup.append('text')
247
- .attr('class', 'bias-description')
248
- .attr('x', cardWidth / 2)
249
- .attr('y', 38)
250
- .attr('text-anchor', 'middle')
251
- .attr('dy', 0)
252
- .text(bias.description);
253
-
254
- wrapText(descText, cardWidth - 20);
255
-
256
- // Example box
257
- const exampleY = cardHeight - 52;
258
- const exampleHeight = 22;
259
-
260
- cardGroup.append('rect')
261
- .attr('x', 8)
262
- .attr('y', exampleY)
263
- .attr('width', cardWidth - 16)
264
- .attr('height', exampleHeight)
265
- .attr('rx', 4)
266
- .attr('fill', colors[i])
267
- .attr('fill-opacity', 0.15)
268
- .attr('stroke', colors[i])
269
- .attr('stroke-width', 1)
270
- .attr('stroke-opacity', 0.4);
271
-
272
- // Example text
273
- const exampleText = cardGroup.append('text')
274
- .attr('class', 'bias-description')
275
- .attr('x', cardWidth / 2)
276
- .attr('y', exampleY + 13)
277
- .attr('text-anchor', 'middle')
278
- .attr('dominant-baseline', 'middle')
279
- .attr('font-size', 9)
280
- .text(bias.example);
281
-
282
- // Reference links (if exist)
283
- if (bias.reference) {
284
- const refLink1 = cardGroup.append('a')
285
- .attr('href', `https://${bias.reference}`)
286
- .attr('target', '_blank')
287
- .attr('rel', 'noopener noreferrer');
288
-
289
- refLink1.append('text')
290
- .attr('class', 'example-label')
291
- .attr('x', cardWidth - 10)
292
- .attr('y', bias.reference2 ? cardHeight - 18 : cardHeight - 8)
293
- .attr('text-anchor', 'end')
294
- .attr('font-size', 8)
295
- .attr('fill', colors[i])
296
- .attr('opacity', 0.7)
297
- .style('cursor', 'pointer')
298
- .style('text-decoration', 'underline')
299
- .text(bias.reference)
300
- .on('mouseenter', function() {
301
- d3.select(this).attr('opacity', 1);
302
- })
303
- .on('mouseleave', function() {
304
- d3.select(this).attr('opacity', 0.7);
305
- });
306
- }
307
-
308
- if (bias.reference2) {
309
- const refLink2 = cardGroup.append('a')
310
- .attr('href', `https://${bias.reference2}`)
311
- .attr('target', '_blank')
312
- .attr('rel', 'noopener noreferrer');
313
-
314
- refLink2.append('text')
315
- .attr('class', 'example-label')
316
- .attr('x', cardWidth - 10)
317
- .attr('y', cardHeight - 8)
318
- .attr('text-anchor', 'end')
319
- .attr('font-size', 8)
320
- .attr('fill', colors[i])
321
- .attr('opacity', 0.7)
322
- .style('cursor', 'pointer')
323
- .style('text-decoration', 'underline')
324
- .text(bias.reference2)
325
- .on('mouseenter', function() {
326
- d3.select(this).attr('opacity', 1);
327
- })
328
- .on('mouseleave', function() {
329
- d3.select(this).attr('opacity', 0.7);
330
- });
331
- }
332
- });
333
- }
334
-
335
- render();
336
-
337
- // Responsive handling
338
- if (window.ResizeObserver) {
339
- const ro = new ResizeObserver(() => render());
340
- ro.observe(container);
341
- } else {
342
- window.addEventListener('resize', render);
343
- }
344
- };
345
-
346
- if (document.readyState === 'loading') {
347
- document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
348
- } else {
349
- ensureD3(bootstrap);
350
- }
351
- })();
352
- </script>
app/src/content/embeds/d3-llm-biases.html DELETED
@@ -1,378 +0,0 @@
1
- <div class="d3-llm-biases"></div>
2
-
3
- <style>
4
- .d3-llm-biases {
5
- font-family: var(--default-font-family);
6
- background: transparent !important;
7
- border: none !important;
8
- border-radius: 0 !important;
9
- padding: var(--spacing-4) 0;
10
- width: 100%;
11
- margin: 0 auto;
12
- position: relative;
13
- box-shadow: none !important;
14
- }
15
-
16
- .d3-llm-biases svg {
17
- width: 100%;
18
- height: auto;
19
- display: block;
20
- }
21
-
22
- .d3-llm-biases .card-rect {
23
- stroke-width: 2;
24
- transition: all 0.3s ease;
25
- }
26
-
27
- .d3-llm-biases .bias-title {
28
- fill: var(--text-color);
29
- font-size: 12px;
30
- font-weight: 700;
31
- }
32
-
33
- .d3-llm-biases .bias-description {
34
- fill: var(--text-color);
35
- font-size: 10px;
36
- font-weight: 400;
37
- line-height: 1.4;
38
- }
39
-
40
- .d3-llm-biases .header-text {
41
- fill: var(--text-color);
42
- font-size: 12px;
43
- font-weight: 700;
44
- text-transform: uppercase;
45
- letter-spacing: 0.05em;
46
- }
47
-
48
- .d3-llm-biases .example-label {
49
- fill: var(--muted-color);
50
- font-size: 9px;
51
- font-weight: 600;
52
- text-transform: uppercase;
53
- letter-spacing: 0.05em;
54
- }
55
-
56
- @media (max-width: 768px) {
57
- .d3-llm-biases .bias-title {
58
- font-size: 10px;
59
- }
60
-
61
- .d3-llm-biases .bias-description {
62
- font-size: 9px;
63
- }
64
- }
65
- </style>
66
-
67
- <script>
68
- (() => {
69
- const ensureD3 = (cb) => {
70
- if (window.d3 && typeof window.d3.select === 'function') return cb();
71
- let s = document.getElementById('d3-cdn-script');
72
- if (!s) {
73
- s = document.createElement('script');
74
- s.id = 'd3-cdn-script';
75
- s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
76
- document.head.appendChild(s);
77
- }
78
- const onReady = () => {
79
- if (window.d3 && typeof window.d3.select === 'function') cb();
80
- };
81
- s.addEventListener('load', onReady, { once: true });
82
- if (window.d3) onReady();
83
- };
84
-
85
- const bootstrap = () => {
86
- const scriptEl = document.currentScript;
87
- let container = scriptEl ? scriptEl.previousElementSibling : null;
88
- if (!(container && container.classList && container.classList.contains('d3-llm-biases'))) {
89
- const candidates = Array.from(document.querySelectorAll('.d3-llm-biases'))
90
- .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
91
- container = candidates[candidates.length - 1] || null;
92
- }
93
-
94
- if (!container) return;
95
-
96
- if (container.dataset) {
97
- if (container.dataset.mounted === 'true') return;
98
- container.dataset.mounted = 'true';
99
- }
100
-
101
- // Get colors from ColorPalettes or fallback
102
- const getColors = () => {
103
- if (window.ColorPalettes && typeof window.ColorPalettes.getColors === 'function') {
104
- return window.ColorPalettes.getColors('categorical', 8);
105
- }
106
- return ['#e74c3c', '#3498db', '#9b59b6', '#f39c12', '#1abc9c', '#e67e22', '#95a5a6', '#34495e'];
107
- };
108
-
109
- // LLM judge biases - first 4 for row 1, remaining 3 for row 2
110
- const biases = [
111
- {
112
- id: 'internal-consistency',
113
- title: 'No Internal Consistency',
114
- description: 'Gives different judgements if prompted multiple times (at T>0)',
115
- reference: null
116
- }, {
117
- id: 'inconsistent-score-range',
118
- title: 'No Consistent Score Ranges',
119
- description: 'Model ranking do not follow a consistent scale (e.g: for a task where scores should be 1, 2, 3, 4, ... 10, the model might score 1, 1, 1, 10, 10 ... 10)',
120
- reference: 'x.com/aparnadhinak/status/1748368364395721128',
121
- reference2: 'github.com/LeonEricsson/llmjudge'
122
- },
123
-
124
- {
125
- id: 'self-preference',
126
- title: 'Self-Preference',
127
- description: 'Judge will favor outputs from similar models when scoring',
128
- reference: 'arxiv.org/abs/2404.13076'
129
- },
130
- {
131
- id: 'input-perturbation',
132
- title: 'Blindness to Input Perturbation',
133
- description: 'If input is perturbed, judges don\'t detect quality drops consistently',
134
- reference: 'arxiv.org/abs/2406.13439'
135
- },
136
- {
137
- id: 'position-bias',
138
- title: 'Position Bias',
139
- description: 'When comparing answers, judge favors specific answer positions (e.g: systematically prefers first or second choice)',
140
- reference: 'arxiv.org/abs/2306.05685'
141
- },
142
- {
143
- id: 'verbosity-bias',
144
- title: 'Verbosity Bias',
145
- description: 'Models prefer more verbose answers',
146
- reference: 'arxiv.org/abs/2404.04475'
147
- },
148
- {
149
- id: 'human-consistency',
150
- title: 'No Consistency With Human Scoring',
151
- description: 'LLM ratings diverge from human ratings',
152
- reference: 'arxiv.org/abs/2308.15812'
153
- },
154
- {
155
- id: 'format-bias',
156
- title: 'Format Bias',
157
- description: 'Judge can\'t judge well when their prompt differs from their training prompt format',
158
- reference: 'arxiv.org/abs/2310.17631'
159
- }
160
- ];
161
-
162
- const svg = d3.select(container).append('svg');
163
- const g = svg.append('g');
164
-
165
- let width = 800;
166
- let height = 300;
167
-
168
- // Helper function to wrap text
169
- function wrapText(text, width) {
170
- text.each(function() {
171
- const text = d3.select(this);
172
- const words = text.text().split(/\s+/).reverse();
173
- let word;
174
- let line = [];
175
- let lineNumber = 0;
176
- const lineHeight = 1.3;
177
- const y = text.attr('y');
178
- const x = text.attr('x');
179
- const dy = parseFloat(text.attr('dy') || 0);
180
- let tspan = text.text(null).append('tspan')
181
- .attr('x', x)
182
- .attr('y', y)
183
- .attr('dy', dy + 'em');
184
-
185
- while ((word = words.pop())) {
186
- line.push(word);
187
- tspan.text(line.join(' '));
188
- if (tspan.node().getComputedTextLength() > width) {
189
- line.pop();
190
- tspan.text(line.join(' '));
191
- line = [word];
192
- tspan = text.append('tspan')
193
- .attr('x', x)
194
- .attr('y', y)
195
- .attr('dy', ++lineNumber * lineHeight + dy + 'em')
196
- .text(word);
197
- }
198
- }
199
- });
200
- }
201
-
202
- function render() {
203
- width = container.clientWidth || 800;
204
- height = Math.max(550, Math.round(width * 0.7));
205
-
206
- svg.attr('width', width).attr('height', height);
207
-
208
- const margin = { top: 40, right: 20, bottom: 20, left: 20 };
209
- const innerWidth = width - margin.left - margin.right;
210
- const innerHeight = height - margin.top - margin.bottom;
211
-
212
- g.attr('transform', `translate(${margin.left},${margin.top})`);
213
-
214
- // Clear previous content
215
- g.selectAll('*').remove();
216
-
217
- const colors = getColors();
218
-
219
- // Header
220
- g.append('text')
221
- .attr('class', 'header-text')
222
- .attr('x', innerWidth / 2)
223
- .attr('y', -15)
224
- .attr('text-anchor', 'middle')
225
- .text('LLM JUDGE BIASES');
226
-
227
- // Calculate card dimensions - 4 rows: 2 cards each
228
- const cols = 2;
229
- const rows = 4;
230
- const cardSpacingX = Math.min(20, innerWidth * 0.03);
231
- const cardSpacingY = Math.min(18, innerHeight * 0.04);
232
- const cardWidth = (innerWidth - cardSpacingX * (cols - 1)) / cols;
233
- const cardHeight = (innerHeight - cardSpacingY * (rows - 1)) / rows;
234
-
235
- // Draw cards in 4 rows (2 + 2 + 2 + 2)
236
- biases.forEach((bias, i) => {
237
- const row = Math.floor(i / 2);
238
- const col = i % 2;
239
-
240
- const x = col * (cardWidth + cardSpacingX);
241
- const y = row * (cardHeight + cardSpacingY);
242
-
243
- const cardGroup = g.append('g')
244
- .attr('transform', `translate(${x},${y})`);
245
-
246
- // Card background with frame
247
- cardGroup.append('rect')
248
- .attr('class', 'card-rect')
249
- .attr('width', cardWidth)
250
- .attr('height', cardHeight)
251
- .attr('rx', 12)
252
- .attr('fill', colors[i])
253
- .attr('fill-opacity', 0.12)
254
- .attr('stroke', colors[i])
255
- .attr('stroke-opacity', 0.6)
256
- .attr('stroke-width', 2);
257
-
258
- // Title
259
- cardGroup.append('text')
260
- .attr('class', 'bias-title')
261
- .attr('x', cardWidth / 2)
262
- .attr('y', 20)
263
- .attr('text-anchor', 'middle')
264
- .text(bias.title);
265
-
266
- // Description with wrapping
267
- const descText = cardGroup.append('text')
268
- .attr('class', 'bias-description')
269
- .attr('x', cardWidth / 2)
270
- .attr('y', 36)
271
- .attr('text-anchor', 'middle')
272
- .attr('dy', 0)
273
- .text(bias.description);
274
-
275
- wrapText(descText, cardWidth - 20);
276
-
277
- // Example box (only if there's an example)
278
- if (bias.example) {
279
- const exampleY = cardHeight - 55;
280
- const exampleHeight = 24;
281
-
282
- cardGroup.append('rect')
283
- .attr('x', 8)
284
- .attr('y', exampleY)
285
- .attr('width', cardWidth - 16)
286
- .attr('height', exampleHeight)
287
- .attr('rx', 4)
288
- .attr('fill', colors[i])
289
- .attr('fill-opacity', 0.15)
290
- .attr('stroke', colors[i])
291
- .attr('stroke-width', 1)
292
- .attr('stroke-opacity', 0.4);
293
-
294
- // Example text
295
- cardGroup.append('text')
296
- .attr('class', 'bias-description')
297
- .attr('x', cardWidth / 2)
298
- .attr('y', exampleY + 13)
299
- .attr('text-anchor', 'middle')
300
- .attr('dominant-baseline', 'middle')
301
- .attr('font-size', 9)
302
- .text(bias.example);
303
- }
304
-
305
- // Reference link (if exists)
306
- if (bias.reference) {
307
- const refY = bias.example ? cardHeight - 8 : cardHeight - 12;
308
- const refLink = cardGroup.append('a')
309
- .attr('href', `https://${bias.reference}`)
310
- .attr('target', '_blank')
311
- .attr('rel', 'noopener noreferrer');
312
-
313
- refLink.append('text')
314
- .attr('class', 'example-label')
315
- .attr('x', cardWidth - 10)
316
- .attr('y', bias.reference2 ? refY - 10 : refY)
317
- .attr('text-anchor', 'end')
318
- .attr('font-size', 8)
319
- .attr('fill', colors[i])
320
- .attr('opacity', 0.7)
321
- .style('cursor', 'pointer')
322
- .style('text-decoration', 'underline')
323
- .text(bias.reference)
324
- .on('mouseenter', function() {
325
- d3.select(this).attr('opacity', 1);
326
- })
327
- .on('mouseleave', function() {
328
- d3.select(this).attr('opacity', 0.7);
329
- });
330
- }
331
-
332
- // Second reference link (if exists)
333
- if (bias.reference2) {
334
- const refY = bias.example ? cardHeight - 8 : cardHeight - 12;
335
- const refLink2 = cardGroup.append('a')
336
- .attr('href', `https://${bias.reference2}`)
337
- .attr('target', '_blank')
338
- .attr('rel', 'noopener noreferrer');
339
-
340
- refLink2.append('text')
341
- .attr('class', 'example-label')
342
- .attr('x', cardWidth - 10)
343
- .attr('y', refY)
344
- .attr('text-anchor', 'end')
345
- .attr('font-size', 8)
346
- .attr('fill', colors[i])
347
- .attr('opacity', 0.7)
348
- .style('cursor', 'pointer')
349
- .style('text-decoration', 'underline')
350
- .text(bias.reference2)
351
- .on('mouseenter', function() {
352
- d3.select(this).attr('opacity', 1);
353
- })
354
- .on('mouseleave', function() {
355
- d3.select(this).attr('opacity', 0.7);
356
- });
357
- }
358
- });
359
- }
360
-
361
- render();
362
-
363
- // Responsive handling
364
- if (window.ResizeObserver) {
365
- const ro = new ResizeObserver(() => render());
366
- ro.observe(container);
367
- } else {
368
- window.addEventListener('resize', render);
369
- }
370
- };
371
-
372
- if (document.readyState === 'loading') {
373
- document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
374
- } else {
375
- ensureD3(bootstrap);
376
- }
377
- })();
378
- </script>
app/src/content/embeds/d3-mmlu-heatmap.html CHANGED
@@ -173,44 +173,30 @@
173
  [43.6, 48.9, 49.5, 51.0, 51.3, 52.0, 52.8, 52.3] // DeciLM-7B
174
  ];
175
 
176
- // Colors: diverging palette (purple for low, yellow for high)
177
  const getDivergingColors = (count) => {
178
  try {
179
  if (window.ColorPalettes && typeof window.ColorPalettes.getColors === 'function') {
180
  return window.ColorPalettes.getColors('diverging', count);
181
  }
182
  } catch (_) { }
183
- // Fallback: diverging scale from purple (low) to yellow (high)
184
  const colors = [];
185
  for (let i = 0; i < count; i++) {
186
  const t = i / (count - 1);
187
- // Purple (dark) -> lighter purple -> green -> yellow
188
- if (t < 0.25) {
189
- // Dark purple to medium purple
190
- const r = Math.round(75 + (t / 0.25) * 50);
191
- const g = Math.round(0 + (t / 0.25) * 30);
192
- const b = Math.round(130 + (t / 0.25) * 50);
193
- colors.push(`rgb(${r}, ${g}, ${b})`);
194
- } else if (t < 0.5) {
195
- // Purple to blue-green
196
- const t2 = (t - 0.25) / 0.25;
197
- const r = Math.round(125 - t2 * 75);
198
- const g = Math.round(30 + t2 * 100);
199
- const b = Math.round(180 - t2 * 80);
200
- colors.push(`rgb(${r}, ${g}, ${b})`);
201
- } else if (t < 0.75) {
202
- // Blue-green to green
203
- const t2 = (t - 0.5) / 0.25;
204
- const r = Math.round(50 + t2 * 50);
205
- const g = Math.round(130 + t2 * 70);
206
- const b = Math.round(100 - t2 * 50);
207
  colors.push(`rgb(${r}, ${g}, ${b})`);
208
  } else {
209
- // Green to yellow
210
- const t2 = (t - 0.75) / 0.25;
211
- const r = Math.round(100 + t2 * 155);
212
- const g = Math.round(200 - t2 * 50);
213
- const b = Math.round(50 - t2 * 50);
214
  colors.push(`rgb(${r}, ${g}, ${b})`);
215
  }
216
  }
@@ -220,7 +206,7 @@
220
  const palette = getDivergingColors(10);
221
 
222
  let width = 900;
223
- const margin = { top: 10, right: 20, bottom: 20, left: 100 }; // Only left margin for model names
224
 
225
  function updateSize() {
226
  width = container.clientWidth || 900;
@@ -251,27 +237,8 @@
251
  }
252
 
253
  function getColorScale(values, minV, maxV) {
254
- const hasPalette = palette.length > 0;
255
- if (hasPalette && window.ColorPalettes && typeof window.ColorPalettes.getColors === 'function') {
256
- // Use quantile scale but with emphasis on extremes
257
- const sorted = [...values].sort((a, b) => a - b);
258
- const n = sorted.length;
259
- // Create custom quantiles that emphasize extremes
260
- const quantiles = [];
261
- for (let i = 0; i <= 10; i++) {
262
- const q = i / 10;
263
- // Apply a power transformation to emphasize extremes
264
- const transformedQ = q < 0.5
265
- ? Math.pow(q * 2, 1.5) / 2
266
- : 0.5 + Math.pow((q - 0.5) * 2, 1.5) / 2;
267
- const idx = Math.floor(transformedQ * (n - 1));
268
- quantiles.push(sorted[Math.min(idx, n - 1)]);
269
- }
270
- const scale = d3.scaleQuantile().domain(quantiles).range(palette);
271
- return (v) => scale(v);
272
- }
273
-
274
- // Fallback: non-linear scale that emphasizes extremes
275
  const linearScale = d3.scaleLinear()
276
  .domain([minV, maxV])
277
  .range([0, 1])
@@ -280,36 +247,30 @@
280
  return (v) => {
281
  const t = linearScale(v);
282
  // Apply power transformation to emphasize extremes
 
283
  let transformedT;
284
  if (t < 0.5) {
 
285
  transformedT = Math.pow(t * 2, 1.8) / 2;
286
  } else {
 
287
  transformedT = 0.5 + Math.pow((t - 0.5) * 2, 1.8) / 2;
288
  }
289
 
290
- // Purple (low) -> Green (mid) -> Yellow (high)
291
- if (transformedT < 0.25) {
292
- const r = Math.round(75 + (transformedT / 0.25) * 50);
293
- const g = Math.round(0 + (transformedT / 0.25) * 30);
294
- const b = Math.round(130 + (transformedT / 0.25) * 50);
295
- return `rgb(${r}, ${g}, ${b})`;
296
- } else if (transformedT < 0.5) {
297
- const t2 = (transformedT - 0.25) / 0.25;
298
- const r = Math.round(125 - t2 * 75);
299
- const g = Math.round(30 + t2 * 100);
300
- const b = Math.round(180 - t2 * 80);
301
- return `rgb(${r}, ${g}, ${b})`;
302
- } else if (transformedT < 0.75) {
303
- const t2 = (transformedT - 0.5) / 0.25;
304
- const r = Math.round(50 + t2 * 50);
305
- const g = Math.round(130 + t2 * 70);
306
- const b = Math.round(100 - t2 * 50);
307
  return `rgb(${r}, ${g}, ${b})`;
308
  } else {
309
- const t2 = (transformedT - 0.75) / 0.25;
310
- const r = Math.round(100 + t2 * 155);
311
- const g = Math.round(200 - t2 * 50);
312
- const b = Math.round(50 - t2 * 50);
 
313
  return `rgb(${r}, ${g}, ${b})`;
314
  }
315
  };
@@ -344,12 +305,12 @@
344
  const x = d3.scaleBand()
345
  .domain(d3.range(nCols))
346
  .range([0, gridWidth])
347
- .paddingInner(0.08);
348
 
349
  const y = d3.scaleBand()
350
  .domain(d3.range(nRows))
351
  .range([0, gridHeight])
352
- .paddingInner(0.08);
353
 
354
  // Flatten matrix data
355
  const flatData = [];
@@ -367,6 +328,20 @@
367
 
368
  gCells.attr('transform', `translate(${gridOffsetX}, ${gridOffsetY})`);
369
 
370
 
371
  const cells = gCells.selectAll('g.cell')
372
  .data(flatData, d => `${d.r}-${d.c}`);
@@ -376,8 +351,8 @@
376
  .attr('class', 'cell');
377
 
378
  cellsEnter.append('rect')
379
- .attr('rx', 3)
380
- .attr('ry', 3)
381
  .on('mousemove', (event, d) => {
382
  const [px, py] = d3.pointer(event, container);
383
  tipInner.innerHTML = `<strong>${d.model}</strong><br/>${d.format}<br/>Score: ${d.value.toFixed(1)}`;
@@ -400,9 +375,7 @@
400
  .attr('y', d => y(d.r))
401
  .attr('width', Math.max(1, x.bandwidth()))
402
  .attr('height', Math.max(1, y.bandwidth()))
403
- .attr('fill', d => colorScale(d.value))
404
- .attr('stroke', 'var(--border-color)')
405
- .attr('stroke-width', 0.5);
406
 
407
  cellsMerged.select('text')
408
  .attr('x', d => x(d.c) + x.bandwidth() / 2)
 
173
  [43.6, 48.9, 49.5, 51.0, 51.3, 52.0, 52.8, 52.3] // DeciLM-7B
174
  ];
175
 
176
+ // Colors: red to green palette (red for low, green for high)
177
  const getDivergingColors = (count) => {
178
  try {
179
  if (window.ColorPalettes && typeof window.ColorPalettes.getColors === 'function') {
180
  return window.ColorPalettes.getColors('diverging', count);
181
  }
182
  } catch (_) { }
183
+ // Fallback: red to green scale
184
  const colors = [];
185
  for (let i = 0; i < count; i++) {
186
  const t = i / (count - 1);
187
+ // Red (low) -> Yellow (mid) -> Green (high)
188
+ if (t < 0.5) {
189
+ // Red to yellow
190
+ const r = 255;
191
+ const g = Math.round(t * 2 * 255);
192
+ const b = 0;
 
193
  colors.push(`rgb(${r}, ${g}, ${b})`);
194
  } else {
195
+ // Yellow to green
196
+ const t2 = (t - 0.5) * 2;
197
+ const r = Math.round(255 - t2 * 255);
198
+ const g = 255;
199
+ const b = 0;
200
  colors.push(`rgb(${r}, ${g}, ${b})`);
201
  }
202
  }
 
206
  const palette = getDivergingColors(10);
207
 
208
  let width = 900;
209
+ const margin = { top: 0, right: 0, bottom: 0, left: 100 }; // Only left margin for model names
210
 
211
  function updateSize() {
212
  width = container.clientWidth || 900;
 
237
  }
238
 
239
  function getColorScale(values, minV, maxV) {
240
+ // Always use the custom red-to-green palette (fallback)
241
+ // Don't use ColorPalettes for this specific heatmap
 
242
  const linearScale = d3.scaleLinear()
243
  .domain([minV, maxV])
244
  .range([0, 1])
 
247
  return (v) => {
248
  const t = linearScale(v);
249
  // Apply power transformation to emphasize extremes
250
+ // Values near min/max get more extreme colors
251
  let transformedT;
252
  if (t < 0.5) {
253
+ // Compress lower values, making extremes more distinct
254
  transformedT = Math.pow(t * 2, 1.8) / 2;
255
  } else {
256
+ // Expand upper values, making extremes more distinct
257
  transformedT = 0.5 + Math.pow((t - 0.5) * 2, 1.8) / 2;
258
  }
259
 
260
+ // Red to green scale: red (low scores = bad) -> yellow (mid) -> green (high scores = good)
261
+ // Less flashy: reduce saturation
262
+ if (transformedT < 0.5) {
263
+ // Red to yellow (less saturated)
264
+ const r = 220;
265
+ const g = Math.round(80 + transformedT * 2 * 140);
266
+ const b = Math.round(60 + transformedT * 2 * 40);
267
  return `rgb(${r}, ${g}, ${b})`;
268
  } else {
269
+ // Yellow to green (less saturated)
270
+ const t2 = (transformedT - 0.5) * 2;
271
+ const r = Math.round(220 - t2 * 100);
272
+ const g = 220;
273
+ const b = Math.round(100 - t2 * 60);
274
  return `rgb(${r}, ${g}, ${b})`;
275
  }
276
  };
 
305
  const x = d3.scaleBand()
306
  .domain(d3.range(nCols))
307
  .range([0, gridWidth])
308
+ .paddingInner(0);
309
 
310
  const y = d3.scaleBand()
311
  .domain(d3.range(nRows))
312
  .range([0, gridHeight])
313
+ .paddingInner(0);
314
 
315
  // Flatten matrix data
316
  const flatData = [];
 
328
 
329
  gCells.attr('transform', `translate(${gridOffsetX}, ${gridOffsetY})`);
330
 
331
+ // Add rounded corners only on the outer edges of the matrix using clipPath
332
+ const cornerRadius = 6;
333
+ defs.selectAll('#matrix-clip').remove();
334
+ const clipPath = defs.append('clipPath')
335
+ .attr('id', 'matrix-clip');
336
+ clipPath.append('rect')
337
+ .attr('x', 0)
338
+ .attr('y', 0)
339
+ .attr('width', gridWidth)
340
+ .attr('height', gridHeight)
341
+ .attr('rx', cornerRadius)
342
+ .attr('ry', cornerRadius);
343
+
344
+ gCells.attr('clip-path', 'url(#matrix-clip)');
345
 
346
  const cells = gCells.selectAll('g.cell')
347
  .data(flatData, d => `${d.r}-${d.c}`);
 
351
  .attr('class', 'cell');
352
 
353
  cellsEnter.append('rect')
354
+ .attr('rx', 0)
355
+ .attr('ry', 0)
356
  .on('mousemove', (event, d) => {
357
  const [px, py] = d3.pointer(event, container);
358
  tipInner.innerHTML = `<strong>${d.model}</strong><br/>${d.format}<br/>Score: ${d.value.toFixed(1)}`;
 
375
  .attr('y', d => y(d.r))
376
  .attr('width', Math.max(1, x.bandwidth()))
377
  .attr('height', Math.max(1, y.bandwidth()))
378
+ .attr('fill', d => colorScale(d.value));
379
 
380
  cellsMerged.select('text')
381
  .attr('x', d => x(d.c) + x.bandwidth() / 2)
app/src/content/embeds/d3-sampling-metrics.html CHANGED
@@ -85,12 +85,6 @@
85
  letter-spacing: 0.05em;
86
  }
87
 
88
- .d3-sampling-metrics .section-title.sampling-metrics {
89
- stroke: var(--surface-bg);
90
- stroke-width: 10px;
91
- paint-order: stroke fill;
92
- }
93
-
94
  .d3-sampling-metrics .question-text {
95
  fill: var(--text-color);
96
  font-size: 14px;
@@ -287,11 +281,11 @@
287
 
288
  function render() {
289
  width = container.clientWidth || 800;
290
- height = Math.max(300, Math.round(width * 0.42));
291
 
292
  svg.attr('width', width).attr('height', height);
293
 
294
- const margin = { top: 50, right: 20, bottom: 20, left: 20 };
295
  const innerWidth = width - margin.left - margin.right;
296
  const innerHeight = height - margin.top - margin.bottom;
297
 
@@ -325,7 +319,7 @@
325
  const metricBoxHeight = 75;
326
 
327
  // Position samples in a row
328
- const samplesY = 40;
329
  const sampleSpacing = (innerWidth - sampleBoxWidth * samples.length) / (samples.length + 1);
330
 
331
  const sampleNodes = samples.map((d, i) => ({
@@ -352,10 +346,17 @@
352
  g.append('text')
353
  .attr('class', 'section-title')
354
  .attr('x', innerWidth / 2)
355
- .attr('y', samplesY - 20)
356
  .attr('text-anchor', 'middle')
357
  .text('5 SAMPLED GENERATIONS');
358
 
359
  // Draw connection lines from samples to metrics
360
  const linkGroup = g.append('g').attr('class', 'links');
361
 
@@ -483,14 +484,6 @@
483
  .attr('text-anchor', 'middle')
484
  .attr('fill', colors.metric)
485
  .text(d => d.result);
486
-
487
- // Ajouter "SAMPLING METRICS" en dernier pour qu'il soit au-dessus de tout
488
- g.append('text')
489
- .attr('class', 'section-title sampling-metrics')
490
- .attr('x', innerWidth / 2)
491
- .attr('y', metricsY - 20)
492
- .attr('text-anchor', 'middle')
493
- .text('SAMPLING METRICS');
494
  }
495
 
496
  render();
 
85
  letter-spacing: 0.05em;
86
  }
87
 
88
  .d3-sampling-metrics .question-text {
89
  fill: var(--text-color);
90
  font-size: 14px;
 
281
 
282
  function render() {
283
  width = container.clientWidth || 800;
284
+ height = Math.max(350, Math.round(width * 0.42));
285
 
286
  svg.attr('width', width).attr('height', height);
287
 
288
+ const margin = { top: 60, right: 20, bottom: 20, left: 20 };
289
  const innerWidth = width - margin.left - margin.right;
290
  const innerHeight = height - margin.top - margin.bottom;
291
 
 
319
  const metricBoxHeight = 75;
320
 
321
  // Position samples in a row
322
+ const samplesY = 20;
323
  const sampleSpacing = (innerWidth - sampleBoxWidth * samples.length) / (samples.length + 1);
324
 
325
  const sampleNodes = samples.map((d, i) => ({
 
346
  g.append('text')
347
  .attr('class', 'section-title')
348
  .attr('x', innerWidth / 2)
349
+ .attr('y', samplesY - 10)
350
  .attr('text-anchor', 'middle')
351
  .text('5 SAMPLED GENERATIONS');
352
 
353
+ g.append('text')
354
+ .attr('class', 'section-title')
355
+ .attr('x', innerWidth / 2)
356
+ .attr('y', metricsY - 10)
357
+ .attr('text-anchor', 'middle')
358
+ .text('SAMPLING METRICS');
359
+
360
  // Draw connection lines from samples to metrics
361
  const linkGroup = g.append('g').attr('class', 'links');
362
 
 
484
  .attr('text-anchor', 'middle')
485
  .attr('fill', colors.metric)
486
  .text(d => d.result);
487
  }
488
 
489
  render();
app/src/content/embeds/d3-text-metrics.html CHANGED
@@ -43,6 +43,10 @@
43
  transition: border-color 0.2s;
44
  }
45
 
 
 
  .d3-text-metrics .metric-name {
47
  font-size: 13px;
48
  font-weight: 600;
 
43
  transition: border-color 0.2s;
44
  }
45
 
46
+ .d3-text-metrics .metric-box:hover {
47
+ border-color: var(--primary-color);
48
+ }
49
+
50
  .d3-text-metrics .metric-name {
51
  font-size: 13px;
52
  font-weight: 600;
app/src/content/embeds/d3-tokenization-timeline.html CHANGED
@@ -1,5 +1,5 @@
1
  <div class="d3-prompt-evolution">
2
- <svg viewBox="0 0 900 370" xmlns="http://www.w3.org/2000/svg">
3
  <defs>
4
  <marker id="arrowhead-prompt" markerWidth="10" markerHeight="10" refX="9" refY="3" orient="auto">
5
  <polygon points="0 0, 10 3, 0 6" fill="currentColor" />
 
1
  <div class="d3-prompt-evolution">
2
+ <svg viewBox="0 0 900 500" xmlns="http://www.w3.org/2000/svg">
3
  <defs>
4
  <marker id="arrowhead-prompt" markerWidth="10" markerHeight="10" refX="9" refY="3" orient="auto">
5
  <polygon points="0 0, 10 3, 0 6" fill="currentColor" />
app/src/content/embeds/d3-vibe-checks.html DELETED
@@ -1,338 +0,0 @@
-<div class="d3-vibe-checks"></div>
-
-<style>
-  .d3-vibe-checks {
-    font-family: var(--default-font-family);
-    background: transparent !important;
-    border: none !important;
-    border-radius: 0 !important;
-    padding: var(--spacing-4) 0;
-    width: 100%;
-    margin: 0 auto;
-    position: relative;
-    box-shadow: none !important;
-  }
-
-  .d3-vibe-checks svg {
-    width: 100%;
-    height: auto;
-    display: block;
-  }
-
-  .d3-vibe-checks .card-rect {
-    stroke-width: 2;
-    transition: all 0.3s ease;
-  }
-
-  .d3-vibe-checks .card-title {
-    fill: var(--text-color);
-    font-size: 13px;
-    font-weight: 700;
-  }
-
-  .d3-vibe-checks .card-question {
-    fill: var(--text-color);
-    font-size: 12px;
-    font-weight: 500;
-    font-style: italic;
-  }
-
-  .d3-vibe-checks .card-label {
-    fill: var(--muted-color);
-    font-size: 10px;
-    font-weight: 600;
-    text-transform: uppercase;
-    letter-spacing: 0.05em;
-  }
-
-  .d3-vibe-checks .header-text {
-    fill: var(--text-color);
-    font-size: 12px;
-    font-weight: 700;
-    text-transform: uppercase;
-    letter-spacing: 0.05em;
-  }
-
-  @media (max-width: 768px) {
-    .d3-vibe-checks .card-title {
-      font-size: 11px;
-    }
-
-    .d3-vibe-checks .card-question {
-      font-size: 10px;
-    }
-  }
-</style>
-
-<script>
-  (() => {
-    const ensureD3 = (cb) => {
-      if (window.d3 && typeof window.d3.select === 'function') return cb();
-      let s = document.getElementById('d3-cdn-script');
-      if (!s) {
-        s = document.createElement('script');
-        s.id = 'd3-cdn-script';
-        s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
-        document.head.appendChild(s);
-      }
-      const onReady = () => {
-        if (window.d3 && typeof window.d3.select === 'function') cb();
-      };
-      s.addEventListener('load', onReady, { once: true });
-      if (window.d3) onReady();
-    };
-
-    const bootstrap = () => {
-      const scriptEl = document.currentScript;
-      let container = scriptEl ? scriptEl.previousElementSibling : null;
-      if (!(container && container.classList && container.classList.contains('d3-vibe-checks'))) {
-        const candidates = Array.from(document.querySelectorAll('.d3-vibe-checks'))
-          .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
-        container = candidates[candidates.length - 1] || null;
-      }
-
-      if (!container) return;
-
-      if (container.dataset) {
-        if (container.dataset.mounted === 'true') return;
-        container.dataset.mounted = 'true';
-      }
-
-      // Get colors from ColorPalettes or fallback
-      const getColors = () => {
-        if (window.ColorPalettes && typeof window.ColorPalettes.getColors === 'function') {
-          return window.ColorPalettes.getColors('categorical', 3);
-        }
-        return ['#1f77b4', '#ff7f0e', '#2ca02c'];
-      };
-
-      // Vibe-check examples
-      const vibeChecks = [
-        {
-          id: 'strawberry',
-          title: 'Letter Counting',
-          question: 'How many "r"s in "strawberry"?',
-          category: 'Reasoning',
-          answers: [
-            { label: 'Model A', text: '3', correct: true },
-            { label: 'Model B', text: '2', correct: false }
-          ]
-        },
-        {
-          id: 'numbers',
-          title: 'Number Comparison',
-          question: 'Is 9.9 bigger or smaller than 9.11?',
-          category: 'Math',
-          answers: [
-            { label: 'Model A', text: '9.9 < 9.11', correct: false },
-            { label: 'Model B', text: '9.9 > 9.11', correct: true }
-          ]
-        },
-        {
-          id: 'tikz',
-          title: 'Creative Generation',
-          question: 'Draw a unicorn in TikZ',
-          category: 'Coding',
-          answers: [
-            { label: 'Model A', text: '\\draw[...] unicorn', correct: true },
-            { label: 'Model B', text: 'Error: invalid', correct: false }
-          ]
-        }
-      ];
-
-      // Function to draw model answers
-      function drawAnswers(g, x, y, width, answers, color) {
-        const answerHeight = 25;
-        const answerSpacing = 8;
-        const startY = y;
-
-        answers.forEach((answer, i) => {
-          const answerY = startY + i * (answerHeight + answerSpacing);
-
-          // Answer box
-          const boxGroup = g.append('g');
-
-          boxGroup.append('rect')
-            .attr('x', x - width / 2 + 10)
-            .attr('y', answerY)
-            .attr('width', width - 20)
-            .attr('height', answerHeight)
-            .attr('rx', 6)
-            .attr('fill', color)
-            .attr('fill-opacity', answer.correct ? 0.2 : 0.08)
-            .attr('stroke', color)
-            .attr('stroke-width', 1.5)
-            .attr('stroke-opacity', answer.correct ? 0.6 : 0.3);
-
-          // Combined label and answer text
-          const labelText = answer.label + ': ';
-          const combinedText = labelText + answer.text;
-
-          boxGroup.append('text')
-            .attr('x', x - width / 2 + 18)
-            .attr('y', answerY + answerHeight / 2)
-            .attr('dominant-baseline', 'middle')
-            .attr('font-size', 11)
-            .attr('fill', color)
-            .html(() => {
-              return `<tspan font-weight="600" opacity="0.8" font-size="10">${answer.label}: </tspan><tspan font-weight="${answer.correct ? 600 : 400}" opacity="${answer.correct ? 1 : 0.6}">${answer.text}</tspan>`;
-            });
-
-          // Checkmark or X
-          if (answer.correct) {
-            boxGroup.append('text')
-              .attr('x', x + width / 2 - 28)
-              .attr('y', answerY + answerHeight / 2)
-              .attr('dominant-baseline', 'middle')
-              .attr('font-size', 14)
-              .attr('font-weight', 700)
-              .attr('fill', color)
-              .text('✓');
-          } else {
-            boxGroup.append('text')
-              .attr('x', x + width / 2 - 28)
-              .attr('y', answerY + answerHeight / 2)
-              .attr('dominant-baseline', 'middle')
-              .attr('font-size', 14)
-              .attr('font-weight', 400)
-              .attr('fill', color)
-              .attr('opacity', 0.4)
-              .text('✗');
-          }
-        });
-      }
-
-      const svg = d3.select(container).append('svg');
-      const g = svg.append('g');
-
-      let width = 800;
-      let height = 300;
-
-      // Helper function to wrap text
-      function wrapText(text, width) {
-        text.each(function() {
-          const text = d3.select(this);
-          const words = text.text().split(/\s+/).reverse();
-          let word;
-          let line = [];
-          let lineNumber = 0;
-          const lineHeight = 1.2;
-          const y = text.attr('y');
-          const x = text.attr('x');
-          const dy = parseFloat(text.attr('dy') || 0);
-          let tspan = text.text(null).append('tspan')
-            .attr('x', x)
-            .attr('y', y)
-            .attr('dy', dy + 'em');
-
-          while ((word = words.pop())) {
-            line.push(word);
-            tspan.text(line.join(' '));
-            if (tspan.node().getComputedTextLength() > width) {
-              line.pop();
-              tspan.text(line.join(' '));
-              line = [word];
-              tspan = text.append('tspan')
-                .attr('x', x)
-                .attr('y', y)
-                .attr('dy', ++lineNumber * lineHeight + dy + 'em')
-                .text(word);
-            }
-          }
-        });
-      }
-
-      function render() {
-        width = container.clientWidth || 800;
-        height = Math.max(250, Math.round(width * 0.4));
-
-        svg.attr('width', width).attr('height', height);
-
-        const margin = { top: 40, right: 20, bottom: 20, left: 20 };
-        const innerWidth = width - margin.left - margin.right;
-        const innerHeight = height - margin.top - margin.bottom;
-
-        g.attr('transform', `translate(${margin.left},${margin.top})`);
-
-        // Clear previous content
-        g.selectAll('*').remove();
-
-        const colors = getColors();
-
-        // Header
-        g.append('text')
-          .attr('class', 'header-text')
-          .attr('x', innerWidth / 2)
-          .attr('y', -15)
-          .attr('text-anchor', 'middle')
-          .text('VIBE-CHECK EXAMPLES');
-
-        // Calculate card dimensions
-        const cardSpacing = Math.min(20, innerWidth * 0.03);
-        const cardWidth = (innerWidth - cardSpacing * 2) / 3;
-        const cardHeight = innerHeight * 0.85;
-        const cardY = innerHeight * 0.1;
-
-        // Draw cards
-        vibeChecks.forEach((check, i) => {
-          const x = i * (cardWidth + cardSpacing);
-
-          const cardGroup = g.append('g')
-            .attr('transform', `translate(${x},${cardY})`);
-
-          // Card background with frame
-          cardGroup.append('rect')
-            .attr('class', 'card-rect')
-            .attr('width', cardWidth)
-            .attr('height', cardHeight)
-            .attr('rx', 12)
-            .attr('fill', colors[i])
-            .attr('fill-opacity', 0.12)
-            .attr('stroke', colors[i])
-            .attr('stroke-opacity', 0.6)
-            .attr('stroke-width', 2);
-
-          // Title
-          cardGroup.append('text')
-            .attr('class', 'card-title')
-            .attr('x', cardWidth / 2)
-            .attr('y', 25)
-            .attr('text-anchor', 'middle')
-            .text(check.title);
-
-          // Question with wrapping
-          const questionText = cardGroup.append('text')
-            .attr('class', 'card-question')
-            .attr('x', cardWidth / 2)
-            .attr('y', 45)
-            .attr('text-anchor', 'middle')
-            .attr('dy', 0)
-            .text(check.question);
-
-          // Apply text wrapping
-          wrapText(questionText, cardWidth - 30);
-
-          // Model answers
-          const answersY = cardHeight * 0.5;
-          drawAnswers(cardGroup, cardWidth / 2, answersY, cardWidth, check.answers, colors[i]);
-        });
-      }
-
-      render();
-
-      // Responsive handling
-      if (window.ResizeObserver) {
-        const ro = new ResizeObserver(() => render());
-        ro.observe(container);
-      } else {
-        window.addEventListener('resize', render);
-      }
-    };
-
-    if (document.readyState === 'loading') {
-      document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
-    } else {
-      ensureD3(bootstrap);
-    }
-  })();
-</script>
app/src/styles/components/_table.css CHANGED
@@ -10,7 +10,7 @@
   border-bottom: 1px solid var(--border-color);
   padding: 6px 8px;
   font-size: 15px;
-  /* white-space: nowrap; */
+  white-space: nowrap;
   /* prevent squashing; allow horizontal scroll instead */
   word-break: auto-phrase;
   /* white-space: break-spaces; */