Update app/src/content/chapters/intro.mdx

#1
by guipenedo HF Staff - opened
README.md CHANGED
@@ -5,7 +5,7 @@ emoji: 📝
5
  colorFrom: blue
6
  colorTo: indigo
7
  sdk: docker
8
- pinned: true
9
  header: mini
10
  app_port: 8080
11
  tags:
 
5
  colorFrom: blue
6
  colorTo: indigo
7
  sdk: docker
8
+ pinned: false
9
  header: mini
10
  app_port: 8080
11
  tags:
app/src/content/article.mdx CHANGED
@@ -1,12 +1,12 @@
1
  ---
2
- title: "The LLM Evaluation Guidebook"
3
- subtitle: "All the things you could want to know about LLM evaluation based on our experience scoring 15000 models over 3 years"
4
  description: "Understanding the tips and tricks of evaluating an LLM in 2025"
5
  authors:
6
  - name: "Clémentine Fourrier"
7
  url: "https://huggingface.co/clefourrier"
8
  affiliations: [1]
9
- - name: "Thibaud Frere"
10
  url: "https://huggingface.co/tfrere"
11
  affiliations: [1]
12
  - name: "Guilherme Penedo"
@@ -18,7 +18,7 @@ authors:
18
  affiliations:
19
  - name: "Hugging Face"
20
  url: "https://huggingface.co"
21
- published: "Dec. 03, 2025"
22
  tags:
23
  - research
24
  - evaluation
@@ -47,21 +47,7 @@ Now that you have an idea of why evaluation is important to different people, le
47
 
48
  ## Evaluating with existing benchmarks
49
 
50
- Now that you've gotten (re)acquainted with required basics on how tokenization and inference work, and what are the caveats when doing evalution, let's look at actual benchmarking! We'll first do a small tour of 2025 evaluations, then discuss what to look at in a benchamrk, and why you probably can't reproduce announcements scores. Lastly, we'll cover the special case of selecting good benchmark to evaluate training with the FineWeb team.
51
-
52
- <Note title="Important concepts" emoji="⚠️" variant="info">
53
- In this section, you'll see two concepts mentionned quite a lot: contamination and saturation.
54
-
55
- **Saturation** is when model performance on a benchmark passes human performance. More generally, the term is used for datasets that are no longer considered useful, as they have lost discriminative power between models.
56
- <Sidenote> It's what you observe in the banner picture! </Sidenote>
57
-
58
- *If all models have close to the highest possible score on your evaluation, it's no longer a discriminative benchmark. It's similar to evaluating high school students on pre-school problems: success tells you nothing (though failure is indicative).*
59
-
60
- **Contamination** is when an evaluation dataset ended up in the training dataset of models, in which case the performance of models is artificially inflated, and does not reflect real world performance on the task.
61
-
62
- *It's a bit like evaluating a student on questions it already knows in advance.*
63
-
64
- </Note>
65
 
66
  ### Benchmarks to know in 2025
67
 
@@ -151,6 +137,3 @@ Key things I hope you'll remember are:
151
  To conclude: The models we build are only as good as our ability to measure what matters. Thanks for reading!
152
 
153
 
154
- ### Acknowledgments
155
-
156
- Many thanks to all the people who contributed directly or indirectly to this document, notably Hynek Kydlicek, Loubna Ben Allal, Sander Land and Nathan Habib.
 
1
  ---
2
+ title: "The Evaluation Guidebook"
3
+ subtitle: "Understanding the tips and tricks of evaluating an LLM in 2025"
4
  description: "Understanding the tips and tricks of evaluating an LLM in 2025"
5
  authors:
6
  - name: "Clémentine Fourrier"
7
  url: "https://huggingface.co/clefourrier"
8
  affiliations: [1]
9
+ - name: "Thibaud Frère"
10
  url: "https://huggingface.co/tfrere"
11
  affiliations: [1]
12
  - name: "Guilherme Penedo"
 
18
  affiliations:
19
  - name: "Hugging Face"
20
  url: "https://huggingface.co"
21
+ published: "Dec. 01, 2025"
22
  tags:
23
  - research
24
  - evaluation
 
47
 
48
  ## Evaluating with existing benchmarks
49
 
50
+ Now that you've gotten (re)acquainted with the required basics of how tokenization and inference work, and with the caveats of doing evaluation, let's look at actual benchmarking! We'll first do a small tour of 2025 evaluations, then discuss what to look for in a benchmark, and why you probably can't reproduce announcement scores. Lastly, we'll cover the special case of selecting good benchmarks to evaluate training, with the FineWeb team.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
51
 
52
  ### Benchmarks to know in 2025
53
 
 
137
  To conclude: The models we build are only as good as our ability to measure what matters. Thanks for reading!
138
 
139
 
 
 
 
app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx CHANGED
@@ -8,7 +8,6 @@ import HtmlEmbed from "../../../components/HtmlEmbed.astro";
8
  import Image from "../../../components/Image.astro";
9
  import UsingHumanAnnotators from "../human-evaluation/using-human-annotators.mdx";
10
  import envImage from '../../assets/image/env.png';
11
- import Wide from "../../../components/Wide.astro";
12
 
13
  ### Dataset
14
 
@@ -24,8 +23,6 @@ When aggregating datasets, pay attention to whether
24
 
25
  <Sidenote>Examples: MMLU, Big-Bench (hundreds of diverse tasks), and HELM (combines multiple existing benchmarks for holistic evaluation)</Sidenote>
26
 
27
- New research by EpochAI (2025) showcases how to [best aggregate benchmarks together under a single framework](https://epoch.ai/blog/a-rosetta-stone-for-ai-benchmarks) to make the aggregated dataset harder overall and less prone to saturation.
28
-
29
  <UsingHumanAnnotators />
30
 
31
  #### Creating a dataset synthetically
@@ -33,7 +30,7 @@ New research by EpochAI (2025) showcases how to [best aggregate benchmarks toget
33
 
34
  If your task allows, using procedurally generated benchmarks is a very good way to get a virtually infinite supply of samples and avoid contamination! They can generate unlimited fresh test cases algorithmically, while controlling difficulty and enabling automatic verification, ensuring models haven't seen examples during training.
35
 
36
- For some examples, you can look at [NPHardEval](https://arxiv.org/abs/2312.14890), [DyVal](https://arxiv.org/abs/2309.17167), [MuSR](https://arxiv.org/abs/2310.16049), [BabiQA](https://arxiv.org/abs/1502.05698), [ZebraLogic](https://arxiv.org/pdf/2502.01100), IFEval, or GSMTemplate among others. **NPHardEval** generates complexity-grounded tasks like graph problems with automatic verification and monthly refreshes to reduce overfitting. **MuSR** creates complex reasoning instances like 1000-word murder mysteries using neurosymbolic generation. **ZebraLogic** algorithmically produces logic grid puzzles by generating solutions and iteratively minimizing clues using SAT solvers. **BabiQA** simulates entities following successions of actions. **IFEval** tests instruction-following with 500+ prompts containing verifiable constraints like word counts that can be checked programmatically. **GSM-Symbolic** uses templates to generate diverse math questions.
37
 
38
  Tasks which usually fit this paradigm test mathematical, logical, or coding abilities.
39
 
@@ -236,9 +233,7 @@ However, when doing evaluation with humans, you need to make sure your annotator
236
 
237
  Different approaches exist to evaluate models with humans in the loop.
238
 
239
- **Vibe-checks** is the name given to manual evaluations done by individual members of the community, usually on undisclosed prompts, to get an overall "feeling" of how well models perform on their use cases of preference. (I've also seen the term "canary-testing" used for this, in reference to high signal canary in a coalmine approach). Said use cases can be anything from the most exciting to the most mundate - to cite some I've seen on Reddit, they covered legal questions in German, coding, tool use, quality of erotica written, etc. Often shared on forums or social media, they mostly constitute anecdotal evidence, and tend to be highly sensitive to confirmation bias (in other words, people tend to find what they look for).
240
-
241
- <HtmlEmbed src="d3-vibe-checks.html"/>
242
 
243
  Using community feedback to establish massive model rankings is what we call an **arena**. A well known example of this is the [LMSYS chatbot arena](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard), where community users are asked to chat with models until they find one is better than the other. Votes are then aggregated in an Elo ranking (a ranking of matches) to select which model is "the best". The obvious problem of such an approach is the high subjectivity - it's hard to enforce a consistent grading from many community members using broad guidelines, especially since annotators preferences tend to be [culturally bound](https://arxiv.org/abs/2404.16019v1) (with different people favoring different discussion topics, for example). One can hope that this effect is smoothed over by the sheer scale of the votes, through a "wisdom of the crowd" effect (this effect was found by a statistician named Galton, who observed that individual answers trying to estimate a numerical value, like the weight of a hog, could be modeled as a probability distribution centered around the actual answer).
244
 
@@ -252,9 +247,11 @@ Once you want to scale to more systematic evaluation with paid annotators, you'l
252
  Pros of systematic human evaluations, especially with paid annotators, are that you're **getting high quality and private data** adapted to your use case (especially if you rely on in house annotators), which are mostly **explainable** (scores obtained by the models will be explainable by the humans who gave them).
253
  However, it's more costly (especially as you'll most likely need rounds of annotations to adapt your guidelines) and does not scale well.
254
 
255
- Overall, however, human evaluation has a number of well known biases, based first impressions, tone, alignement with annotators value, etc, see the figure below.
256
-
257
- <HtmlEmbed src="d3-human-biases.html"/>
 
 
258
 
259
  These biases are not unexpected, but they must be taken into account: not all use cases should rely on using cheap human annotators - any task requiring factuality (such as code writing, evaluation of model knowledge, etc) should include another, more robust, type of evaluation to complete the benchmark (experts, automatic metrics if applicable, etc).
260
 
@@ -285,10 +282,8 @@ People in favor of judge LLMs have been claiming they provide better:
285
 
286
  In my opinion, using LLM judges correctly is extremely tricky, and it's **easy to be deceived for critical use cases**:
287
  - LLM as judges seem objective, but they have many **hidden biases** that can be harder to detect than the ones in humans, since we're not as actively looking for them (see below). Besides, there are ways to reduce human bias by designing survey questions in specific and statistically robust ways (which has been studied in sociology for about a century), where LLM-prompting is not as robust yet. Using LLMs to evaluate LLMs has been compared to creating an echo-chamber effect, by reinforcing biases subtly.
288
- - They are indeed scalable, but contribute to creating **massive amounts of data** which themselves need to be examined to ensure their quality (for example, you can improve the quality of LLM-judges by asking them to generate a thinking trace, or reasoning around their data, which makes even more new artificial data to analyse)
289
- - They are indeed cheap to instantiate, but are not as good as paying actual expert human annotators for your specific use cases.
290
-
291
- <HtmlEmbed src="d3-llm-biases.html"/>
292
 
293
  This section is therefore a bit long, because you need to be well aware of the limitations of using models as judges: a lot of people are blindly jumping into using them because they seem easier than actually working with humans or designing new metrics, but then end up with uninterpretable data with tricky biases to extract.
294
 
@@ -421,27 +416,41 @@ You need to decide what your threshold for acceptance is. Depending on how hard
421
 
422
  #### Tips and tricks
423
 
424
- <Note title="Mitigating well known biases of LLM as judges" emoji="⚠️" variant="warning">
425
- We discussed in this section's [intro](http://localhost:4321/#pros-and-cons-of-using-judge-llms) a number of LLM judges biases. Let's see how you should try to mitigate them.
426
 
 
427
  **Lack of internal consistency**:
 
 
428
  ➡️ You can mitigate this by doing self-consistency prompting of your judge, prompting it multiple times and keeping the majority output
429
 
430
- **Self-preference**:
 
 
431
  ➡️ You can mitigate this by using a jury
432
 
433
- **Blindness to input perturbation**:
 
 
 
 
434
  ➡️ asking the model to explain its reasoning [before providing a score](https://twitter.com/seungonekim/status/1749289437165769177)
435
  ➡️ or providing a coherent grading scale in the prompt.
436
 
437
- **Position-bias**:
 
438
  ➡️ switching answer positions randomly
439
  ➡️ computing the log-probabilities of all possible choices to get a normalized answer
440
 
441
- **Verbosity-bias** (or length-bias):
 
442
  ➡️ You can mitigate this by [accounting for the answer difference in length](https://arxiv.org/abs/2404.04475)
443
 
444
- **Format bias**:
 
 
 
 
445
  ➡️ You can mitigate this by paying attention to the training prompt format (if the model was instruction tuned) and ensuring you follow it.
446
  </Note>
447
 
@@ -554,11 +563,9 @@ You can also compute these with prompt variations, by asking the same questions
554
  ### Cost and efficiency
555
 
556
 
557
- When designing and reporting evaluation results, we need to start collectively reporting results against model running costs! A reasoning model which requires 10 minutes of thinking and 10K tokens to answer 10 + 1 (because it decides to make an entire segue on binary vs decimal arithmetics) is considerably less efficient than a smol model answering 30 in a handful of tokens.
558
 
559
- <div className="card" style="height: fit-content; max-width: 75%; margin: 40px auto;">
560
- <img src={envImage.src} alt="Environmental impact metrics for model evaluation" style="height: auto !important; object-fit: contain !important; display: block; margin: 0 auto;" />
561
- </div>
562
 
563
  We suggest you report the following:
564
  - **Token consumption**: Report the total number of output tokens used during evaluation. This is particularly important to estimate **efficiency**, and it will affect the cost of model as judge evaluations. Token counts directly impact monetary costs and help others estimate the computational requirements. **Monetary cost** can also be a good proxy for efficiency.
 
8
  import Image from "../../../components/Image.astro";
9
  import UsingHumanAnnotators from "../human-evaluation/using-human-annotators.mdx";
10
  import envImage from '../../assets/image/env.png';
 
11
 
12
  ### Dataset
13
 
 
23
 
24
  <Sidenote>Examples: MMLU, Big-Bench (hundreds of diverse tasks), and HELM (combines multiple existing benchmarks for holistic evaluation)</Sidenote>
25
 
 
 
26
  <UsingHumanAnnotators />
27
 
28
  #### Creating a dataset synthetically
 
30
 
31
  If your task allows, using procedurally generated benchmarks is a very good way to get a virtually infinite supply of samples and avoid contamination! They can generate unlimited fresh test cases algorithmically, while controlling difficulty and enabling automatic verification, ensuring models haven't seen examples during training.
32
 
33
+ For some examples, you can look at [NPHardEval](https://arxiv.org/abs/2312.14890), [DyVal](https://arxiv.org/abs/2309.17167), [MuSR](https://arxiv.org/abs/2310.16049), [BabiQA](https://arxiv.org/abs/1502.05698), [ZebraLogic](https://arxiv.org/pdf/2502.01100), IFEval, or GSM-Symbolic among others. **NPHardEval** generates complexity-grounded tasks like graph problems with automatic verification and monthly refreshes to reduce overfitting. **MuSR** creates complex reasoning instances like 1000-word murder mysteries using neurosymbolic generation. **ZebraLogic** algorithmically produces logic grid puzzles by generating solutions and iteratively minimizing clues using SAT solvers. **BabiQA** simulates worlds with entities and actions, with Dyna-bAbI providing fine-grained control over task generation. **IFEval** tests instruction-following with 500+ prompts containing verifiable constraints like word counts that can be checked programmatically. **GSM-Symbolic** uses templates to generate diverse math questions for controllable evaluation.
34
 
35
  Tasks which usually fit this paradigm test mathematical, logical, or coding abilities.
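
To make this concrete, here is a minimal, hypothetical sketch of the procedural idea (not taken from any of the benchmarks above): a templated arithmetic question whose entities and numbers are sampled at generation time, so the gold answer is computed rather than annotated, and verification is a simple programmatic check.

```python
import random

# Hypothetical template: names, items and numbers are drawn at generation time,
# so every sample is fresh and the gold answer is computed, not annotated.
NAMES = ["Ada", "Bo", "Chen"]
ITEMS = ["apples", "marbles", "stickers"]

def generate_sample(rng: random.Random) -> dict:
    name, item = rng.choice(NAMES), rng.choice(ITEMS)
    start, bought, given = rng.randint(5, 50), rng.randint(1, 20), rng.randint(1, 5)
    question = (
        f"{name} has {start} {item}, buys {bought} more, "
        f"then gives {given} away. How many {item} does {name} have now?"
    )
    return {"question": question, "answer": start + bought - given}

def is_correct(model_output: str, gold: int) -> bool:
    # Automatic verification: compare the last number in the output to the gold answer.
    tokens = [t.strip(".,!?") for t in model_output.split()]
    numbers = [t for t in tokens if t.lstrip("-").isdigit()]
    return bool(numbers) and int(numbers[-1]) == gold

rng = random.Random(0)
sample = generate_sample(rng)
print(sample["question"])
print(is_correct("The answer is 42.", sample["answer"]))
```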
36
 
 
233
 
234
  Different approaches exist to evaluate models with humans in the loop.
235
 
236
+ **Vibe-checks** is the name given to manual evaluations done by individual members of the community, usually on undisclosed prompts, to get an overall "feeling" of how well models perform on their use cases of preference. (I've also seen the term "canary-testing" used for this, in reference to the high-signal canary-in-a-coal-mine approach.) Said use cases can be anything from the most exciting to the most mundane - to cite some I've seen on Reddit, they covered legal questions in German, coding, the ability to generate tikz unicorns, tool use, the quality of the erotica written, etc. Often shared on forums or social media, they mostly constitute anecdotal evidence, and tend to be highly sensitive to confirmation bias (in other words, people tend to find what they look for).
 
 
237
 
238
  Using community feedback to establish massive model rankings is what we call an **arena**. A well known example of this is the [LMSYS chatbot arena](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard), where community users are asked to chat with models until they find one is better than the other. Votes are then aggregated in an Elo ranking (a ranking of matches) to select which model is "the best". The obvious problem of such an approach is the high subjectivity - it's hard to enforce a consistent grading from many community members using broad guidelines, especially since annotators preferences tend to be [culturally bound](https://arxiv.org/abs/2404.16019v1) (with different people favoring different discussion topics, for example). One can hope that this effect is smoothed over by the sheer scale of the votes, through a "wisdom of the crowd" effect (this effect was found by a statistician named Galton, who observed that individual answers trying to estimate a numerical value, like the weight of a hog, could be modeled as a probability distribution centered around the actual answer).
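
For intuition, here is a bare-bones sketch of how such pairwise votes can be turned into a ranking with Elo-style updates. It is a simplification (the arena itself relies on a more careful statistical fit rather than sequential online updates), and the model names and votes are made up.

```python
from collections import defaultdict

K = 32  # update size; a common default in Elo-style ratings

def elo_update(ratings: dict, winner: str, loser: str) -> None:
    """One pairwise vote: move the winner up and the loser down."""
    expected_win = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += K * (1 - expected_win)
    ratings[loser] -= K * (1 - expected_win)

ratings = defaultdict(lambda: 1000.0)  # every model starts at the same score
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]
for winner, loser in votes:
    elo_update(ratings, winner, loser)

for model, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {score:.0f}")
```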
239
 
 
247
  Pros of systematic human evaluations, especially with paid annotators, are that you're **getting high quality and private data** adapted to your use case (especially if you rely on in house annotators), which are mostly **explainable** (scores obtained by the models will be explainable by the humans who gave them).
248
  However, it's more costly (especially as you'll most likely need rounds of annotations to adapt your guidelines) and does not scale well.
249
 
250
+ Overall, however, human evaluation has a number of well known biases:
251
+ - **First impressions bias**: Human evaluators tend to estimate the quality of answers [based on first impressions](https://arxiv.org/abs/2309.16349), instead of actual factuality or faithfulness.
252
+ - **Tone bias**: Crowdsourced annotators are notably very sensitive to tone, and underestimate the number of factual or logical errors in an assertive answer. In other terms, if a model says wrong things in a confident tone, human evaluators are much less likely to notice it, which could skew ratings towards the more assertive models. (Expert annotators are less likely to fall prey to these biases.)
253
+ - **Self-preference bias**: Humans are [most likely to prefer answers which appeal to their views or align with their opinions or errors](https://arxiv.org/abs/2310.13548), rather than answers which are factually correct.
254
+ - **Identity bias**: People with different identities tend to have different values, and rate model answers very differently (for example on [toxicity](https://arxiv.org/abs/2205.00501))
255
 
256
  These biases are not unexpected, but they must be taken into account: not all use cases should rely on using cheap human annotators - any task requiring factuality (such as code writing, evaluation of model knowledge, etc) should include another, more robust, type of evaluation to complete the benchmark (experts, automatic metrics if applicable, etc).
257
 
 
282
 
283
  In my opinion, using LLM judges correctly is extremely tricky, and it's **easy to be deceived for critical use cases**:
284
  - LLM as judges seem objective, but they have many **hidden biases** that can be harder to detect than the ones in humans, since we're not as actively looking for them (see below). Besides, there are ways to reduce human bias by designing survey questions in specific and statistically robust ways (which has been studied in sociology for about a century), where LLM-prompting is not as robust yet. Using LLMs to evaluate LLMs has been compared to creating an echo-chamber effect, by reinforcing biases subtly.
285
+ - They are indeed scalable, but contribute to creating massive amounts of data which themselves need to be examined to ensure their quality (for example, you can improve the quality of LLM-judges by asking them to generate a thinking trace, or reasoning around their data, which makes even more new artificial data to analyse)
286
+ - They are indeed cheap to instantiate, but paying actual expert human annotators is likely to give you qualitatively better results for your specific use cases.
 
 
287
 
288
  This section is therefore a bit long, because you need to be well aware of the limitations of using models as judges: a lot of people are blindly jumping into using them because they seem easier than actually working with humans or designing new metrics, but then end up with uninterpretable data with tricky biases to extract.
289
 
 
416
 
417
  #### Tips and tricks
418
 
419
+ **Mitigating well known biases of LLM as judges**
 
420
 
421
+ <Note title="Known LLM judge biases and mitigations" emoji="⚠️" variant="warning">
422
  **Lack of internal consistency**:
423
+
424
+ A judge might give you different judgments if you prompt it several times (if the temperature is not 0)
425
  ➡️ You can mitigate this by doing self-consistency prompting of your judge, prompting it multiple times and keeping the majority output
426
 
427
+ **Self-preference**
428
+
429
+ Models tend to [favor their own outputs](https://arxiv.org/abs/2404.13076) when scoring answers
430
  ➡️ You can mitigate this by using a jury
431
 
432
+ **Blindness to input perturbation**
433
+
434
+ Models are bad at identifying [perturbed input](https://arxiv.org/abs/2406.13439) and, relatedly, [bad at providing consistent score ranges](https://twitter.com/aparnadhinak/status/1748368364395721128) (extended experiments on this [here](https://github.com/LeonEricsson/llmjudge/blob/main/README.md)). For example, if asked to grade the quality of texts to which noise has been added on a consistent scale, the predicted grades do not reflect this scale.
435
+
436
+ Mitigations:
437
  ➡️ asking the model to explain its reasoning [before providing a score](https://twitter.com/seungonekim/status/1749289437165769177)
438
  ➡️ or providing a coherent grading scale in the prompt.
439
 
440
+ **Position-bias**.
441
+ Models tend to [favor specific answer positions](https://arxiv.org/abs/2306.05685). For example, when presented with pairwise comparisons, Claude and GPT3.5 each tend to systematically prefer either the first or the second choice. Mitigations:
442
  ➡️ switching answer positions randomly
443
  ➡️ computing the log-probabilities of all possible choices to get a normalized answer
444
 
445
+ **Verbosity-bias** (or length-bias)
446
+ Models tend to like more verbose answers
447
  ➡️ You can mitigate this by [accounting for the answer difference in length](https://arxiv.org/abs/2404.04475)
448
 
449
+ **Debatable consistency [with human answers](https://arxiv.org/abs/2308.15812):**
450
+ <Sidenote> However, it's also [debatable if non-expert humans are a good baseline for absolutely all evaluations](https://arxiv.org/abs/2202.06935). For some specific domains (medical, legal, mathematics, etc), relying on non-expert human annotators is as bad a baseline as using an LLM directly.</Sidenote>
451
+
452
+ **Format bias**
453
+ Models tend to fail to evaluate accurately if the prompt format [is too far away](https://arxiv.org/abs/2310.17631) from what they've been trained with. For example, a model trained to do pairwise comparisons with an added reference answer will fail if said answer is not provided, and vice versa.
454
  ➡️ You can mitigate this by paying attention to the training prompt format (if the model was instruction tuned) and ensuring you follow it.
455
  </Note>
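
Several of these mitigations are cheap to wire together. Below is a hypothetical sketch combining self-consistency (majority vote over repeated judgments) with random position-swapping for pairwise judging; `call_judge` is a placeholder for whatever judge API or local model you actually use.

```python
import random
from collections import Counter

def call_judge(prompt: str) -> str:
    """Placeholder for your judge call (API or local model); returns the raw reply."""
    raise NotImplementedError

def judge_pair(question: str, answer_a: str, answer_b: str,
               n_samples: int = 5, rng: random.Random | None = None) -> str:
    rng = rng or random.Random(0)
    votes = []
    for _ in range(n_samples):
        # Position-bias mitigation: randomly swap which answer is shown first.
        swapped = rng.random() < 0.5
        first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
        prompt = (
            f"Question: {question}\n"
            f"Answer 1: {first}\nAnswer 2: {second}\n"
            "Which answer is better? Reply with '1' or '2' only."
        )
        picked_first = call_judge(prompt).strip().startswith("1")
        # Map the judge's pick back to the original A/B labels.
        votes.append("B" if picked_first == swapped else "A")
    # Self-consistency: keep the majority verdict over the sampled judgments.
    return Counter(votes).most_common(1)[0][0]
```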
456
 
 
563
  ### Cost and efficiency
564
 
565
 
566
+ When designing and reporting evaluation results, we need to start collectively reporting results against model running costs! A reasoning model which requires 10 minutes of thinking and 10K tokens to answer 10 + 20 (because it decides to make an entire segue on binary vs decimal arithmetics) is considerably less efficient than a smol model answering 30 in a handful of tokens.
567
 
568
+ <img src={envImage.src} alt="Environmental impact metrics for model evaluation" style="max-width: 400px; height: auto; display: block; margin: 0 auto;" />
 
 
569
 
570
  We suggest you report the following:
571
  - **Token consumption**: Report the total number of output tokens used during evaluation. This is particularly important to estimate **efficiency**, and it will affect the cost of model as judge evaluations. Token counts directly impact monetary costs and help others estimate the computational requirements. **Monetary cost** can also be a good proxy for efficiency.
app/src/content/chapters/general-knowledge/2025-evaluations-for-useful-models.mdx CHANGED
@@ -5,11 +5,7 @@ title: "2025 evaluations"
5
  import Note from "../../../components/Note.astro";
6
  import Sidenote from "../../../components/Sidenote.astro";
7
 
8
- You can evaluate **specific capabilities** on their own - it's usually quite interesting to get signal when training, or when comparing base/pretrained models. (However, if you select and validate your training methods with the following evaluations, reporting on them on the final model is slightly biased as you have already oriented your training method towards good results on them).
9
-
10
- <Note>
11
- Feel free to skim this section if you're not very familiar with evaluation yet, and come back to it once you need to find a dataset for a specific capability :)
12
- </Note>
13
 
14
  #### Reasoning and commonsense
15
 
@@ -118,8 +114,6 @@ I believe that **assistant tasks** are going to be one of the main ways to do ne
118
 
119
  It was later replicated in [BrowseComp](https://cdn.openai.com/pdf/5e10f4ab-d6f7-442e-9508-59515c65e35d/browsecomp.pdf) (2025) which tests the same thing (can a model find the adequate answer to a specific query using tools and online information) but does not guarantee uniqueness of result, as questions were constructed by starting from the result and building a question from it, with varying levels of difficulty: for example, from a specific paper to retrieve, a question will be created by combining information about metadata, for example "which paper about Topic was published at Conference with one Nationality author and two people from Entity?" However, the benchmark is probably also harder at the moment.
120
 
121
- [**GDPval**](https://arxiv.org/abs/2510.04374) (2025) evaluates models on 44 occupations from the “top industries contributing to US GDP", comparing model performance with human performance using model judges.
122
-
123
  Lastly, [GAIA2](https://huggingface.co/blog/gaia2) went beyond simple information retrieval, using a mock-up mobile environment to test how well assistants are able to correctly answer queries relying on chains of events and tool calls. As of now, time-sensitive and deliberately noisy subsets (mocking up failing API calls) are the hardest for models, while search and execution seem extremely easy for SOTA models.
124
 
125
  **Science assistants**
@@ -132,6 +126,8 @@ Lastly, [GAIA2](https://huggingface.co/blog/gaia2) went beyond simple informatio
132
 
133
  [**DABStep**](https://arxiv.org/abs/2506.23719) (2025) evaluates models on previously private (therefore uncontaminated) operational data analysis workloads using real-life questions and data. All problems require multi-step reasoning and varied document parsing, as well of course as specific data manipulation skills. It's a neat eval because it's hard and replicates actually useful real-world use cases, and because each problem has a ground truth, so evaluation is unbiased and not too costly.
134
 
 
 
135
  Assistant tasks test integrated capabilities in realistic scenarios, but they're either dynamic and read only, or static in environment which doesn't change. To evaluate adaptability and dynamic decision-making, we need environments that can "surprise" the model.
136
 
137
  #### Game based evaluations
@@ -159,9 +155,7 @@ A similar approach is used to generate questions in [Arbitrage](https://arxiv.or
159
 
160
  In a similar vein, you'll also find arenas where LLMs are provided with money to actively trade on financial markets (like Alpha Arena or Trading Agents) - these experiments are less likely to give meaningful results, as, because of their costs, they tend to be run once per model only, so you get no statistical significance there.
161
 
162
- #### Recommendations
163
-
164
- <Note title="TLDR" emoji="🎯" variant="info">
165
  The landscape of evaluation has evolved with the jumps in capabilities, from testing isolated skills to measuring integrated performance in more realistic scenarios.
166
 
167
  As of Nov 2025, I recommend using:
@@ -176,8 +170,4 @@ The field is moving toward evaluations that test capability orchestration rather
176
  <Sidenote>
177
  I hope the field moves towards putting more emphasis on functional testing rather than model judges, and generally understandable datasets and tasks.
178
  </Sidenote>
179
- </Note>
180
-
181
- <Note>
182
- If you want to explore even more datasets, you'll find a big list of older interesting benchmarks [here](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/automated-benchmarks/some-evaluation-datasets.md) with my notes.
183
- </Note>
 
5
  import Note from "../../../components/Note.astro";
6
  import Sidenote from "../../../components/Sidenote.astro";
7
 
8
+ You can evaluate **specific capabilities** on their own - it's usually quite interesting to get signal when training, or when comparing base/pretrained models. (However, if you select and validate your training methods with the following evaluations, reporting on them on the final model is slightly biased as you have already oriented your training method towards good results on them).
 
 
 
 
9
 
10
  #### Reasoning and commonsense
11
 
 
114
 
115
  It was later replicated in [BrowseComp](https://cdn.openai.com/pdf/5e10f4ab-d6f7-442e-9508-59515c65e35d/browsecomp.pdf) (2025) which tests the same thing (can a model find the adequate answer to a specific query using tools and online information) but does not guarantee uniqueness of result, as questions were constructed by starting from the result and building a question from it, with varying levels of difficulty: for example, from a specific paper to retrieve, a question will be created by combining information about metadata, for example "which paper about Topic was published at Conference with one Nationality author and two people from Entity?" However, the benchmark is probably also harder at the moment.
116
 
 
 
117
  Lastly, [GAIA2](https://huggingface.co/blog/gaia2) went beyond simple information retrieval, using a mock-up mobile environment to test how well assistants are able to correctly answer queries relying on chains of events and tool calls. As of now, time-sensitive and deliberately noisy subsets (mocking up failing API calls) are the hardest for models, while search and execution seem extremely easy for SOTA models.
118
 
119
  **Science assistants**
 
126
 
127
  [**DABStep**](https://arxiv.org/abs/2506.23719) (2025) evaluates models on previously private (therefore uncontaminated) operational data analysis workloads using real-life questions and data. All problems require multi-step reasoning and varied document parsing, as well of course as specific data manipulation skills. It's a neat eval because it's hard and replicates actually useful real-world use cases, and because each problem has a ground truth, so evaluation is unbiased and not too costly.
128
 
129
+ [**GDPval**](https://arxiv.org/abs/2510.04374) (2025) evaluates models on 44 occupations from the “top industries contributing to US GDP”, comparing model performance with human performance using model judges.
130
+
131
  Assistant tasks test integrated capabilities in realistic scenarios, but they're either dynamic and read only, or static in environment which doesn't change. To evaluate adaptability and dynamic decision-making, we need environments that can "surprise" the model.
132
 
133
  #### Game based evaluations
 
155
 
156
  In a similar vein, you'll also find arenas where LLMs are provided with money to actively trade on financial markets (like Alpha Arena or Trading Agents) - these experiments are less likely to give meaningful results, as, because of their costs, they tend to be run once per model only, so you get no statistical significance there.
157
 
158
+ <Note title="TLDR" emoji="🎯">
 
 
159
  The landscape of evaluation has evolved with the jumps in capabilities, from testing isolated skills to measuring integrated performance in more realistic scenarios.
160
 
161
  As of Nov 2025, I recommend using:
 
170
  <Sidenote>
171
  I hope the field moves towards putting more emphasis on functional testing rather than model judges, and generally understandable datasets and tasks.
172
  </Sidenote>
173
+ </Note>
 
 
 
 
app/src/content/chapters/general-knowledge/model-inference-and-evaluation.mdx CHANGED
@@ -52,7 +52,7 @@ I would strongly recommend reading a longer explanation on how BPE works, as it'
52
  - [BPE Paper (for text, as the method existed before in other fields)](https://aclanthology.org/P16-1162/)
53
  </Note>
54
 
55
- Building a tokenizer requires making more choices than one would expect. For example, to tokenize numbers, you don't want to use a basic BPE, but do you only index 0 to 9, and assume all other numbers will be compositions of digits? Do you want to store numbers up to, say, one billion, individually?
56
 
57
  Current well known models display a range of approaches to this, but it's unclear what works better to allow mathematical reasoning. This will affect some mathematical evaluation (and is the reason why almost no evaluation is pure arithmetics).
58
 
@@ -67,9 +67,8 @@ Current well known models display a range of approaches to this, but it's unclea
67
 
68
  Pre-2022, models used to simply be pretrained: text in, text out, nothing else. Then, we got instruction tuning and chat models in 2023, and in 2025 reasoning models. This means that we went from using raw text to using more and more formatting.
69
 
70
- <Wide>
71
- <HtmlEmbed src="d3-tokenization-timeline.html" />
72
- </Wide>
73
 
74
  This means a number of models are going to perform terribly if you do not make sure to:
75
  1. respect the format the model expects
@@ -98,7 +97,7 @@ First, as some languages do not always use spacing as a word separator (Korean,
98
  Then, tokenizers in general might be unfair to non-English languages. When training a BPE tokenizer, you use data from the different languages you want to cover, but most of the time, though, this data is unbalanced between languages (with, for example, an order of magnitude more English than Thai, or Burmese). Since BPE tokenizers create their vocabulary tokens based on the most frequent words seen, most of the long tokens will be English words - and most of the words from the less frequent languages will only be split at the character level. This effect leads to an unfairness in multilingual tokenization: some (less frequent, or *lower-resourced*) languages require orders of magnitude more tokens to generate a sentence of equivalent length as English.
99
 
100
  <Wide>
101
- <Reference align="center" caption="Very nice demo by Yennie Jun on tokenization issues across languages">
102
  <iframe
103
  className="card"
104
  src="https://OpenEvals-tokenizers-languages.hf.space"
@@ -133,7 +132,7 @@ From this input text, the LLM generates a probability distribution of the most l
133
 
134
  **Generative evaluations**: Given a prompt, what text does my model generate?
135
 
136
- Choice depends on your task (as we'll see below) and on your model: most models under APIs do not return the logprobabilities, so you'll need to use generative evaluations systematically to evaluate them.
137
 
138
  </Note>
139
 
 
52
  - [BPE Paper (for text, as the method existed before in other fields)](https://aclanthology.org/P16-1162/)
53
  </Note>
54
 
55
+ When building a tokenizer, you have to make more choices than one would expect. For example, to tokenize numbers, you don't want to use a basic BPE, but do you only index 0 to 9, and assume all other numbers will be compositions of digits? Do you want to store numbers up to, say, one billion, individually?
56
 
57
  Current well known models display a range of approaches to this, but it's unclear what works better to allow mathematical reasoning. This will affect some mathematical evaluation (and is the reason why almost no evaluation is pure arithmetics).
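
You can inspect these design choices directly by tokenizing a few numbers yourself. The small sketch below uses gpt2 purely as an example of a public checkpoint; the exact splits you get will differ from tokenizer to tokenizer, which is the point.

```python
from transformers import AutoTokenizer

# Any public checkpoint works; gpt2 is used here only because it is small and ungated.
tok = AutoTokenizer.from_pretrained("gpt2")

for number in ["7", "42", "1234", "1234567", "3.14159"]:
    pieces = tok.tokenize(number)  # how the tokenizer splits the digits is a design decision
    print(f"{number!r:>12} -> {pieces}")
```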
58
 
 
67
 
68
  Pre-2022, models used to simply be pretrained: text in, text out, nothing else. Then, we got instruction tuning and chat models in 2023, and in 2025 reasoning models. This means that we went from using raw text to using more and more formatting.
69
 
70
+ <HtmlEmbed src="d3-tokenization-timeline.html" frameless />
71
+
 
72
 
73
  This means a number of models are going to perform terribly if you do not make sure to:
74
  1. respect the format the model expects
 
97
  Then, tokenizers in general might be unfair to non-English languages. When training a BPE tokenizer, you use data from the different languages you want to cover, but most of the time, though, this data is unbalanced between languages (with, for example, an order of magnitude more English than Thai, or Burmese). Since BPE tokenizers create their vocabulary tokens based on the most frequent words seen, most of the long tokens will be English words - and most of the words from the less frequent languages will only be split at the character level. This effect leads to an unfairness in multilingual tokenization: some (less frequent, or *lower-resourced*) languages require orders of magnitude more tokens to generate a sentence of equivalent length as English.
98
 
99
  <Wide>
100
+ <Reference align="center" caption="OpenEvals-tokenizers-languages">
101
  <iframe
102
  className="card"
103
  src="https://OpenEvals-tokenizers-languages.hf.space"
 
132
 
133
  **Generative evaluations**: Given a prompt, what text does my model generate?
134
 
135
+ Choice depends on your task: multiple-choice questions use log-likelihood, while open-ended tasks require generative evaluation.
136
 
137
  </Note>
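
As an illustration of the log-likelihood option, here is a rough sketch that scores each candidate answer by the log-probability a causal LM assigns to it given the prompt. The model name and question are placeholders; real harnesses additionally handle tokenization at the context/choice boundary, length normalization, and batching.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any causal LM would do for the sketch.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

def choice_loglikelihood(context: str, choice: str) -> float:
    """Sum of log-probs the model assigns to the `choice` tokens, given `context`."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    full_ids = tok(context + choice, return_tensors="pt").input_ids  # naive boundary handling
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probs at each position for predicting the *next* token.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    per_token = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    choice_len = full_ids.shape[1] - ctx_ids.shape[1]
    return per_token[0, -choice_len:].sum().item()

question = "Q: What is the capital of France?\nA:"
choices = [" Paris", " Lyon", " Marseille"]
scores = {c: choice_loglikelihood(question, c) for c in choices}
print(max(scores, key=scores.get))
```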
138
 
app/src/content/chapters/intro.mdx CHANGED
@@ -9,7 +9,7 @@ import Quote from "../../components/Quote.astro";
9
 
10
  ## What is model evaluation about?
11
 
12
- As you navigate the world of LLMs — whether you're training or fine-tuning your own models, selecting one for your application, or trying to understand the state of the field — there is one question you have likely stumbled upon:
13
 
14
  <Quote>
15
  How can one know if a model is *good*?
@@ -17,13 +17,11 @@ How can one know if a model is *good*?
17
 
18
  The answer is (surprisingly given the blog topic) evaluation! It's everywhere: leaderboards ranking models, benchmarks claiming to measure *reasoning*, *knowledge*, *coding abilities* or *math performance*, papers announcing new state-of-the-art results...
19
 
20
- But what is evaluation, really? And what can it really tell you?
21
 
22
- This guide is here to help you understand it all: what evaluation can and cannot do, when to trust different approaches (what their limitations and biases are too!), how to select benchmarks when evaluating a model (and which ones are relevant in 2025), and how to design your own evaluation, if you so want.
23
 
24
- Through the guide, we'll also highlight common pitfalls, tips and tricks from the Open Evals team, and hopefully help you learn how to think critically about the claims made from evaluation results.
25
-
26
- <Sidenote>In this guide, we focus on evaluations for language (mostly natural language), but many principles also apply to other modalities </Sidenote>
27
 
28
  Before we dive into the details, let's quickly look at why people do evaluation, as who you are and what you are working on will determine which evaluations you need to use.
29
 
@@ -32,11 +30,9 @@ Before we dive into the details, let's quickly look at why people do evaluation,
32
  If you are a researcher or engineer creating a new model, your goal is likely to build a strong model that performs well on a set of tasks. For a base model (training from scratch), you want the model to do well on general tasks, measuring a variety of different capabilities. If you are post-training a base model for a specific use case, you probably care more about the performance on that specific task. The way you measure performance, in either case, is through evaluations.
33
 
34
  As you experiment with different architectures, data mixtures, and training recipes, you want to make sure that your changes (choosing different training data, architecture, parameters, etc) have not "broken" the expected performance for a model of these properties, and possibly even improved it. The way you test for the impact of different design choices is through **ablations**: an ablation is an experiment where you typically train a model under a specific setup, evaluate it on your chosen set of tasks, and compare the results to a baseline model.
35
- Therefore, the choice of evaluation tasks is critical for ablations, as they determine what you will be optimizing for as you create your model.
36
-
37
- <HtmlEmbed src="d3-ablation-workflow.html" title="Ablation example"/>
38
 
39
- For base models, one would typically resort to selecting standard benchmark tasks used by other model builders (think the classic list of benchmarks that are always reported when a new model is released - we'll have a look at those below). For a specific use case, you can either use existing evaluation tasks if they are available -- and you likely will want to take a good look if they are not "standard" -- or design your own (discussed below). As you will likely run a lot of ablations, you want the evaluation tasks to provide strong enough signal (and not just meaningless noisy results) and you want them to run cheaply and quickly, so that you can iterate fast.
 
40
  Through ablations, we are also able to predict the performance of bigger models based on the performance of smaller ones, using scaling laws.
41
 
42
  Besides ablations for experiments, you will likely also want to run evaluations on intermediate checkpoints as your model is training, to ensure it is properly learning and improving at the different tasks, and does not start regressing due to spikes or other issues. Finally, you want to evaluate the final checkpoint so that you can announce that your model is SOTA when you release it.
@@ -61,11 +57,6 @@ In [their paper](https://arxiv.org/pdf/2404.02112) about lessons learned on benc
61
 
62
  Similarly to model builders hillclimbing a specific capability, for less common topics, you might need to think about designing your own evaluations, which is detailed in our last section.
63
 
64
- <Note title="Takeaways" emoji="🎯" variant="info">
65
- - Model builder: You need fast, high-signal benchmarks that cover the domains/capabilities you care about and can be run repeatedly during ablations.
66
- - Model user: You need benchmarks that match your specific use case, even if that means creating custom ones.
67
- </Note>
68
-
69
  <Note title="What about measuring AGI?">
70
  We are strongly missing any kind of good definitions and framework on what intelligence is for machine learning models, and how to evaluate it (though some people have tried, for example [Chollet](https://arxiv.org/abs/1911.01547) in 2019 and [Hendrycks et al](https://www.agidefinition.ai/paper.pdf) this year). Difficulty in defining intelligence is not a problem specific to machine learning! In human and animal studies, it is also quite hard to define, and metrics which try to provide precise scores (IQ and EQ for example) are hotly debated and controversial, with reason.
71
 
 
9
 
10
  ## What is model evaluation about?
11
 
12
+ As you navigate the world of LLMs — whether you're training or fine-tuning your own models, selecting one for your application, or trying to understand the state of the field — there is one question you have likely asked yourself:
13
 
14
  <Quote>
15
  How can one know if a model is *good*?
 
17
 
18
  The answer is (surprisingly given the blog topic) evaluation! It's everywhere: leaderboards ranking models, benchmarks claiming to measure *reasoning*, *knowledge*, *coding abilities* or *math performance*, papers announcing new state-of-the-art results...
19
 
20
+ But what is it, really? And what can it actually tell you?
21
 
22
+ This guide is here to help you understand evaluation: what it can and cannot do, when to trust different approaches (and what their limitations and biases are!), how to select benchmarks when evaluating a model (and which ones are relevant in 2025), and how to design your own evaluation, if you so want.
23
 
24
+ Throughout the guide, we'll also highlight common pitfalls and tips and tricks from the Open Evals team, and hopefully help you learn how to think critically about the claims made from evaluation results.
 
 
25
 
26
  Before we dive into the details, let's quickly look at why people do evaluation, as who you are and what you are working on will determine which evaluations you need to use.
27
 
 
30
  If you are a researcher or engineer creating a new model, your goal is likely to build a strong model that performs well on a set of tasks. For a base model (training from scratch), you want the model to do well on general tasks, measuring a variety of different capabilities. If you are post-training a base model for a specific use case, you probably care more about the performance on that specific task. The way you measure performance, in either case, is through evaluations.
31
 
32
  As you experiment with different architectures, data mixtures, and training recipes, you want to make sure that your changes (choosing different training data, architecture, parameters, etc) have not "broken" the expected performance for a model of these properties, and possibly even improved it. The way you test for the impact of different design choices is through **ablations**: an ablation is an experiment where you typically train a model under a specific setup, evaluate it on your chosen set of tasks, and compare the results to a baseline model.
 
 
 
33
 
34
+ Therefore, the choice of evaluation tasks is critical for ablations, as they determine what you will be optimizing for as you create your model.
35
+ For base models, one would typically resort to selecting standard benchmark tasks used by other model builders (think the classic list of benchmarks that are always reported when a new model is released). For a specific use case, you can either use existing evaluation tasks if they are available -- and you likely will want to take a good look if they are not "standard" -- or design your own (discussed below). As you will likely run a lot of ablations, you want the evaluation tasks to provide strong enough signal (and not just meaningless noisy results) and you want them to run cheaply and quickly, so that you can iterate fast.
36
  Through ablations, we are also able to predict the performance of bigger models based on the performance of smaller ones, using scaling laws.
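
As a toy illustration of that last point (all numbers below are made up, not a real scaling study): you fit a simple power law to losses measured in small-scale ablations and extrapolate it to a larger parameter count.

```python
import numpy as np
from scipy.optimize import curve_fit

# Made-up (parameter count, eval loss) pairs from hypothetical small-scale ablations.
params = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
losses = np.array([4.10, 3.72, 3.35, 3.05, 2.80])

def power_law(n, a, alpha, c):
    # L(N) = a * N^(-alpha) + c, a common functional form for scaling curves
    return a * n ** (-alpha) + c

(a, alpha, c), _ = curve_fit(power_law, params, losses, p0=(10.0, 0.1, 2.0), maxfev=10000)
print(f"fitted alpha={alpha:.3f}, irreducible loss c={c:.2f}")
print(f"predicted loss at 7B params: {power_law(7e9, a, alpha, c):.2f}")
```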
37
 
38
  Besides ablations for experiments, you will likely also want to run evaluations on intermediate checkpoints as your model is training, to ensure it is properly learning and improving at the different tasks, and does not start regressing due to spikes or other issues. Finally, you want to evaluate the final checkpoint so that you can announce that your model is SOTA when you release it.
 
57
 
58
  Similarly to model builders hillclimbing a specific capability, for less common topics, you might need to think about designing your own evaluations, which is detailed in our last section.
59
 
 
 
 
 
 
60
  <Note title="What about measuring AGI?">
61
  We are strongly missing any kind of good definitions and framework on what intelligence is for machine learning models, and how to evaluate it (though some people have tried, for example [Chollet](https://arxiv.org/abs/1911.01547) in 2019 and [Hendrycks et al](https://www.agidefinition.ai/paper.pdf) this year). Difficulty in defining intelligence is not a problem specific to machine learning! In human and animal studies, it is also quite hard to define, and metrics which try to provide precise scores (IQ and EQ for example) are hotly debated and controversial, with reason.
62
 
app/src/content/chapters/troubleshooting/troubleshooting-reproducibility.mdx CHANGED
@@ -60,7 +60,7 @@ We did some experiments on this (you'll see up to a 7 points difference for the
60
 
61
  *Evaluation on MMLU subsets, acc_norm score (seed 0), in 5-shot.*
62
 
63
- <HtmlEmbed frameless src="d3-mmlu-heatmap.html" />
64
 
65
  This [great paper](https://arxiv.org/abs/2407.07890)⭐ also highlights a side effect of this: a number of models are now trained to overfit benchmark prompts and answer formats, to the cost of adaptation to other prompts at evaluation time.
66
  </Note>
 
60
 
61
  *Evaluation on MMLU subsets, acc_norm score (seed 0), in 5-shot.*
62
 
63
+ <HtmlEmbed src="d3-mmlu-heatmap.html" />
64
 
65
  This [great paper](https://arxiv.org/abs/2407.07890)⭐ also highlights a side effect of this: a number of models are now trained to overfit benchmark prompts and answer formats, to the cost of adaptation to other prompts at evaluation time.
66
  </Note>
app/src/content/embeds/banner.html CHANGED
@@ -1,5 +1,4 @@
1
  <div class="d3-leaderboard-chart-wrapper" style="width:100%;margin:10px 0;padding:10px 5px 5px 5px;border-radius:8px;background:var(--surface-bg);border:1px solid var(--border-color);position:relative;">
2
- <h3 class="d3-chart-title" style="margin:10px 0 15px 15px;font-size:16px;font-weight:600;color:var(--text-color);opacity:0.9;white-space:nowrap;text-align:left;display:block;width:100%;">The benchmark lifecycle</h3>
3
  <div class="d3-leaderboard-chart" style="width:100%;aspect-ratio:2.8/1;min-height:320px;"></div>
4
  </div>
5
  <style>
@@ -312,8 +311,8 @@
312
  infoIcon = document.createElement('div');
313
  infoIcon.className = 'd3-info-icon';
314
  infoIcon.innerHTML = `
315
- <svg width="20" height="20" viewBox="0 0 20 20" fill="none" xmlns="http://www.w3.org/2000/svg">
316
- <path d="M8 6C8 4.89543 8.89543 4 10 4C11.1046 4 12 4.89543 12 6C12 7.10457 11.1046 8 10 8V10M10 14H10.01" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"/>
317
  </svg>
318
  `;
319
  wrapper.appendChild(infoIcon);
@@ -327,18 +326,17 @@
327
  <div style="font-weight: 600; margin-bottom: 10px; color: var(--text-color); font-size: 13px; text-align: left;">About this chart</div>
328
  <div style="color: var(--text-color); font-size: 12px; line-height: 1.6; text-align: left;">
329
  <p style="margin: 0 0 10px 0; text-align: left;">
330
- This visualization tracks the evolution of top benchmark scores over time across 3 leaderboards managed by Hugging Face
331
- through the years: the Open LLM Leaderboard 1, 2, and the GAIA leaderboard.
332
  The step-like lines represent the progression of maximum scores achieved for each benchmark, with circular markers
333
- indicating when a new record was set. It illustrates a phenomenon known as saturation.
334
  </p>
335
  <p style="margin: 0 0 10px 0; text-align: left;">
336
- The gray scatter plot in the background shows the average scores of all evaluated models for a given leaderboard
337
- at a given time, and allows to follow the trend of submission for each leaderboard.
338
  </p>
339
  <p style="margin: 0; text-align: left;">
340
  Benchmarks are grouped by category (Reasoning & Commonsense, Knowledge, Math, Agentic, and Instruction following),
341
- with each group sharing a color family.
342
  </p>
343
  </div>
344
  `;
@@ -736,6 +734,7 @@
736
  .attr('stroke-width', 1)
737
  .attr('stroke-dasharray', '2,2');
738
 
 
739
  // Line generator - courbe en escalier (step) pour afficher des seuils successifs
740
  // La ligne reste constante jusqu'au prochain point
741
  const line = d3.line()
@@ -771,8 +770,6 @@
771
  g.selectAll(`.legend-${displayName}`).style('opacity', 0.3);
772
  }
773
  });
774
- // Ghost aussi les nuages de points
775
- g.selectAll('.scatter-point').style('opacity', 0.1);
776
  };
777
 
778
  const resetHighlight = () => {
@@ -782,8 +779,6 @@
782
  const displayName = benchmark === 'MMLU_new' ? 'MMLU-Pro' : benchmark;
783
  g.selectAll(`.legend-${displayName}`).style('opacity', 1);
784
  });
785
- // Réinitialiser aussi les nuages de points
786
- g.selectAll('.scatter-point').style('opacity', 1);
787
  };
788
 
789
  // Ajouter le nuage de points EN PREMIER (en dessous de tout)
@@ -939,7 +934,7 @@
939
  .style('color', 'var(--text-color)')
940
  .style('opacity', '0.8')
941
  .style('margin-bottom', '8px')
942
- .text('Domains');
943
 
944
  const legendDiv = legendWrapper.append('xhtml:div')
945
  .style('display', 'flex')
@@ -1045,31 +1040,11 @@
1045
  .style('left', `${left}px`)
1046
  .style('top', `${top}px`);
1047
 
1048
- // Highlight TOUS les benchmarks du groupe en même temps
1049
- // D'abord, obtenir les clés de données pour tous les benchmarks du groupe
1050
- const groupBenchmarkKeys = group.benchmarks.map(benchmark => {
1051
- return benchmark === 'MMLU-Pro' ? 'MMLU_new' : benchmark;
1052
- });
1053
-
1054
- // Mettre en évidence tous les benchmarks du groupe
1055
- benchmarks.forEach(benchmark => {
1056
- const displayName = benchmark === 'MMLU_new' ? 'MMLU-Pro' : benchmark;
1057
- const isInGroup = groupBenchmarkKeys.includes(benchmark);
1058
-
1059
- if (isInGroup) {
1060
- // Mettre en évidence la ligne sélectionnée
1061
- g.selectAll(`.line-${benchmark}`).style('opacity', 1).attr('stroke-width', 3);
1062
- g.selectAll(`.marker-${benchmark}`).style('opacity', 1);
1063
- g.selectAll(`.legend-${displayName}`).style('opacity', 1);
1064
- } else {
1065
- // Ghost les autres lignes
1066
- g.selectAll(`.line-${benchmark}`).style('opacity', 0.15);
1067
- g.selectAll(`.marker-${benchmark}`).style('opacity', 0.15);
1068
- g.selectAll(`.legend-${displayName}`).style('opacity', 0.3);
1069
- }
1070
  });
1071
- // Ghost aussi les nuages de points
1072
- g.selectAll('.scatter-point').style('opacity', 0.1);
1073
  }).on('mouseleave', function() {
1074
  d3.select(legendTooltip).style('opacity', '0');
1075
  resetHighlight();
 
1
  <div class="d3-leaderboard-chart-wrapper" style="width:100%;margin:10px 0;padding:10px 5px 5px 5px;border-radius:8px;background:var(--surface-bg);border:1px solid var(--border-color);position:relative;">
 
2
  <div class="d3-leaderboard-chart" style="width:100%;aspect-ratio:2.8/1;min-height:320px;"></div>
3
  </div>
4
  <style>
 
311
  infoIcon = document.createElement('div');
312
  infoIcon.className = 'd3-info-icon';
313
  infoIcon.innerHTML = `
314
+ <svg width="16" height="16" viewBox="0 0 16 16" fill="none" xmlns="http://www.w3.org/2000/svg">
315
+ <path d="M8 6V8M8 10H8.01" stroke="currentColor" stroke-width="1.5" stroke-linecap="round"/>
316
  </svg>
317
  `;
318
  wrapper.appendChild(infoIcon);
 
326
  <div style="font-weight: 600; margin-bottom: 10px; color: var(--text-color); font-size: 13px; text-align: left;">About this chart</div>
327
  <div style="color: var(--text-color); font-size: 12px; line-height: 1.6; text-align: left;">
328
  <p style="margin: 0 0 10px 0; text-align: left;">
329
+ This visualization tracks the evolution of top benchmark scores over time across multiple evaluation frameworks.
 
330
  The step-like lines represent the progression of maximum scores achieved for each benchmark, with circular markers
331
+ indicating when a new record was set.
332
  </p>
333
  <p style="margin: 0 0 10px 0; text-align: left;">
334
+ The gray scatter plot in the background shows the average scores of all evaluated models, providing context for
335
+ the top performers. Each point represents a model's average performance across all benchmarks at a given time.
336
  </p>
337
  <p style="margin: 0; text-align: left;">
338
  Benchmarks are grouped by category (Reasoning & Commonsense, Knowledge, Math, Agentic, and Instruction following),
339
+ with each group sharing a color family. Variations within a group use different shades of the same base color.
340
  </p>
341
  </div>
342
  `;
 
734
  .attr('stroke-width', 1)
735
  .attr('stroke-dasharray', '2,2');
736
 
737
+
738
  // Line generator - courbe en escalier (step) pour afficher des seuils successifs
739
  // La ligne reste constante jusqu'au prochain point
740
  const line = d3.line()
 
770
  g.selectAll(`.legend-${displayName}`).style('opacity', 0.3);
771
  }
772
  });
773
  };
774
 
775
  const resetHighlight = () => {
 
779
  const displayName = benchmark === 'MMLU_new' ? 'MMLU-Pro' : benchmark;
780
  g.selectAll(`.legend-${displayName}`).style('opacity', 1);
781
  });
 
  };
783
 
784
  // Ajouter le nuage de points EN PREMIER (en dessous de tout)
 
934
  .style('color', 'var(--text-color)')
935
  .style('opacity', '0.8')
936
  .style('margin-bottom', '8px')
937
+ .text('Legend');
938
 
939
  const legendDiv = legendWrapper.append('xhtml:div')
940
  .style('display', 'flex')
 
1040
  .style('left', `${left}px`)
1041
  .style('top', `${top}px`);
1042
 
1043
+ // Highlight tous les benchmarks du groupe
1044
+ group.benchmarks.forEach(benchmark => {
1045
+ const displayName = benchmark;
1046
+ highlightBenchmark(displayName);
 
1047
  });
1048
  }).on('mouseleave', function() {
1049
  d3.select(legendTooltip).style('opacity', '0');
1050
  resetHighlight();
app/src/content/embeds/d3-ablation-workflow.html DELETED
@@ -1,474 +0,0 @@
1
- <div class="d3-ablation-workflow"></div>
2
-
3
- <style>
4
- .d3-ablation-workflow {
5
- font-family: var(--default-font-family);
6
- background: transparent;
7
- border: none;
8
- border-radius: 0;
9
- padding: var(--spacing-4) 0;
10
- width: 100%;
11
- margin: 0 auto;
12
- position: relative;
13
- }
14
-
15
- .d3-ablation-workflow svg {
16
- width: 100%;
17
- height: auto;
18
- display: block;
19
- }
20
-
21
- .d3-ablation-workflow .stage-box {
22
- stroke-width: 2;
23
- transition: all 0.3s ease;
24
- }
25
-
26
- .d3-ablation-workflow .stage-box:hover {
27
- filter: brightness(1.1);
28
- stroke-width: 3;
29
- }
30
-
31
- .d3-ablation-workflow .stage-label {
32
- fill: var(--text-color);
33
- font-size: 12px;
34
- font-weight: 700;
35
- pointer-events: none;
36
- user-select: none;
37
- text-transform: uppercase;
38
- letter-spacing: 0.05em;
39
- }
40
-
41
- .d3-ablation-workflow .item-label {
42
- fill: var(--text-color);
43
- font-size: 11px;
44
- font-weight: 600;
45
- pointer-events: none;
46
- user-select: none;
47
- }
48
-
49
- .d3-ablation-workflow .arrow-line {
50
- fill: none;
51
- stroke-width: 2;
52
- transition: all 0.3s ease;
53
- }
54
-
55
- .d3-ablation-workflow .marker {
56
- opacity: 0.7;
57
- }
58
-
59
- .d3-ablation-workflow .training-curve {
60
- fill: none;
61
- stroke-width: 2;
62
- transition: all 0.3s ease;
63
- }
64
-
65
- .d3-ablation-workflow .score-bar {
66
- transition: all 0.3s ease;
67
- }
68
-
69
- .d3-ablation-workflow .score-bar:hover {
70
- filter: brightness(1.15);
71
- }
72
-
73
- .d3-ablation-workflow .score-text {
74
- fill: var(--text-color);
75
- font-size: 10px;
76
- font-weight: 600;
77
- pointer-events: none;
78
- user-select: none;
79
- }
80
-
81
- .d3-ablation-workflow .axis-label {
82
- fill: var(--muted-color);
83
- font-size: 9px;
84
- font-weight: 500;
85
- pointer-events: none;
86
- user-select: none;
87
- }
88
-
89
- .d3-ablation-workflow .legend-text {
90
- font-size: 13px;
91
- line-height: 1.6;
92
- color: var(--text-color);
93
- text-align: center;
94
- margin-top: var(--spacing-3);
95
- padding: 0 var(--spacing-4);
96
- }
97
-
98
- .d3-ablation-workflow .d3-tooltip {
99
- position: absolute;
100
- background: var(--surface-bg);
101
- border: 1px solid var(--border-color);
102
- border-radius: 8px;
103
- padding: 8px 10px;
104
- font-size: 12px;
105
- pointer-events: none;
106
- opacity: 0;
107
- transition: opacity 0.12s ease;
108
- box-shadow: 0 4px 12px rgba(0, 0, 0, 0.15);
109
- z-index: 1000;
110
- max-width: 300px;
111
- line-height: 1.35;
112
- white-space: pre-line;
113
- color: var(--text-color);
114
- transform: translate(-9999px, -9999px);
115
- }
116
-
117
- @media (max-width: 768px) {
118
- .d3-ablation-workflow .stage-label {
119
- font-size: 10px;
120
- }
121
-
122
- .d3-ablation-workflow .item-label {
123
- font-size: 10px;
124
- }
125
-
126
- .d3-ablation-workflow .score-text {
127
- font-size: 9px;
128
- }
129
- }
130
- </style>
131
-
132
- <script>
133
- (() => {
134
- const ensureD3 = (cb) => {
135
- if (window.d3 && typeof window.d3.select === 'function') return cb();
136
- let s = document.getElementById('d3-cdn-script');
137
- if (!s) {
138
- s = document.createElement('script');
139
- s.id = 'd3-cdn-script';
140
- s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
141
- document.head.appendChild(s);
142
- }
143
- const onReady = () => {
144
- if (window.d3 && typeof window.d3.select === 'function') cb();
145
- };
146
- s.addEventListener('load', onReady, { once: true });
147
- if (window.d3) onReady();
148
- };
149
-
150
- const bootstrap = () => {
151
- const scriptEl = document.currentScript;
152
- let container = scriptEl ? scriptEl.previousElementSibling : null;
153
- if (!(container && container.classList && container.classList.contains('d3-ablation-workflow'))) {
154
- const candidates = Array.from(document.querySelectorAll('.d3-ablation-workflow'))
155
- .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
156
- container = candidates[candidates.length - 1] || null;
157
- }
158
-
159
- if (!container) return;
160
-
161
- if (container.dataset) {
162
- if (container.dataset.mounted === 'true') return;
163
- container.dataset.mounted = 'true';
164
- }
165
-
166
- container.style.position = container.style.position || 'relative';
167
-
168
- // Get colors from ColorPalettes or fallback
169
- const getColors = () => {
170
- if (window.ColorPalettes && typeof window.ColorPalettes.getColors === 'function') {
171
- return window.ColorPalettes.getColors('categorical', 3);
172
- }
173
- return ['#1f77b4', '#ff7f0e', '#2ca02c'];
174
- };
175
-
176
- // Data for two ablations: Wikipedia vs Reddit
177
- const ablations = [
178
- {
179
- id: 'wiki',
180
- name: 'Wikipedia',
181
- color_idx: 0,
182
- trainingData: [
183
- { step: 0, loss: 4.5 },
184
- { step: 1000, loss: 3.2 },
185
- { step: 2000, loss: 2.4 },
186
- { step: 3000, loss: 1.9 },
187
- { step: 4000, loss: 1.5 },
188
- { step: 5000, loss: 1.3 }
189
- ],
190
- finalScore: 72
191
- },
192
- {
193
- id: 'reddit',
194
- name: 'Reddit',
195
- color_idx: 1,
196
- trainingData: [
197
- { step: 0, loss: 4.5 },
198
- { step: 1000, loss: 3.5 },
199
- { step: 2000, loss: 2.8 },
200
- { step: 3000, loss: 2.3 },
201
- { step: 4000, loss: 2.0 },
202
- { step: 5000, loss: 1.8 }
203
- ],
204
- finalScore: 65
205
- }
206
- ];
207
-
208
- const svg = d3.select(container).append('svg');
209
- const g = svg.append('g');
210
-
211
- // Add legend text below the chart
212
- const legendDiv = document.createElement('div');
213
- legendDiv.className = 'legend-text';
214
- legendDiv.textContent = 'Say you want to compare dataset A and dataset B (for example, Wikipedia vs Reddit) to see how they affect model performance. You train models under the same setups on each, then evaluate and compare the scores on benchmarks.';
215
- container.appendChild(legendDiv);
216
-
217
- // Arrow markers
218
- const defs = svg.append('defs');
219
- getColors().forEach((color, i) => {
220
- defs.append('marker')
221
- .attr('id', `arrow-ablation-${i}`)
222
- .attr('viewBox', '0 -5 10 10')
223
- .attr('refX', 9)
224
- .attr('refY', 0)
225
- .attr('markerWidth', 10)
226
- .attr('markerHeight', 10)
227
- .attr('orient', 'auto')
228
- .append('path')
229
- .attr('d', 'M0,-5L10,0L0,5')
230
- .attr('fill', color)
231
- .attr('fill-opacity', 0.8);
232
- });
233
-
234
- // Big arrow marker for the right side
235
- defs.append('marker')
236
- .attr('id', 'arrow-big')
237
- .attr('viewBox', '0 -5 10 10')
238
- .attr('refX', 9)
239
- .attr('refY', 0)
240
- .attr('markerWidth', 10)
241
- .attr('markerHeight', 10)
242
- .attr('orient', 'auto')
243
- .append('path')
244
- .attr('d', 'M0,-5L10,0L0,5')
245
- .attr('fill', 'var(--primary-color)')
246
- .attr('fill-opacity', 0.8);
247
-
248
- let width = 800;
249
- let height = 400;
250
-
251
- // Icons as SVG paths
252
- const iconPaths = {
253
- database: 'M12 2C6.48 2 2 5.02 2 8.5V15.5C2 18.98 6.48 22 12 22C17.52 22 22 18.98 22 15.5V8.5C22 5.02 17.52 2 12 2ZM12 4C16.42 4 20 6.24 20 8.5C20 10.76 16.42 13 12 13C7.58 13 4 10.76 4 8.5C4 6.24 7.58 4 12 4ZM4 11.03C5.89 12.33 8.78 13 12 13C15.22 13 18.11 12.33 20 11.03V15.5C20 17.76 16.42 20 12 20C7.58 20 4 17.76 4 15.5V11.03Z',
254
- chart: 'M3 13h2v7H3v-7zm4-6h2v13H7V7zm4-4h2v17h-2V3zm4 8h2v9h-2v-9z'
255
- };
256
-
257
- // Function to draw a simple neural network schematic
258
- function drawModelSchematic(g, x, y, size, color) {
259
- const layers = [3, 4, 3]; // neurons per layer
260
- const layerSpacing = size / 3;
261
- const neuronRadius = size / 25;
262
-
263
- layers.forEach((neuronsCount, layerIdx) => {
264
- const layerX = x + layerIdx * layerSpacing;
265
- const neuronSpacing = size / (neuronsCount + 1);
266
-
267
- for (let i = 0; i < neuronsCount; i++) {
268
- const neuronY = y + (i + 1) * neuronSpacing;
269
-
270
- // Draw connections to next layer
271
- if (layerIdx < layers.length - 1) {
272
- const nextLayerX = x + (layerIdx + 1) * layerSpacing;
273
- const nextNeuronSpacing = size / (layers[layerIdx + 1] + 1);
274
-
275
- for (let j = 0; j < layers[layerIdx + 1]; j++) {
276
- const nextNeuronY = y + (j + 1) * nextNeuronSpacing;
277
- g.append('line')
278
- .attr('x1', layerX)
279
- .attr('y1', neuronY)
280
- .attr('x2', nextLayerX)
281
- .attr('y2', nextNeuronY)
282
- .attr('stroke', color)
283
- .attr('stroke-width', 0.5)
284
- .attr('opacity', 0.3);
285
- }
286
- }
287
-
288
- // Draw neuron
289
- g.append('circle')
290
- .attr('cx', layerX)
291
- .attr('cy', neuronY)
292
- .attr('r', neuronRadius)
293
- .attr('fill', color)
294
- .attr('opacity', 0.8);
295
- }
296
- });
297
- }
298
-
299
- function render() {
300
- width = container.clientWidth || 800;
301
- height = Math.max(300, Math.round(width * 0.45));
302
-
303
- svg.attr('width', width).attr('height', height);
304
-
305
- const margin = { top: 40, right: 20, bottom: 20, left: 20 };
306
- const innerWidth = width - margin.left - margin.right;
307
- const innerHeight = height - margin.top - margin.bottom;
308
-
309
- g.attr('transform', `translate(${margin.left},${margin.top})`);
310
-
311
- // Clear previous content
312
- g.selectAll('*').remove();
313
-
314
- const colors = getColors();
315
-
316
- // Three columns: Data, Training, Scores
317
- const colWidth = innerWidth / 3;
318
- const col1X = colWidth * 0.5;
319
- const col2X = colWidth * 1.5;
320
- const col3X = colWidth * 2.5;
321
-
322
- // Stage titles
323
- g.selectAll('.stage-label')
324
- .data([
325
- { x: col1X, label: 'DATA' },
326
- { x: col2X, label: 'TRAINING' },
327
- { x: col3X, label: 'EVALUATION' }
328
- ])
329
- .join('text')
330
- .attr('class', 'stage-label')
331
- .attr('x', d => d.x)
332
- .attr('y', -20)
333
- .attr('text-anchor', 'middle')
334
- .text(d => d.label);
335
-
336
- // Column 1: Data icons
337
- const dataY = innerHeight * 0.3;
338
- const dataSpacing = innerHeight * 0.35;
339
-
340
- ablations.forEach((abl, i) => {
341
- const y = dataY + i * dataSpacing;
342
- const iconSize = 30;
343
- const boxPadding = 10;
344
-
345
- // Data box
346
- const dataGroup = g.append('g')
347
- .attr('transform', `translate(${col1X - iconSize / 2 - boxPadding},${y - iconSize / 2 - boxPadding})`);
348
-
349
- dataGroup.append('rect')
350
- .attr('class', 'stage-box')
351
- .attr('width', iconSize + boxPadding * 2)
352
- .attr('height', iconSize + boxPadding * 2)
353
- .attr('rx', 8)
354
- .attr('fill', colors[abl.color_idx])
355
- .attr('fill-opacity', 0.15)
356
- .attr('stroke', colors[abl.color_idx]);
357
-
358
- // Database icon
359
- dataGroup.append('path')
360
- .attr('d', iconPaths.database)
361
- .attr('transform', `translate(${boxPadding},${boxPadding}) scale(${iconSize / 24})`)
362
- .attr('fill', colors[abl.color_idx]);
363
-
364
- // Label below
365
- g.append('text')
366
- .attr('class', 'item-label')
367
- .attr('x', col1X)
368
- .attr('y', y + iconSize + boxPadding + 15)
369
- .attr('text-anchor', 'middle')
370
- .attr('fill', colors[abl.color_idx])
371
- .text(abl.name);
372
- });
373
-
374
- // Column 2: Model schematics for training
375
- const modelSize = Math.min(80, colWidth * 0.4);
376
-
377
- ablations.forEach((abl, i) => {
378
- const y = dataY + i * dataSpacing;
379
- const modelX = col2X - modelSize / 2.5;
380
- const modelY = y - modelSize / 2;
381
-
382
- // Draw model schematic
383
- const modelGroup = g.append('g');
384
-
385
- drawModelSchematic(modelGroup, modelX, modelY, modelSize, colors[abl.color_idx]);
386
- });
387
-
388
- // Column 3: Final scores (bar chart)
389
- const barWidth = 40;
390
- const barMaxHeight = innerHeight * 0.6;
391
- const barY = innerHeight * 0.7;
392
-
393
- const scoreScale = d3.scaleLinear()
394
- .domain([0, 100])
395
- .range([0, barMaxHeight]);
396
-
397
- ablations.forEach((abl, i) => {
398
- const x = col3X - (ablations.length * barWidth) / 2 + i * barWidth + barWidth / 2;
399
- const barHeight = scoreScale(abl.finalScore);
400
-
401
- // Bar
402
- g.append('rect')
403
- .attr('class', 'score-bar')
404
- .attr('x', x - barWidth / 2 + 5)
405
- .attr('y', barY - barHeight)
406
- .attr('width', barWidth - 10)
407
- .attr('height', barHeight)
408
- .attr('rx', 4)
409
- .attr('fill', colors[abl.color_idx])
410
- .attr('fill-opacity', 0.7);
411
-
412
- // Score text
413
- g.append('text')
414
- .attr('class', 'score-text')
415
- .attr('x', x)
416
- .attr('y', barY - barHeight - 5)
417
- .attr('text-anchor', 'middle')
418
- .attr('fill', colors[abl.color_idx])
419
- .text(`${abl.finalScore}%`);
420
- });
421
-
422
- // Draw arrows connecting stages
423
- const iconSize = 30;
424
- const boxPadding = 10;
425
-
426
- // Left side: Individual arrows from data to models (with arrowheads)
427
- // Stop the arrows 15px before the model to avoid covering the neural net
428
- ablations.forEach((abl, i) => {
429
- const y = dataY + i * dataSpacing;
430
- const dataEndX = col1X + iconSize / 2 + boxPadding;
431
- const modelStartX = col2X - modelSize / 2 - 5;
432
-
433
- g.append('path')
434
- .attr('class', 'arrow-line')
435
- .attr('d', `M ${dataEndX} ${y} L ${modelStartX} ${y}`)
436
- .attr('stroke', colors[abl.color_idx])
437
- .attr('stroke-width', 3)
438
- .attr('stroke-opacity', 0.5)
439
- .attr('marker-end', `url(#arrow-ablation-${abl.color_idx})`);
440
- });
441
-
442
- // Right side: Single big arrow from training column to evaluation column
443
- const modelEndX = col2X + modelSize / 2;
444
- const evalStartX = col3X - (ablations.length * barWidth) / 2 - 20;
445
- const arrowY = (dataY + dataY + (ablations.length - 1) * dataSpacing) / 2; // Middle between all items
446
-
447
- g.append('path')
448
- .attr('class', 'arrow-line')
449
- .attr('d', `M ${modelEndX} ${arrowY} L ${evalStartX} ${arrowY}`)
450
- .attr('stroke', 'var(--primary-color)')
451
- .attr('stroke-width', 4)
452
- .attr('stroke-opacity', 0.6)
453
- .attr('marker-end', 'url(#arrow-big)');
454
-
455
- }
456
-
457
- render();
458
-
459
- // Responsive handling
460
- if (window.ResizeObserver) {
461
- const ro = new ResizeObserver(() => render());
462
- ro.observe(container);
463
- } else {
464
- window.addEventListener('resize', render);
465
- }
466
- };
467
-
468
- if (document.readyState === 'loading') {
469
- document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
470
- } else {
471
- ensureD3(bootstrap);
472
- }
473
- })();
474
- </script>
app/src/content/embeds/d3-human-biases.html DELETED
@@ -1,352 +0,0 @@
1
- <div class="d3-human-biases"></div>
2
-
3
- <style>
4
- .d3-human-biases {
5
- font-family: var(--default-font-family);
6
- background: transparent !important;
7
- border: none !important;
8
- border-radius: 0 !important;
9
- padding: var(--spacing-4) 0;
10
- width: 100%;
11
- margin: 0 auto;
12
- position: relative;
13
- box-shadow: none !important;
14
- }
15
-
16
- .d3-human-biases svg {
17
- width: 100%;
18
- height: auto;
19
- display: block;
20
- }
21
-
22
- .d3-human-biases .card-rect {
23
- stroke-width: 2;
24
- transition: all 0.3s ease;
25
- }
26
-
27
- .d3-human-biases .bias-title {
28
- fill: var(--text-color);
29
- font-size: 12px;
30
- font-weight: 700;
31
- }
32
-
33
- .d3-human-biases .bias-description {
34
- fill: var(--text-color);
35
- font-size: 10px;
36
- font-weight: 400;
37
- line-height: 1.4;
38
- }
39
-
40
- .d3-human-biases .header-text {
41
- fill: var(--text-color);
42
- font-size: 12px;
43
- font-weight: 700;
44
- text-transform: uppercase;
45
- letter-spacing: 0.05em;
46
- }
47
-
48
- .d3-human-biases .example-label {
49
- fill: var(--muted-color);
50
- font-size: 9px;
51
- font-weight: 600;
52
- text-transform: uppercase;
53
- letter-spacing: 0.05em;
54
- }
55
-
56
- @media (max-width: 768px) {
57
- .d3-human-biases .bias-title {
58
- font-size: 10px;
59
- }
60
-
61
- .d3-human-biases .bias-description {
62
- font-size: 9px;
63
- }
64
- }
65
- </style>
66
-
67
- <script>
68
- (() => {
69
- const ensureD3 = (cb) => {
70
- if (window.d3 && typeof window.d3.select === 'function') return cb();
71
- let s = document.getElementById('d3-cdn-script');
72
- if (!s) {
73
- s = document.createElement('script');
74
- s.id = 'd3-cdn-script';
75
- s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
76
- document.head.appendChild(s);
77
- }
78
- const onReady = () => {
79
- if (window.d3 && typeof window.d3.select === 'function') cb();
80
- };
81
- s.addEventListener('load', onReady, { once: true });
82
- if (window.d3) onReady();
83
- };
84
-
85
- const bootstrap = () => {
86
- const scriptEl = document.currentScript;
87
- let container = scriptEl ? scriptEl.previousElementSibling : null;
88
- if (!(container && container.classList && container.classList.contains('d3-human-biases'))) {
89
- const candidates = Array.from(document.querySelectorAll('.d3-human-biases'))
90
- .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
91
- container = candidates[candidates.length - 1] || null;
92
- }
93
-
94
- if (!container) return;
95
-
96
- if (container.dataset) {
97
- if (container.dataset.mounted === 'true') return;
98
- container.dataset.mounted = 'true';
99
- }
100
-
101
- // Get colors from ColorPalettes or fallback
102
- const getColors = () => {
103
- if (window.ColorPalettes && typeof window.ColorPalettes.getColors === 'function') {
104
- return window.ColorPalettes.getColors('categorical', 4);
105
- }
106
- return ['#e74c3c', '#3498db', '#9b59b6', '#f39c12'];
107
- };
108
-
109
- // Human evaluation biases
110
- const biases = [
111
- {
112
- id: 'first-impression',
113
- title: 'First Impressions',
114
- description: 'Quality estimated from first impressions rather than actual content',
115
- example: 'Well-formatted answer rated higher despite errors',
116
- reference: 'arxiv.org/abs/2309.16349'
117
- },
118
- {
119
- id: 'tone',
120
- title: 'Tone Bias',
121
- description: 'Underestimation of the number of factual or logical errors in an assertive answer',
122
- example: 'Assertive wrong answer > Neutral correct answer',
123
- reference: 'arxiv.org/abs/2309.16349'
124
- },
125
- {
126
- id: 'self-preference',
127
- title: 'Self-Preference',
128
- description: 'Preference for answers aligning with own views, opinons and beliefs',
129
- example: 'Personal beliefs > Factual correctness',
130
- reference: 'arxiv.org/abs/2310.13548'
131
- },
132
- {
133
- id: 'identity',
134
- title: 'Identity Bias',
135
- description: 'Different identity groups rate answers differently',
136
- example: 'Varied toxicity ratings across demographics',
137
- reference: 'arxiv.org/abs/2205.00501',
138
- reference2: 'arxiv.org/abs/2404.16019'
139
- }
140
- ];
141
-
142
- const svg = d3.select(container).append('svg');
143
- const g = svg.append('g');
144
-
145
- let width = 800;
146
- let height = 300;
147
-
148
- // Helper function to wrap text
149
- function wrapText(text, width) {
150
- text.each(function() {
151
- const text = d3.select(this);
152
- const words = text.text().split(/\s+/).reverse();
153
- let word;
154
- let line = [];
155
- let lineNumber = 0;
156
- const lineHeight = 1.3;
157
- const y = text.attr('y');
158
- const x = text.attr('x');
159
- const dy = parseFloat(text.attr('dy') || 0);
160
- let tspan = text.text(null).append('tspan')
161
- .attr('x', x)
162
- .attr('y', y)
163
- .attr('dy', dy + 'em');
164
-
165
- while ((word = words.pop())) {
166
- line.push(word);
167
- tspan.text(line.join(' '));
168
- if (tspan.node().getComputedTextLength() > width) {
169
- line.pop();
170
- tspan.text(line.join(' '));
171
- line = [word];
172
- tspan = text.append('tspan')
173
- .attr('x', x)
174
- .attr('y', y)
175
- .attr('dy', ++lineNumber * lineHeight + dy + 'em')
176
- .text(word);
177
- }
178
- }
179
- });
180
- }
181
-
182
- function render() {
183
- width = container.clientWidth || 800;
184
- height = Math.max(320, Math.round(width * 0.4));
185
-
186
- svg.attr('width', width).attr('height', height);
187
-
188
- const margin = { top: 40, right: 20, bottom: 20, left: 20 };
189
- const innerWidth = width - margin.left - margin.right;
190
- const innerHeight = height - margin.top - margin.bottom;
191
-
192
- g.attr('transform', `translate(${margin.left},${margin.top})`);
193
-
194
- // Clear previous content
195
- g.selectAll('*').remove();
196
-
197
- const colors = getColors();
198
-
199
- // Header
200
- g.append('text')
201
- .attr('class', 'header-text')
202
- .attr('x', innerWidth / 2)
203
- .attr('y', -15)
204
- .attr('text-anchor', 'middle')
205
- .text('HUMAN EVALUATION BIASES');
206
-
207
- // Calculate card dimensions - 2x2 grid
208
- const cols = 2;
209
- const rows = 2;
210
- const cardSpacingX = Math.min(20, innerWidth * 0.03);
211
- const cardSpacingY = Math.min(15, innerHeight * 0.05);
212
- const cardWidth = (innerWidth - cardSpacingX * (cols - 1)) / cols;
213
- const cardHeight = (innerHeight - cardSpacingY * (rows - 1)) / rows;
214
-
215
- // Draw cards in 2x2 grid
216
- biases.forEach((bias, i) => {
217
- const col = i % cols;
218
- const row = Math.floor(i / cols);
219
- const x = col * (cardWidth + cardSpacingX);
220
- const y = row * (cardHeight + cardSpacingY);
221
-
222
- const cardGroup = g.append('g')
223
- .attr('transform', `translate(${x},${y})`);
224
-
225
- // Card background with frame
226
- cardGroup.append('rect')
227
- .attr('class', 'card-rect')
228
- .attr('width', cardWidth)
229
- .attr('height', cardHeight)
230
- .attr('rx', 12)
231
- .attr('fill', colors[i])
232
- .attr('fill-opacity', 0.12)
233
- .attr('stroke', colors[i])
234
- .attr('stroke-opacity', 0.6)
235
- .attr('stroke-width', 2);
236
-
237
- // Title
238
- cardGroup.append('text')
239
- .attr('class', 'bias-title')
240
- .attr('x', cardWidth / 2)
241
- .attr('y', 20)
242
- .attr('text-anchor', 'middle')
243
- .text(bias.title);
244
-
245
- // Description with wrapping
246
- const descText = cardGroup.append('text')
247
- .attr('class', 'bias-description')
248
- .attr('x', cardWidth / 2)
249
- .attr('y', 38)
250
- .attr('text-anchor', 'middle')
251
- .attr('dy', 0)
252
- .text(bias.description);
253
-
254
- wrapText(descText, cardWidth - 20);
255
-
256
- // Example box
257
- const exampleY = cardHeight - 52;
258
- const exampleHeight = 22;
259
-
260
- cardGroup.append('rect')
261
- .attr('x', 8)
262
- .attr('y', exampleY)
263
- .attr('width', cardWidth - 16)
264
- .attr('height', exampleHeight)
265
- .attr('rx', 4)
266
- .attr('fill', colors[i])
267
- .attr('fill-opacity', 0.15)
268
- .attr('stroke', colors[i])
269
- .attr('stroke-width', 1)
270
- .attr('stroke-opacity', 0.4);
271
-
272
- // Example text
273
- const exampleText = cardGroup.append('text')
274
- .attr('class', 'bias-description')
275
- .attr('x', cardWidth / 2)
276
- .attr('y', exampleY + 13)
277
- .attr('text-anchor', 'middle')
278
- .attr('dominant-baseline', 'middle')
279
- .attr('font-size', 9)
280
- .text(bias.example);
281
-
282
- // Reference links (if exist)
283
- if (bias.reference) {
284
- const refLink1 = cardGroup.append('a')
285
- .attr('href', `https://${bias.reference}`)
286
- .attr('target', '_blank')
287
- .attr('rel', 'noopener noreferrer');
288
-
289
- refLink1.append('text')
290
- .attr('class', 'example-label')
291
- .attr('x', cardWidth - 10)
292
- .attr('y', bias.reference2 ? cardHeight - 18 : cardHeight - 8)
293
- .attr('text-anchor', 'end')
294
- .attr('font-size', 8)
295
- .attr('fill', colors[i])
296
- .attr('opacity', 0.7)
297
- .style('cursor', 'pointer')
298
- .style('text-decoration', 'underline')
299
- .text(bias.reference)
300
- .on('mouseenter', function() {
301
- d3.select(this).attr('opacity', 1);
302
- })
303
- .on('mouseleave', function() {
304
- d3.select(this).attr('opacity', 0.7);
305
- });
306
- }
307
-
308
- if (bias.reference2) {
309
- const refLink2 = cardGroup.append('a')
310
- .attr('href', `https://${bias.reference2}`)
311
- .attr('target', '_blank')
312
- .attr('rel', 'noopener noreferrer');
313
-
314
- refLink2.append('text')
315
- .attr('class', 'example-label')
316
- .attr('x', cardWidth - 10)
317
- .attr('y', cardHeight - 8)
318
- .attr('text-anchor', 'end')
319
- .attr('font-size', 8)
320
- .attr('fill', colors[i])
321
- .attr('opacity', 0.7)
322
- .style('cursor', 'pointer')
323
- .style('text-decoration', 'underline')
324
- .text(bias.reference2)
325
- .on('mouseenter', function() {
326
- d3.select(this).attr('opacity', 1);
327
- })
328
- .on('mouseleave', function() {
329
- d3.select(this).attr('opacity', 0.7);
330
- });
331
- }
332
- });
333
- }
334
-
335
- render();
336
-
337
- // Responsive handling
338
- if (window.ResizeObserver) {
339
- const ro = new ResizeObserver(() => render());
340
- ro.observe(container);
341
- } else {
342
- window.addEventListener('resize', render);
343
- }
344
- };
345
-
346
- if (document.readyState === 'loading') {
347
- document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
348
- } else {
349
- ensureD3(bootstrap);
350
- }
351
- })();
352
- </script>
app/src/content/embeds/d3-llm-biases.html DELETED
@@ -1,378 +0,0 @@
1
- <div class="d3-llm-biases"></div>
2
-
3
- <style>
4
- .d3-llm-biases {
5
- font-family: var(--default-font-family);
6
- background: transparent !important;
7
- border: none !important;
8
- border-radius: 0 !important;
9
- padding: var(--spacing-4) 0;
10
- width: 100%;
11
- margin: 0 auto;
12
- position: relative;
13
- box-shadow: none !important;
14
- }
15
-
16
- .d3-llm-biases svg {
17
- width: 100%;
18
- height: auto;
19
- display: block;
20
- }
21
-
22
- .d3-llm-biases .card-rect {
23
- stroke-width: 2;
24
- transition: all 0.3s ease;
25
- }
26
-
27
- .d3-llm-biases .bias-title {
28
- fill: var(--text-color);
29
- font-size: 12px;
30
- font-weight: 700;
31
- }
32
-
33
- .d3-llm-biases .bias-description {
34
- fill: var(--text-color);
35
- font-size: 10px;
36
- font-weight: 400;
37
- line-height: 1.4;
38
- }
39
-
40
- .d3-llm-biases .header-text {
41
- fill: var(--text-color);
42
- font-size: 12px;
43
- font-weight: 700;
44
- text-transform: uppercase;
45
- letter-spacing: 0.05em;
46
- }
47
-
48
- .d3-llm-biases .example-label {
49
- fill: var(--muted-color);
50
- font-size: 9px;
51
- font-weight: 600;
52
- text-transform: uppercase;
53
- letter-spacing: 0.05em;
54
- }
55
-
56
- @media (max-width: 768px) {
57
- .d3-llm-biases .bias-title {
58
- font-size: 10px;
59
- }
60
-
61
- .d3-llm-biases .bias-description {
62
- font-size: 9px;
63
- }
64
- }
65
- </style>
66
-
67
- <script>
68
- (() => {
69
- const ensureD3 = (cb) => {
70
- if (window.d3 && typeof window.d3.select === 'function') return cb();
71
- let s = document.getElementById('d3-cdn-script');
72
- if (!s) {
73
- s = document.createElement('script');
74
- s.id = 'd3-cdn-script';
75
- s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
76
- document.head.appendChild(s);
77
- }
78
- const onReady = () => {
79
- if (window.d3 && typeof window.d3.select === 'function') cb();
80
- };
81
- s.addEventListener('load', onReady, { once: true });
82
- if (window.d3) onReady();
83
- };
84
-
85
- const bootstrap = () => {
86
- const scriptEl = document.currentScript;
87
- let container = scriptEl ? scriptEl.previousElementSibling : null;
88
- if (!(container && container.classList && container.classList.contains('d3-llm-biases'))) {
89
- const candidates = Array.from(document.querySelectorAll('.d3-llm-biases'))
90
- .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
91
- container = candidates[candidates.length - 1] || null;
92
- }
93
-
94
- if (!container) return;
95
-
96
- if (container.dataset) {
97
- if (container.dataset.mounted === 'true') return;
98
- container.dataset.mounted = 'true';
99
- }
100
-
101
- // Get colors from ColorPalettes or fallback
102
- const getColors = () => {
103
- if (window.ColorPalettes && typeof window.ColorPalettes.getColors === 'function') {
104
- return window.ColorPalettes.getColors('categorical', 8);
105
- }
106
- return ['#e74c3c', '#3498db', '#9b59b6', '#f39c12', '#1abc9c', '#e67e22', '#95a5a6', '#34495e'];
107
- };
108
-
109
- // LLM judge biases - first 4 for row 1, remaining 3 for row 2
110
- const biases = [
111
- {
112
- id: 'internal-consistency',
113
- title: 'No Internal Consistency',
114
- description: 'Gives different judgements if prompted multiple times (at T>0)',
115
- reference: null
116
- }, {
117
- id: 'inconsistent-score-range',
118
- title: 'No Consistent Score Ranges',
119
- description: 'Model ranking do not follow a consistent scale (e.g: for a task where scores should be 1, 2, 3, 4, ... 10, the model might score 1, 1, 1, 10, 10 ... 10)',
120
- reference: 'x.com/aparnadhinak/status/1748368364395721128',
121
- reference2: 'github.com/LeonEricsson/llmjudge'
122
- },
123
-
124
- {
125
- id: 'self-preference',
126
- title: 'Self-Preference',
127
- description: 'Judge will favor outputs from similar models when scoring',
128
- reference: 'arxiv.org/abs/2404.13076'
129
- },
130
- {
131
- id: 'input-perturbation',
132
- title: 'Blindness to Input Perturbation',
133
- description: 'If input is perturbed, judges don\'t detect quality drops consistently',
134
- reference: 'arxiv.org/abs/2406.13439'
135
- },
136
- {
137
- id: 'position-bias',
138
- title: 'Position Bias',
139
- description: 'When comparing answers, judge favors specific answer positions (e.g: systematically prefers first or second choice)',
140
- reference: 'arxiv.org/abs/2306.05685'
141
- },
142
- {
143
- id: 'verbosity-bias',
144
- title: 'Verbosity Bias',
145
- description: 'Models prefer more verbose answers',
146
- reference: 'arxiv.org/abs/2404.04475'
147
- },
148
- {
149
- id: 'human-consistency',
150
- title: 'No Consistency With Human Scoring',
151
- description: 'LLM ratings diverge from human ratings',
152
- reference: 'arxiv.org/abs/2308.15812'
153
- },
154
- {
155
- id: 'format-bias',
156
- title: 'Format Bias',
157
- description: 'Judge can\'t judge well when their prompt differs from their training prompt format',
158
- reference: 'arxiv.org/abs/2310.17631'
159
- }
160
- ];
161
-
162
- const svg = d3.select(container).append('svg');
163
- const g = svg.append('g');
164
-
165
- let width = 800;
166
- let height = 300;
167
-
168
- // Helper function to wrap text
169
- function wrapText(text, width) {
170
- text.each(function() {
171
- const text = d3.select(this);
172
- const words = text.text().split(/\s+/).reverse();
173
- let word;
174
- let line = [];
175
- let lineNumber = 0;
176
- const lineHeight = 1.3;
177
- const y = text.attr('y');
178
- const x = text.attr('x');
179
- const dy = parseFloat(text.attr('dy') || 0);
180
- let tspan = text.text(null).append('tspan')
181
- .attr('x', x)
182
- .attr('y', y)
183
- .attr('dy', dy + 'em');
184
-
185
- while ((word = words.pop())) {
186
- line.push(word);
187
- tspan.text(line.join(' '));
188
- if (tspan.node().getComputedTextLength() > width) {
189
- line.pop();
190
- tspan.text(line.join(' '));
191
- line = [word];
192
- tspan = text.append('tspan')
193
- .attr('x', x)
194
- .attr('y', y)
195
- .attr('dy', ++lineNumber * lineHeight + dy + 'em')
196
- .text(word);
197
- }
198
- }
199
- });
200
- }
201
-
202
- function render() {
203
- width = container.clientWidth || 800;
204
- height = Math.max(550, Math.round(width * 0.7));
205
-
206
- svg.attr('width', width).attr('height', height);
207
-
208
- const margin = { top: 40, right: 20, bottom: 20, left: 20 };
209
- const innerWidth = width - margin.left - margin.right;
210
- const innerHeight = height - margin.top - margin.bottom;
211
-
212
- g.attr('transform', `translate(${margin.left},${margin.top})`);
213
-
214
- // Clear previous content
215
- g.selectAll('*').remove();
216
-
217
- const colors = getColors();
218
-
219
- // Header
220
- g.append('text')
221
- .attr('class', 'header-text')
222
- .attr('x', innerWidth / 2)
223
- .attr('y', -15)
224
- .attr('text-anchor', 'middle')
225
- .text('LLM JUDGE BIASES');
226
-
227
- // Calculate card dimensions - 4 rows: 2 cards each
228
- const cols = 2;
229
- const rows = 4;
230
- const cardSpacingX = Math.min(20, innerWidth * 0.03);
231
- const cardSpacingY = Math.min(18, innerHeight * 0.04);
232
- const cardWidth = (innerWidth - cardSpacingX * (cols - 1)) / cols;
233
- const cardHeight = (innerHeight - cardSpacingY * (rows - 1)) / rows;
234
-
235
- // Draw cards in 4 rows (2 + 2 + 2 + 2)
236
- biases.forEach((bias, i) => {
237
- const row = Math.floor(i / 2);
238
- const col = i % 2;
239
-
240
- const x = col * (cardWidth + cardSpacingX);
241
- const y = row * (cardHeight + cardSpacingY);
242
-
243
- const cardGroup = g.append('g')
244
- .attr('transform', `translate(${x},${y})`);
245
-
246
- // Card background with frame
247
- cardGroup.append('rect')
248
- .attr('class', 'card-rect')
249
- .attr('width', cardWidth)
250
- .attr('height', cardHeight)
251
- .attr('rx', 12)
252
- .attr('fill', colors[i])
253
- .attr('fill-opacity', 0.12)
254
- .attr('stroke', colors[i])
255
- .attr('stroke-opacity', 0.6)
256
- .attr('stroke-width', 2);
257
-
258
- // Title
259
- cardGroup.append('text')
260
- .attr('class', 'bias-title')
261
- .attr('x', cardWidth / 2)
262
- .attr('y', 20)
263
- .attr('text-anchor', 'middle')
264
- .text(bias.title);
265
-
266
- // Description with wrapping
267
- const descText = cardGroup.append('text')
268
- .attr('class', 'bias-description')
269
- .attr('x', cardWidth / 2)
270
- .attr('y', 36)
271
- .attr('text-anchor', 'middle')
272
- .attr('dy', 0)
273
- .text(bias.description);
274
-
275
- wrapText(descText, cardWidth - 20);
276
-
277
- // Example box (only if there's an example)
278
- if (bias.example) {
279
- const exampleY = cardHeight - 55;
280
- const exampleHeight = 24;
281
-
282
- cardGroup.append('rect')
283
- .attr('x', 8)
284
- .attr('y', exampleY)
285
- .attr('width', cardWidth - 16)
286
- .attr('height', exampleHeight)
287
- .attr('rx', 4)
288
- .attr('fill', colors[i])
289
- .attr('fill-opacity', 0.15)
290
- .attr('stroke', colors[i])
291
- .attr('stroke-width', 1)
292
- .attr('stroke-opacity', 0.4);
293
-
294
- // Example text
295
- cardGroup.append('text')
296
- .attr('class', 'bias-description')
297
- .attr('x', cardWidth / 2)
298
- .attr('y', exampleY + 13)
299
- .attr('text-anchor', 'middle')
300
- .attr('dominant-baseline', 'middle')
301
- .attr('font-size', 9)
302
- .text(bias.example);
303
- }
304
-
305
- // Reference link (if exists)
306
- if (bias.reference) {
307
- const refY = bias.example ? cardHeight - 8 : cardHeight - 12;
308
- const refLink = cardGroup.append('a')
309
- .attr('href', `https://${bias.reference}`)
310
- .attr('target', '_blank')
311
- .attr('rel', 'noopener noreferrer');
312
-
313
- refLink.append('text')
314
- .attr('class', 'example-label')
315
- .attr('x', cardWidth - 10)
316
- .attr('y', bias.reference2 ? refY - 10 : refY)
317
- .attr('text-anchor', 'end')
318
- .attr('font-size', 8)
319
- .attr('fill', colors[i])
320
- .attr('opacity', 0.7)
321
- .style('cursor', 'pointer')
322
- .style('text-decoration', 'underline')
323
- .text(bias.reference)
324
- .on('mouseenter', function() {
325
- d3.select(this).attr('opacity', 1);
326
- })
327
- .on('mouseleave', function() {
328
- d3.select(this).attr('opacity', 0.7);
329
- });
330
- }
331
-
332
- // Second reference link (if exists)
333
- if (bias.reference2) {
334
- const refY = bias.example ? cardHeight - 8 : cardHeight - 12;
335
- const refLink2 = cardGroup.append('a')
336
- .attr('href', `https://${bias.reference2}`)
337
- .attr('target', '_blank')
338
- .attr('rel', 'noopener noreferrer');
339
-
340
- refLink2.append('text')
341
- .attr('class', 'example-label')
342
- .attr('x', cardWidth - 10)
343
- .attr('y', refY)
344
- .attr('text-anchor', 'end')
345
- .attr('font-size', 8)
346
- .attr('fill', colors[i])
347
- .attr('opacity', 0.7)
348
- .style('cursor', 'pointer')
349
- .style('text-decoration', 'underline')
350
- .text(bias.reference2)
351
- .on('mouseenter', function() {
352
- d3.select(this).attr('opacity', 1);
353
- })
354
- .on('mouseleave', function() {
355
- d3.select(this).attr('opacity', 0.7);
356
- });
357
- }
358
- });
359
- }
360
-
361
- render();
362
-
363
- // Responsive handling
364
- if (window.ResizeObserver) {
365
- const ro = new ResizeObserver(() => render());
366
- ro.observe(container);
367
- } else {
368
- window.addEventListener('resize', render);
369
- }
370
- };
371
-
372
- if (document.readyState === 'loading') {
373
- document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
374
- } else {
375
- ensureD3(bootstrap);
376
- }
377
- })();
378
- </script>
app/src/content/embeds/d3-mmlu-heatmap.html CHANGED
@@ -173,44 +173,30 @@
173
  [43.6, 48.9, 49.5, 51.0, 51.3, 52.0, 52.8, 52.3] // DeciLM-7B
174
  ];
175
 
176
- // Colors: diverging palette (purple for low, yellow for high)
177
  const getDivergingColors = (count) => {
178
  try {
179
  if (window.ColorPalettes && typeof window.ColorPalettes.getColors === 'function') {
180
  return window.ColorPalettes.getColors('diverging', count);
181
  }
182
  } catch (_) { }
183
- // Fallback: diverging scale from purple (low) to yellow (high)
184
  const colors = [];
185
  for (let i = 0; i < count; i++) {
186
  const t = i / (count - 1);
187
- // Purple (dark) -> lighter purple -> green -> yellow
188
- if (t < 0.25) {
189
- // Dark purple to medium purple
190
- const r = Math.round(75 + (t / 0.25) * 50);
191
- const g = Math.round(0 + (t / 0.25) * 30);
192
- const b = Math.round(130 + (t / 0.25) * 50);
193
- colors.push(`rgb(${r}, ${g}, ${b})`);
194
- } else if (t < 0.5) {
195
- // Purple to blue-green
196
- const t2 = (t - 0.25) / 0.25;
197
- const r = Math.round(125 - t2 * 75);
198
- const g = Math.round(30 + t2 * 100);
199
- const b = Math.round(180 - t2 * 80);
200
- colors.push(`rgb(${r}, ${g}, ${b})`);
201
- } else if (t < 0.75) {
202
- // Blue-green to green
203
- const t2 = (t - 0.5) / 0.25;
204
- const r = Math.round(50 + t2 * 50);
205
- const g = Math.round(130 + t2 * 70);
206
- const b = Math.round(100 - t2 * 50);
207
  colors.push(`rgb(${r}, ${g}, ${b})`);
208
  } else {
209
- // Green to yellow
210
- const t2 = (t - 0.75) / 0.25;
211
- const r = Math.round(100 + t2 * 155);
212
- const g = Math.round(200 - t2 * 50);
213
- const b = Math.round(50 - t2 * 50);
214
  colors.push(`rgb(${r}, ${g}, ${b})`);
215
  }
216
  }
@@ -220,7 +206,7 @@
220
  const palette = getDivergingColors(10);
221
 
222
  let width = 900;
223
- const margin = { top: 10, right: 20, bottom: 20, left: 100 }; // Only left margin for model names
224
 
225
  function updateSize() {
226
  width = container.clientWidth || 900;
@@ -251,27 +237,8 @@
251
  }
252
 
253
  function getColorScale(values, minV, maxV) {
254
- const hasPalette = palette.length > 0;
255
- if (hasPalette && window.ColorPalettes && typeof window.ColorPalettes.getColors === 'function') {
256
- // Use quantile scale but with emphasis on extremes
257
- const sorted = [...values].sort((a, b) => a - b);
258
- const n = sorted.length;
259
- // Create custom quantiles that emphasize extremes
260
- const quantiles = [];
261
- for (let i = 0; i <= 10; i++) {
262
- const q = i / 10;
263
- // Apply a power transformation to emphasize extremes
264
- const transformedQ = q < 0.5
265
- ? Math.pow(q * 2, 1.5) / 2
266
- : 0.5 + Math.pow((q - 0.5) * 2, 1.5) / 2;
267
- const idx = Math.floor(transformedQ * (n - 1));
268
- quantiles.push(sorted[Math.min(idx, n - 1)]);
269
- }
270
- const scale = d3.scaleQuantile().domain(quantiles).range(palette);
271
- return (v) => scale(v);
272
- }
273
-
274
- // Fallback: non-linear scale that emphasizes extremes
275
  const linearScale = d3.scaleLinear()
276
  .domain([minV, maxV])
277
  .range([0, 1])
@@ -280,36 +247,30 @@
280
  return (v) => {
281
  const t = linearScale(v);
282
  // Apply power transformation to emphasize extremes
 
283
  let transformedT;
284
  if (t < 0.5) {
 
285
  transformedT = Math.pow(t * 2, 1.8) / 2;
286
  } else {
 
287
  transformedT = 0.5 + Math.pow((t - 0.5) * 2, 1.8) / 2;
288
  }
289
 
290
- // Purple (low) -> Green (mid) -> Yellow (high)
291
- if (transformedT < 0.25) {
292
- const r = Math.round(75 + (transformedT / 0.25) * 50);
293
- const g = Math.round(0 + (transformedT / 0.25) * 30);
294
- const b = Math.round(130 + (transformedT / 0.25) * 50);
295
- return `rgb(${r}, ${g}, ${b})`;
296
- } else if (transformedT < 0.5) {
297
- const t2 = (transformedT - 0.25) / 0.25;
298
- const r = Math.round(125 - t2 * 75);
299
- const g = Math.round(30 + t2 * 100);
300
- const b = Math.round(180 - t2 * 80);
301
- return `rgb(${r}, ${g}, ${b})`;
302
- } else if (transformedT < 0.75) {
303
- const t2 = (transformedT - 0.5) / 0.25;
304
- const r = Math.round(50 + t2 * 50);
305
- const g = Math.round(130 + t2 * 70);
306
- const b = Math.round(100 - t2 * 50);
307
  return `rgb(${r}, ${g}, ${b})`;
308
  } else {
309
- const t2 = (transformedT - 0.75) / 0.25;
310
- const r = Math.round(100 + t2 * 155);
311
- const g = Math.round(200 - t2 * 50);
312
- const b = Math.round(50 - t2 * 50);
 
313
  return `rgb(${r}, ${g}, ${b})`;
314
  }
315
  };
@@ -344,12 +305,12 @@
344
  const x = d3.scaleBand()
345
  .domain(d3.range(nCols))
346
  .range([0, gridWidth])
347
- .paddingInner(0.08);
348
 
349
  const y = d3.scaleBand()
350
  .domain(d3.range(nRows))
351
  .range([0, gridHeight])
352
- .paddingInner(0.08);
353
 
354
  // Flatten matrix data
355
  const flatData = [];
@@ -367,6 +328,20 @@
367
 
368
  gCells.attr('transform', `translate(${gridOffsetX}, ${gridOffsetY})`);
369
 
370
 
371
  const cells = gCells.selectAll('g.cell')
372
  .data(flatData, d => `${d.r}-${d.c}`);
@@ -376,8 +351,8 @@
376
  .attr('class', 'cell');
377
 
378
  cellsEnter.append('rect')
379
- .attr('rx', 3)
380
- .attr('ry', 3)
381
  .on('mousemove', (event, d) => {
382
  const [px, py] = d3.pointer(event, container);
383
  tipInner.innerHTML = `<strong>${d.model}</strong><br/>${d.format}<br/>Score: ${d.value.toFixed(1)}`;
@@ -400,9 +375,7 @@
400
  .attr('y', d => y(d.r))
401
  .attr('width', Math.max(1, x.bandwidth()))
402
  .attr('height', Math.max(1, y.bandwidth()))
403
- .attr('fill', d => colorScale(d.value))
404
- .attr('stroke', 'var(--border-color)')
405
- .attr('stroke-width', 0.5);
406
 
407
  cellsMerged.select('text')
408
  .attr('x', d => x(d.c) + x.bandwidth() / 2)
 
173
  [43.6, 48.9, 49.5, 51.0, 51.3, 52.0, 52.8, 52.3] // DeciLM-7B
174
  ];
175
 
176
+ // Colors: red to green palette (red for low, green for high)
177
  const getDivergingColors = (count) => {
178
  try {
179
  if (window.ColorPalettes && typeof window.ColorPalettes.getColors === 'function') {
180
  return window.ColorPalettes.getColors('diverging', count);
181
  }
182
  } catch (_) { }
183
+ // Fallback: red to green scale
184
  const colors = [];
185
  for (let i = 0; i < count; i++) {
186
  const t = i / (count - 1);
187
+ // Red (low) -> Yellow (mid) -> Green (high)
188
+ if (t < 0.5) {
189
+ // Red to yellow
190
+ const r = 255;
191
+ const g = Math.round(t * 2 * 255);
192
+ const b = 0;
 
193
  colors.push(`rgb(${r}, ${g}, ${b})`);
194
  } else {
195
+ // Yellow to green
196
+ const t2 = (t - 0.5) * 2;
197
+ const r = Math.round(255 - t2 * 255);
198
+ const g = 255;
199
+ const b = 0;
200
  colors.push(`rgb(${r}, ${g}, ${b})`);
201
  }
202
  }
 
206
  const palette = getDivergingColors(10);
207
 
208
  let width = 900;
209
+ const margin = { top: 0, right: 0, bottom: 0, left: 100 }; // Only left margin for model names
210
 
211
  function updateSize() {
212
  width = container.clientWidth || 900;
 
237
  }
238
 
239
  function getColorScale(values, minV, maxV) {
240
+ // Always use the custom red-to-green palette (fallback)
241
+ // Don't use ColorPalettes for this specific heatmap
 
242
  const linearScale = d3.scaleLinear()
243
  .domain([minV, maxV])
244
  .range([0, 1])
 
247
  return (v) => {
248
  const t = linearScale(v);
249
  // Apply power transformation to emphasize extremes
250
+ // Values near min/max get more extreme colors
251
  let transformedT;
252
  if (t < 0.5) {
253
+ // Compress lower values, making extremes more distinct
254
  transformedT = Math.pow(t * 2, 1.8) / 2;
255
  } else {
256
+ // Expand upper values, making extremes more distinct
257
  transformedT = 0.5 + Math.pow((t - 0.5) * 2, 1.8) / 2;
258
  }
259
 
260
+ // Red to green scale: red (low scores = bad) -> yellow (mid) -> green (high scores = good)
261
+ // Less flashy: reduce saturation
262
+ if (transformedT < 0.5) {
263
+ // Red to yellow (less saturated)
264
+ const r = 220;
265
+ const g = Math.round(80 + transformedT * 2 * 140);
266
+ const b = Math.round(60 + transformedT * 2 * 40);
267
  return `rgb(${r}, ${g}, ${b})`;
268
  } else {
269
+ // Yellow to green (less saturated)
270
+ const t2 = (transformedT - 0.5) * 2;
271
+ const r = Math.round(220 - t2 * 100);
272
+ const g = 220;
273
+ const b = Math.round(100 - t2 * 60);
274
  return `rgb(${r}, ${g}, ${b})`;
275
  }
276
  };
 
305
  const x = d3.scaleBand()
306
  .domain(d3.range(nCols))
307
  .range([0, gridWidth])
308
+ .paddingInner(0);
309
 
310
  const y = d3.scaleBand()
311
  .domain(d3.range(nRows))
312
  .range([0, gridHeight])
313
+ .paddingInner(0);
314
 
315
  // Flatten matrix data
316
  const flatData = [];
 
328
 
329
  gCells.attr('transform', `translate(${gridOffsetX}, ${gridOffsetY})`);
330
 
331
+ // Add rounded corners only on the outer edges of the matrix using clipPath
332
+ const cornerRadius = 6;
333
+ defs.selectAll('#matrix-clip').remove();
334
+ const clipPath = defs.append('clipPath')
335
+ .attr('id', 'matrix-clip');
336
+ clipPath.append('rect')
337
+ .attr('x', 0)
338
+ .attr('y', 0)
339
+ .attr('width', gridWidth)
340
+ .attr('height', gridHeight)
341
+ .attr('rx', cornerRadius)
342
+ .attr('ry', cornerRadius);
343
+
344
+ gCells.attr('clip-path', 'url(#matrix-clip)');
345
 
346
  const cells = gCells.selectAll('g.cell')
347
  .data(flatData, d => `${d.r}-${d.c}`);
 
351
  .attr('class', 'cell');
352
 
353
  cellsEnter.append('rect')
354
+ .attr('rx', 0)
355
+ .attr('ry', 0)
356
  .on('mousemove', (event, d) => {
357
  const [px, py] = d3.pointer(event, container);
358
  tipInner.innerHTML = `<strong>${d.model}</strong><br/>${d.format}<br/>Score: ${d.value.toFixed(1)}`;
 
375
  .attr('y', d => y(d.r))
376
  .attr('width', Math.max(1, x.bandwidth()))
377
  .attr('height', Math.max(1, y.bandwidth()))
378
+ .attr('fill', d => colorScale(d.value));
379
 
380
  cellsMerged.select('text')
381
  .attr('x', d => x(d.c) + x.bandwidth() / 2)
app/src/content/embeds/d3-sampling-metrics.html CHANGED
@@ -85,12 +85,6 @@
85
  letter-spacing: 0.05em;
86
  }
87
 
88
- .d3-sampling-metrics .section-title.sampling-metrics {
89
- stroke: var(--surface-bg);
90
- stroke-width: 10px;
91
- paint-order: stroke fill;
92
- }
93
-
94
  .d3-sampling-metrics .question-text {
95
  fill: var(--text-color);
96
  font-size: 14px;
@@ -287,11 +281,11 @@
287
 
288
  function render() {
289
  width = container.clientWidth || 800;
290
- height = Math.max(300, Math.round(width * 0.42));
291
 
292
  svg.attr('width', width).attr('height', height);
293
 
294
- const margin = { top: 50, right: 20, bottom: 20, left: 20 };
295
  const innerWidth = width - margin.left - margin.right;
296
  const innerHeight = height - margin.top - margin.bottom;
297
 
@@ -325,7 +319,7 @@
325
  const metricBoxHeight = 75;
326
 
327
  // Position samples in a row
328
- const samplesY = 40;
329
  const sampleSpacing = (innerWidth - sampleBoxWidth * samples.length) / (samples.length + 1);
330
 
331
  const sampleNodes = samples.map((d, i) => ({
@@ -352,10 +346,17 @@
352
  g.append('text')
353
  .attr('class', 'section-title')
354
  .attr('x', innerWidth / 2)
355
- .attr('y', samplesY - 20)
356
  .attr('text-anchor', 'middle')
357
  .text('5 SAMPLED GENERATIONS');
358
 
359
  // Draw connection lines from samples to metrics
360
  const linkGroup = g.append('g').attr('class', 'links');
361
 
@@ -483,14 +484,6 @@
483
  .attr('text-anchor', 'middle')
484
  .attr('fill', colors.metric)
485
  .text(d => d.result);
486
-
487
- // Ajouter "SAMPLING METRICS" en dernier pour qu'il soit au-dessus de tout
488
- g.append('text')
489
- .attr('class', 'section-title sampling-metrics')
490
- .attr('x', innerWidth / 2)
491
- .attr('y', metricsY - 20)
492
- .attr('text-anchor', 'middle')
493
- .text('SAMPLING METRICS');
494
  }
495
 
496
  render();
 
85
  letter-spacing: 0.05em;
86
  }
87
 
88
  .d3-sampling-metrics .question-text {
89
  fill: var(--text-color);
90
  font-size: 14px;
 
281
 
282
  function render() {
283
  width = container.clientWidth || 800;
284
+ height = Math.max(350, Math.round(width * 0.42));
285
 
286
  svg.attr('width', width).attr('height', height);
287
 
288
+ const margin = { top: 60, right: 20, bottom: 20, left: 20 };
289
  const innerWidth = width - margin.left - margin.right;
290
  const innerHeight = height - margin.top - margin.bottom;
291
 
 
319
  const metricBoxHeight = 75;
320
 
321
  // Position samples in a row
322
+ const samplesY = 20;
323
  const sampleSpacing = (innerWidth - sampleBoxWidth * samples.length) / (samples.length + 1);
324
 
325
  const sampleNodes = samples.map((d, i) => ({
 
346
  g.append('text')
347
  .attr('class', 'section-title')
348
  .attr('x', innerWidth / 2)
349
+ .attr('y', samplesY - 10)
350
  .attr('text-anchor', 'middle')
351
  .text('5 SAMPLED GENERATIONS');
352
 
353
+ g.append('text')
354
+ .attr('class', 'section-title')
355
+ .attr('x', innerWidth / 2)
356
+ .attr('y', metricsY - 10)
357
+ .attr('text-anchor', 'middle')
358
+ .text('SAMPLING METRICS');
359
+
360
  // Draw connection lines from samples to metrics
361
  const linkGroup = g.append('g').attr('class', 'links');
362
 
 
484
  .attr('text-anchor', 'middle')
485
  .attr('fill', colors.metric)
486
  .text(d => d.result);
487
  }
488
 
489
  render();
app/src/content/embeds/d3-text-metrics.html CHANGED
@@ -43,6 +43,10 @@
43
  transition: border-color 0.2s;
44
  }
45
 
 
 
  .d3-text-metrics .metric-name {
47
  font-size: 13px;
48
  font-weight: 600;
 
43
  transition: border-color 0.2s;
44
  }
45
 
46
+ .d3-text-metrics .metric-box:hover {
47
+ border-color: var(--primary-color);
48
+ }
49
+
50
  .d3-text-metrics .metric-name {
51
  font-size: 13px;
52
  font-weight: 600;
app/src/content/embeds/d3-tokenization-timeline.html CHANGED
@@ -1,5 +1,5 @@
1
  <div class="d3-prompt-evolution">
2
- <svg viewBox="0 0 900 370" xmlns="http://www.w3.org/2000/svg">
3
  <defs>
4
  <marker id="arrowhead-prompt" markerWidth="10" markerHeight="10" refX="9" refY="3" orient="auto">
5
  <polygon points="0 0, 10 3, 0 6" fill="currentColor" />
 
1
  <div class="d3-prompt-evolution">
2
+ <svg viewBox="0 0 900 500" xmlns="http://www.w3.org/2000/svg">
3
  <defs>
4
  <marker id="arrowhead-prompt" markerWidth="10" markerHeight="10" refX="9" refY="3" orient="auto">
5
  <polygon points="0 0, 10 3, 0 6" fill="currentColor" />
app/src/content/embeds/d3-vibe-checks.html DELETED
@@ -1,338 +0,0 @@
-<div class="d3-vibe-checks"></div>
-
-<style>
-  .d3-vibe-checks {
-    font-family: var(--default-font-family);
-    background: transparent !important;
-    border: none !important;
-    border-radius: 0 !important;
-    padding: var(--spacing-4) 0;
-    width: 100%;
-    margin: 0 auto;
-    position: relative;
-    box-shadow: none !important;
-  }
-
-  .d3-vibe-checks svg {
-    width: 100%;
-    height: auto;
-    display: block;
-  }
-
-  .d3-vibe-checks .card-rect {
-    stroke-width: 2;
-    transition: all 0.3s ease;
-  }
-
-  .d3-vibe-checks .card-title {
-    fill: var(--text-color);
-    font-size: 13px;
-    font-weight: 700;
-  }
-
-  .d3-vibe-checks .card-question {
-    fill: var(--text-color);
-    font-size: 12px;
-    font-weight: 500;
-    font-style: italic;
-  }
-
-  .d3-vibe-checks .card-label {
-    fill: var(--muted-color);
-    font-size: 10px;
-    font-weight: 600;
-    text-transform: uppercase;
-    letter-spacing: 0.05em;
-  }
-
-  .d3-vibe-checks .header-text {
-    fill: var(--text-color);
-    font-size: 12px;
-    font-weight: 700;
-    text-transform: uppercase;
-    letter-spacing: 0.05em;
-  }
-
-  @media (max-width: 768px) {
-    .d3-vibe-checks .card-title {
-      font-size: 11px;
-    }
-
-    .d3-vibe-checks .card-question {
-      font-size: 10px;
-    }
-  }
-</style>
-
-<script>
-  (() => {
-    const ensureD3 = (cb) => {
-      if (window.d3 && typeof window.d3.select === 'function') return cb();
-      let s = document.getElementById('d3-cdn-script');
-      if (!s) {
-        s = document.createElement('script');
-        s.id = 'd3-cdn-script';
-        s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
-        document.head.appendChild(s);
-      }
-      const onReady = () => {
-        if (window.d3 && typeof window.d3.select === 'function') cb();
-      };
-      s.addEventListener('load', onReady, { once: true });
-      if (window.d3) onReady();
-    };
-
-    const bootstrap = () => {
-      const scriptEl = document.currentScript;
-      let container = scriptEl ? scriptEl.previousElementSibling : null;
-      if (!(container && container.classList && container.classList.contains('d3-vibe-checks'))) {
-        const candidates = Array.from(document.querySelectorAll('.d3-vibe-checks'))
-          .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
-        container = candidates[candidates.length - 1] || null;
-      }
-
-      if (!container) return;
-
-      if (container.dataset) {
-        if (container.dataset.mounted === 'true') return;
-        container.dataset.mounted = 'true';
-      }
-
-      // Get colors from ColorPalettes or fallback
-      const getColors = () => {
-        if (window.ColorPalettes && typeof window.ColorPalettes.getColors === 'function') {
-          return window.ColorPalettes.getColors('categorical', 3);
-        }
-        return ['#1f77b4', '#ff7f0e', '#2ca02c'];
-      };
-
-      // Vibe-check examples
-      const vibeChecks = [
-        {
-          id: 'strawberry',
-          title: 'Letter Counting',
-          question: 'How many "r"s in "strawberry"?',
-          category: 'Reasoning',
-          answers: [
-            { label: 'Model A', text: '3', correct: true },
-            { label: 'Model B', text: '2', correct: false }
-          ]
-        },
-        {
-          id: 'numbers',
-          title: 'Number Comparison',
-          question: 'Is 9.9 bigger or smaller than 9.11?',
-          category: 'Math',
-          answers: [
-            { label: 'Model A', text: '9.9 < 9.11', correct: false },
-            { label: 'Model B', text: '9.9 > 9.11', correct: true }
-          ]
-        },
-        {
-          id: 'tikz',
-          title: 'Creative Generation',
-          question: 'Draw a unicorn in TikZ',
-          category: 'Coding',
-          answers: [
-            { label: 'Model A', text: '\\draw[...] unicorn', correct: true },
-            { label: 'Model B', text: 'Error: invalid', correct: false }
-          ]
-        }
-      ];
-
-      // Function to draw model answers
-      function drawAnswers(g, x, y, width, answers, color) {
-        const answerHeight = 25;
-        const answerSpacing = 8;
-        const startY = y;
-
-        answers.forEach((answer, i) => {
-          const answerY = startY + i * (answerHeight + answerSpacing);
-
-          // Answer box
-          const boxGroup = g.append('g');
-
-          boxGroup.append('rect')
-            .attr('x', x - width / 2 + 10)
-            .attr('y', answerY)
-            .attr('width', width - 20)
-            .attr('height', answerHeight)
-            .attr('rx', 6)
-            .attr('fill', color)
-            .attr('fill-opacity', answer.correct ? 0.2 : 0.08)
-            .attr('stroke', color)
-            .attr('stroke-width', 1.5)
-            .attr('stroke-opacity', answer.correct ? 0.6 : 0.3);
-
-          // Combined label and answer text
-          const labelText = answer.label + ': ';
-          const combinedText = labelText + answer.text;
-
-          boxGroup.append('text')
-            .attr('x', x - width / 2 + 18)
-            .attr('y', answerY + answerHeight / 2)
-            .attr('dominant-baseline', 'middle')
-            .attr('font-size', 11)
-            .attr('fill', color)
-            .html(() => {
-              return `<tspan font-weight="600" opacity="0.8" font-size="10">${answer.label}: </tspan><tspan font-weight="${answer.correct ? 600 : 400}" opacity="${answer.correct ? 1 : 0.6}">${answer.text}</tspan>`;
-            });
-
-          // Checkmark or X
-          if (answer.correct) {
-            boxGroup.append('text')
-              .attr('x', x + width / 2 - 28)
-              .attr('y', answerY + answerHeight / 2)
-              .attr('dominant-baseline', 'middle')
-              .attr('font-size', 14)
-              .attr('font-weight', 700)
-              .attr('fill', color)
-              .text('✓');
-          } else {
-            boxGroup.append('text')
-              .attr('x', x + width / 2 - 28)
-              .attr('y', answerY + answerHeight / 2)
-              .attr('dominant-baseline', 'middle')
-              .attr('font-size', 14)
-              .attr('font-weight', 400)
-              .attr('fill', color)
-              .attr('opacity', 0.4)
-              .text('✗');
-          }
-        });
-      }
-
-      const svg = d3.select(container).append('svg');
-      const g = svg.append('g');
-
-      let width = 800;
-      let height = 300;
-
-      // Helper function to wrap text
-      function wrapText(text, width) {
-        text.each(function() {
-          const text = d3.select(this);
-          const words = text.text().split(/\s+/).reverse();
-          let word;
-          let line = [];
-          let lineNumber = 0;
-          const lineHeight = 1.2;
-          const y = text.attr('y');
-          const x = text.attr('x');
-          const dy = parseFloat(text.attr('dy') || 0);
-          let tspan = text.text(null).append('tspan')
-            .attr('x', x)
-            .attr('y', y)
-            .attr('dy', dy + 'em');
-
-          while ((word = words.pop())) {
-            line.push(word);
-            tspan.text(line.join(' '));
-            if (tspan.node().getComputedTextLength() > width) {
-              line.pop();
-              tspan.text(line.join(' '));
-              line = [word];
-              tspan = text.append('tspan')
-                .attr('x', x)
-                .attr('y', y)
-                .attr('dy', ++lineNumber * lineHeight + dy + 'em')
-                .text(word);
-            }
-          }
-        });
-      }
-
-      function render() {
-        width = container.clientWidth || 800;
-        height = Math.max(250, Math.round(width * 0.4));
-
-        svg.attr('width', width).attr('height', height);
-
-        const margin = { top: 40, right: 20, bottom: 20, left: 20 };
-        const innerWidth = width - margin.left - margin.right;
-        const innerHeight = height - margin.top - margin.bottom;
-
-        g.attr('transform', `translate(${margin.left},${margin.top})`);
-
-        // Clear previous content
-        g.selectAll('*').remove();
-
-        const colors = getColors();
-
-        // Header
-        g.append('text')
-          .attr('class', 'header-text')
-          .attr('x', innerWidth / 2)
-          .attr('y', -15)
-          .attr('text-anchor', 'middle')
-          .text('VIBE-CHECK EXAMPLES');
-
-        // Calculate card dimensions
-        const cardSpacing = Math.min(20, innerWidth * 0.03);
-        const cardWidth = (innerWidth - cardSpacing * 2) / 3;
-        const cardHeight = innerHeight * 0.85;
-        const cardY = innerHeight * 0.1;
-
-        // Draw cards
-        vibeChecks.forEach((check, i) => {
-          const x = i * (cardWidth + cardSpacing);
-
-          const cardGroup = g.append('g')
-            .attr('transform', `translate(${x},${cardY})`);
-
-          // Card background with frame
-          cardGroup.append('rect')
-            .attr('class', 'card-rect')
-            .attr('width', cardWidth)
-            .attr('height', cardHeight)
-            .attr('rx', 12)
-            .attr('fill', colors[i])
-            .attr('fill-opacity', 0.12)
-            .attr('stroke', colors[i])
-            .attr('stroke-opacity', 0.6)
-            .attr('stroke-width', 2);
-
-          // Title
-          cardGroup.append('text')
-            .attr('class', 'card-title')
-            .attr('x', cardWidth / 2)
-            .attr('y', 25)
-            .attr('text-anchor', 'middle')
-            .text(check.title);
-
-          // Question with wrapping
-          const questionText = cardGroup.append('text')
-            .attr('class', 'card-question')
-            .attr('x', cardWidth / 2)
-            .attr('y', 45)
-            .attr('text-anchor', 'middle')
-            .attr('dy', 0)
-            .text(check.question);
-
-          // Apply text wrapping
-          wrapText(questionText, cardWidth - 30);
-
-          // Model answers
-          const answersY = cardHeight * 0.5;
-          drawAnswers(cardGroup, cardWidth / 2, answersY, cardWidth, check.answers, colors[i]);
-        });
-      }
-
-      render();
-
-      // Responsive handling
-      if (window.ResizeObserver) {
-        const ro = new ResizeObserver(() => render());
-        ro.observe(container);
-      } else {
-        window.addEventListener('resize', render);
-      }
-    };
-
-    if (document.readyState === 'loading') {
-      document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
-    } else {
-      ensureD3(bootstrap);
-    }
-  })();
-</script>
app/src/styles/components/_table.css CHANGED
@@ -10,7 +10,7 @@
   border-bottom: 1px solid var(--border-color);
   padding: 6px 8px;
   font-size: 15px;
-  /* white-space: nowrap; */
+  white-space: nowrap;
   /* prevent squashing; allow horizontal scroll instead */
   word-break: auto-phrase;
   /* white-space: break-spaces; */