Clémentine committed on
Commit
112a899
·
1 Parent(s): 9a4bbbe

fix + more evals + figure

app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx CHANGED
@@ -183,11 +183,13 @@ When models generate outputs, sampling multiple times and aggregating results ca
183
  This is particularly important for complex reasoning tasks where models may arrive at correct answers through different paths.
184
 
185
  Common sampling-based metrics are:
186
- - **pass@k over n**: Given n generated samples, measures whether at least k passes the test.
187
- <Sidenote> You'll find two functions for this metric: computed trivially as: $\text{pass}@k = (c >= k)$, or computed with an unbiased estimator with: $\text{pass}@k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$ where c is the number of correct samples among n total samples. </Sidenote>
188
  - **maj@n** (majority voting): Sample n generations and take the most frequent answer. This helps filter out spurious outputs and works particularly well when the model's correct reasoning path is more consistent than its errors. Commonly used for math and reasoning tasks.
189
- - **cot@n** (chain-of-thought sampling): Sample n reasoning traces and evaluate them. Can be combined with majority voting or a pass@k (sample n reasoning chains, extract final answers, take majority or a threshold).
190
- - **avg@n** (stable average score): Average the scores across n samples. It's a more stable estimator of performance than using "best" or "most common" case.
 
 
191
 
192
  When you use sampling evaluations, make sure to always report all sampling parameters (temperature, top-p, k value) as they significantly affect results.
193
 
@@ -546,7 +548,9 @@ However, some recent [research](https://arxiv.org/abs/2408.02442) seems to show
546
  - [Interleaved generation](https://github.com/guidance-ai/guidance?tab=readme-ov-file#guidance-acceleration), another method to constrain generations for some specific output formats
547
  </Note>
548
 
549
- ### Confidence and score reporting
 
 
550
 
551
  When reporting evaluation results, it's critical to include **confidence intervals** alongside point estimates.
552
 
@@ -554,3 +558,5 @@ These confidence intervals from the raw scores can be obtained from standard dev
554
 
555
  You can also compute these with prompt variations, by asking the same questions in slightly different ways, or re-running on the same samples with different prompt formats.
556
 
 
 
 
183
  This is particularly important for complex reasoning tasks where models may arrive at correct answers through different paths.
184
 
185
  Common sampling-based metrics are:
186
+ - **pass@k over n**: Given n generated samples, measures whether at least k of them pass the test. <Sidenote> You'll find two formulations of this metric: computed trivially as $\text{pass}@k = (c \geq k)$, or computed with an unbiased estimator as $\text{pass}@k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$, where c is the number of correct samples among n total samples. </Sidenote>
187
+
188
  - **maj@n** (majority voting): Sample n generations and take the most frequent answer. This helps filter out spurious outputs and works particularly well when the model's correct reasoning path is more consistent than its errors. Commonly used for math and reasoning tasks.
189
+ - **cot@n** (chain-of-thought sampling): Sample n reasoning traces and evaluate them. Can be combined with majority voting or pass@k (sample n reasoning chains, extract the final answers, then take the majority or apply a threshold).
190
+ - **avg@n** (stable average score): Average the scores across n samples. It's a more stable estimator of performance than using the "best" or "most common" case (see the code sketch after the figure below).
191
+
192
+ <HtmlEmbed src="d3-sampling-metrics.html" title="Sampling metrics comparison" />
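For illustration, here is a minimal sketch of how these metrics could be computed from the per-sample answers shown in the figure above; the helper names and toy data are illustrative, not taken from a specific evaluation library.

```python
import math
from collections import Counter

def pass_at_k_trivial(c: int, k: int) -> bool:
    """Trivial pass@k: did at least k of the n samples pass?"""
    return c >= k

def pass_at_k_unbiased(n: int, c: int, k: int) -> float:
    """Unbiased estimator 1 - C(n-c, k) / C(n, k), with c correct out of n."""
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def maj_at_n(answers: list[str]) -> str:
    """maj@n: most frequent extracted answer across the n samples."""
    return Counter(answers).most_common(1)[0][0]

def avg_at_n(scores: list[float]) -> float:
    """avg@n: mean score across the n samples."""
    return sum(scores) / len(scores)

# Toy data: 5 samples for "15 + 27 = ?", of which 3 are correct
answers = ["42", "42", "43", "42", "41"]
scores = [1.0 if a == "42" else 0.0 for a in answers]
c = int(sum(scores))

print(pass_at_k_trivial(c, k=3))          # True (at least 3 of 5 correct)
print(pass_at_k_unbiased(n=5, c=c, k=1))  # 0.6
print(maj_at_n(answers))                  # "42"
print(avg_at_n(scores))                   # 0.6
```

The trivial form answers a yes/no question about this particular batch of n samples, while the unbiased estimator extrapolates to the expected success rate if you had drawn only k samples.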
193
 
194
  When you use sampling evaluations, make sure to always report all sampling parameters (temperature, top-p, k value) as they significantly affect results.
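In practice, this can be as simple as attaching the full sampling configuration to every reported score; the record below is a sketch, and the field names are purely illustrative.

```python
# Illustrative record to publish alongside a sampling-based score
result = {
    "metric": "maj@8",
    "score": 0.74,
    "n_samples": 8,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 50,
    "seed": 1234,
    "prompt_format": "0-shot, chain-of-thought instruction",
}
```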
195
 
 
548
  - [Interleaved generation](https://github.com/guidance-ai/guidance?tab=readme-ov-file#guidance-acceleration), another method to constrain generations for some specific output formats
549
  </Note>
550
 
551
+ ### The forgotten children of evaluation
552
+
553
+ #### Confidence
554
 
555
  When reporting evaluation results, it's critical to include **confidence intervals** alongside point estimates.
556
 
 
558
 
559
  You can also compute these with prompt variations, by asking the same questions in slightly different ways, or re-running on the same samples with different prompt formats.
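One simple way to obtain such intervals is a percentile bootstrap over per-question scores. The sketch below assumes binary per-question accuracies; the function name and toy numbers are illustrative.

```python
import random

def bootstrap_ci(scores: list[float], n_resamples: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean benchmark score.

    Resamples the per-question scores with replacement and returns the
    (alpha/2, 1 - alpha/2) percentiles of the resampled means.
    """
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Toy example: per-question accuracies on a 200-item benchmark (point estimate 0.65)
scores = [1.0] * 130 + [0.0] * 70
low, high = bootstrap_ci(scores)
print(f"accuracy = 0.65, 95% CI ~ [{low:.2f}, {high:.2f}]")
```

The same resampling can be applied over prompt variations: treat each (question, prompt format) score as a draw and bootstrap over those instead.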
560
 
561
+ #### Cost
562
+
app/src/content/chapters/general-knowledge/2025-evaluations-for-useful-models.mdx CHANGED
@@ -61,6 +61,8 @@ For post training, you want more holistic evaluations, and a couple benchmarks m
61
 
62
  [**SweBench**](https://openreview.net/pdf?id=VTF8yNQM66) (2024) is a better-known and more complete version of this, also using GitHub, but this time testing whether models can solve existing issues, which requires logic understanding, cross-file editing and execution, long-context reasoning, etc.
63
 
 
 
64
  At this time, I would recommend following LiveCodeBench, AiderBench and the higher quality subset of SWE-Bench (SWE-Bench verified), and reading the [METR report](https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/) on actual code assistant usefulness.
65
 
66
  #### Long context
@@ -98,6 +100,7 @@ Lastly, with the creation of MCPs, some benchmarks arose to test MCP oriented to
98
 
99
  [**LiveMCPBench**](https://arxiv.org/abs/2508.01780) (2025) provides a large locally deployable collection of MCP servers to test how good models are at discriminating between tools to accomplish tasks. Best models are already reaching 80%, so we're close to saturation. However, testing whether models can select the proper tools from very long lists is a good use case which will be increasingly important as the web goes MCP.
100
 
 
101
  (By the way, here's a cool [doc](https://www.anthropic.com/engineering/writing-tools-for-agents) on how to write good tools.)
102
 
103
  While testing individual capabilities provides valuable signal, real-world assistant performance comes from how these capabilities combine. A model might excel at reasoning but fail when that reasoning must be integrated with tool calling and long context management simultaneously, so we need evaluations requiring the orchestration of multiple capabilities together.
@@ -134,7 +137,8 @@ The most famous formal evaluation among these is probably [ARC-AGI](https://arcp
134
 
135
  The community and model providers have explored a number of existing games with LLMs. Single-player adventure games/RPGs like [TextQuests](https://huggingface.co/blog/textquests) (2025) or [Pokemon](https://github.com/benchflow-ai/benchflow/tree/main/libs/pokemon-gym) (2024) (on Twitch for [Claude](https://www.twitch.tv/claudeplayspokemon) and [Gemini](https://www.twitch.tv/gemini_plays_pokemon), for example) require very long-range planning to reach objectives, which in turn requires adequate long-context memory management, reasoning, and backtracking abilities. The same abilities are needed for single-player survival games like [Crafter](https://arxiv.org/abs/2109.06780) (2021, Minecraft-inspired). A number of single-player game environments have been integrated into the [Balrog](https://arxiv.org/pdf/2411.13543) (2024) benchmark.
136
 
137
- Competitive bluffing games like [Poker](https://arxiv.org/html/2501.08328v1) (2025) or Mafia variations like [Town of Salem](https://github.com/summersonnn/Town-Of-Salem-with-LLMs) (2025) and Werewolf (2025, [here](https://arxiv.org/abs/2407.13943)/[there](https://werewolf.foaster.ai/)) are very interesting to test logic, reasoning, as well as deception abilities. Claude Opus 4 is for example incapable of winning Town of Salem as a vampire (deceptive role) but does well as a peasant (non deceptive role). Cooperative games like Hanabi can also be used to test adaptability and communication ability in a constrained environment.
 
138
 
139
  What's also very neat about these is that they have a single and unambiguous pass/fail metric: did the LLM win the game or not? At the moment, if I were to use these to evaluate models I would probably look at TextQuests for abilities and Town of Salem for safety.
140
 
@@ -149,16 +153,16 @@ In the last year, a new category of impossible to contaminate tasks emerged: for
149
 
150
  A similar approach is used to generate questions in [Arbitrage](https://arxiv.org/pdf/2412.18544), the core difference being the time horizon: events there should be resolved in 2028.
151
 
152
- In a similar vein, you'll also find arenas where LLMs are provided with money to actively trade on financial markets - these experiments are less likely to give meaningful results, as, because of their costs, they tend to be run once per model only, so you get no statistical significance there.
153
 
154
  <Note title="TLDR" emoji="🎯">
155
  The landscape of evaluation has evolved with the jumps in capabilities, from testing isolated skills to measuring integrated performance in more realistic scenarios.
156
 
157
  As of Nov 2025, I recommend using:
158
 
159
- - **Core capabilities** (for model builders): Old capabilities evals for training, and for post training MATH500/AIME24, GPQA, IFEval, SWE-Bench, a long range eval of your choice like HELMET, TauBench or BFCL if you're targetting tool use
160
  - **Core capabilities** (for comparing models at inference): IFBench, HLE, MathArena, AiderBench and LiveCodeBench, MCP-Universe
161
- - **Long horizon tasks** (for real-world performance): GAIA, DABStep, SciCode, or domain specific evaluations for your use cases
162
  - **Games** (for some extra fun in measuring robustness and adaptability): ARC-AGI3 when it's out, TextQuests, Town of Salem if you're interested in safety, or any other game you like which goes beyond Poker/Chess/Go.
163
 
164
  The field is moving toward evaluations that test capability orchestration rather than isolated skills, closer to actual use. This matches our goal of building models that "work well"—systems that can reliably combine core capabilities and tool use, with good orchestration, to solve actual problems.
 
61
 
62
  [**SweBench**](https://openreview.net/pdf?id=VTF8yNQM66) (2024) is a better-known and more complete version of this, also using GitHub, but this time testing whether models can solve existing issues, which requires logic understanding, cross-file editing and execution, long-context reasoning, etc.
63
 
64
+ [**CodeClash**](https://codeclash.ai/) (2025) is the coding version of an arena, where models write code that competes against other models' code, then edit and iterate.
65
+
66
  At this time, I would recommend following LiveCodeBench, AiderBench and the higher quality subset of SWE-Bench (SWE-Bench verified), and reading the [METR report](https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/) on actual code assistant usefulness.
67
 
68
  #### Long context
 
100
 
101
  [**LiveMCPBench**](https://arxiv.org/abs/2508.01780) (2025) provides a large locally deployable collection of MCP servers to test how good models are at discriminating between tools to accomplish tasks. Best models are already reaching 80%, so we're close to saturation. However, testing whether models can select the proper tools from very long lists is a good use case which will be increasingly important as the web goes MCP.
102
 
103
+
104
  (By the way, here's a cool [doc](https://www.anthropic.com/engineering/writing-tools-for-agents) on how to write good tools.)
105
 
106
  While testing individual capabilities provides valuable signal, real-world assistant performance comes from how these capabilities combine. A model might excel at reasoning but fail when that reasoning must be integrated with tool calling and long context management simultaneously, so we need evaluations requiring the orchestration of multiple capabilities together.
 
137
 
138
  The community and model providers have explored a number of existing games with LLMs. Single-player adventure games/RPGs like [TextQuests](https://huggingface.co/blog/textquests) (2025) or [Pokemon](https://github.com/benchflow-ai/benchflow/tree/main/libs/pokemon-gym) (2024) (on Twitch for [Claude](https://www.twitch.tv/claudeplayspokemon) and [Gemini](https://www.twitch.tv/gemini_plays_pokemon), for example) require very long-range planning to reach objectives, which in turn requires adequate long-context memory management, reasoning, and backtracking abilities. The same abilities are needed for single-player survival games like [Crafter](https://arxiv.org/abs/2109.06780) (2021, Minecraft-inspired). A number of single-player game environments have been integrated into the [Balrog](https://arxiv.org/pdf/2411.13543) (2024) benchmark.
139
 
140
+ Competitive bluffing games like [Poker](https://arxiv.org/html/2501.08328v1) (2025), Mafia variations like [Town of Salem](https://github.com/summersonnn/Town-Of-Salem-with-LLMs) (2025) and Werewolf (2025, [here](https://arxiv.org/abs/2407.13943)/[there](https://werewolf.foaster.ai/)), or [Among Us](antimlabs.com/amongais) are very interesting for testing logic and reasoning, as well as deception abilities. Claude Opus 4, for example, is incapable of winning Town of Salem as a vampire (a deceptive role) but does well as a peasant (a non-deceptive role). Cooperative games like [Hanabi](https://arxiv.org/abs/2510.04980) can also be used to test adaptability and communication ability in a constrained environment.
142
 
143
  What's also very neat about these is that they have a single and unambiguous pass/fail metric: did the LLM win the game or not? At the moment, if I were to use these to evaluate models I would probably look at TextQuests for abilities and Town of Salem for safety.
144
 
 
153
 
154
  A similar approach is used to generate questions in [Arbitrage](https://arxiv.org/pdf/2412.18544), the core difference being the time horizon: events there should be resolved in 2028.
155
 
156
+ In a similar vein, you'll also find arenas where LLMs are provided with money to actively trade on financial markets (like Alpha Arena or Trading Agents). These experiments are less likely to give meaningful results: because of their cost, they tend to be run only once per model, so you get no statistical significance.
157
 
158
  <Note title="TLDR" emoji="🎯">
159
  The landscape of evaluation has evolved with the jumps in capabilities, from testing isolated skills to measuring integrated performance in more realistic scenarios.
160
 
161
  As of Nov 2025, I recommend using:
162
 
163
+ - **Core capabilities** (for model builders): Old capabilities evals for training, and for post-training AIME26 when it comes out, GPQA, IFEval, SWE-Bench, a long-range eval of your choice like HELMET, and TauBench or BFCL if you're targeting tool use
164
  - **Core capabilities** (for comparing models at inference): IFBench, HLE, MathArena, AiderBench and LiveCodeBench, MCP-Universe
165
+ - **Long horizon tasks** (for real-world performance): GAIA2, DABStep, SciCode, or domain-specific evaluations for your use cases
166
  - **Games** (for some extra fun in measuring robustness and adaptability): ARC-AGI3 when it's out, TextQuests, Town of Salem if you're interested in safety, or any other game you like which goes beyond Poker/Chess/Go.
167
 
168
  The field is moving toward evaluations that test capability orchestration rather than isolated skills, closer to actual use. This matches our goal of building models that "work well"—systems that can reliably combine core capabilities and tool use, with good orchestration, to solve actual problems.
app/src/content/embeds/d3-sampling-metrics.html ADDED
@@ -0,0 +1,506 @@
1
+ <div class="d3-sampling-metrics"></div>
2
+
3
+ <style>
4
+ .d3-sampling-metrics {
5
+ font-family: var(--default-font-family);
6
+ background: transparent;
7
+ border: none;
8
+ border-radius: 0;
9
+ padding: var(--spacing-4) 0;
10
+ width: 100%;
11
+ margin: 0 auto;
12
+ position: relative;
13
+ }
14
+
15
+ .d3-sampling-metrics svg {
16
+ width: 100%;
17
+ height: auto;
18
+ display: block;
19
+ }
20
+
21
+ .d3-sampling-metrics .sample-box {
22
+ stroke-width: 2;
23
+ transition: all 0.3s ease;
24
+ }
25
+
26
+ .d3-sampling-metrics .sample-box:hover {
27
+ filter: brightness(1.1);
28
+ stroke-width: 3;
29
+ }
30
+
31
+ .d3-sampling-metrics .metric-box {
32
+ stroke-width: 2;
33
+ transition: all 0.3s ease;
34
+ }
35
+
36
+ .d3-sampling-metrics .metric-box:hover {
37
+ filter: brightness(1.1);
38
+ stroke-width: 3;
39
+ }
40
+
41
+ .d3-sampling-metrics .sample-label {
42
+ fill: var(--text-color);
43
+ font-size: 11px;
44
+ font-weight: 600;
45
+ pointer-events: none;
46
+ user-select: none;
47
+ }
48
+
49
+ .d3-sampling-metrics .sample-answer {
50
+ fill: var(--text-color);
51
+ font-size: 10px;
52
+ font-weight: 500;
53
+ pointer-events: none;
54
+ user-select: none;
55
+ }
56
+
57
+ .d3-sampling-metrics .metric-label {
58
+ fill: var(--text-color);
59
+ font-size: 13px;
60
+ font-weight: 600;
61
+ pointer-events: none;
62
+ user-select: none;
63
+ }
64
+
65
+ .d3-sampling-metrics .metric-description {
66
+ fill: var(--muted-color);
67
+ font-size: 10px;
68
+ font-weight: 500;
69
+ pointer-events: none;
70
+ user-select: none;
71
+ }
72
+
73
+ .d3-sampling-metrics .metric-result {
74
+ font-size: 16px;
75
+ font-weight: 700;
76
+ pointer-events: none;
77
+ user-select: none;
78
+ }
79
+
80
+ .d3-sampling-metrics .section-title {
81
+ fill: var(--text-color);
82
+ font-size: 12px;
83
+ font-weight: 700;
84
+ text-transform: uppercase;
85
+ letter-spacing: 0.05em;
86
+ }
87
+
88
+ .d3-sampling-metrics .question-text {
89
+ fill: var(--text-color);
90
+ font-size: 14px;
91
+ font-weight: 600;
92
+ }
93
+
94
+ .d3-sampling-metrics .link-line {
95
+ fill: none;
96
+ stroke-width: 1.5;
97
+ transition: all 0.3s ease;
98
+ opacity: 0.3;
99
+ }
100
+
101
+ .d3-sampling-metrics .marker {
102
+ opacity: 0.5;
103
+ }
104
+
105
+ .d3-sampling-metrics .d3-tooltip {
106
+ position: absolute;
107
+ background: var(--surface-bg);
108
+ border: 1px solid var(--border-color);
109
+ border-radius: 8px;
110
+ padding: 8px 10px;
111
+ font-size: 12px;
112
+ pointer-events: none;
113
+ opacity: 0;
114
+ transition: opacity 0.12s ease;
115
+ box-shadow: 0 4px 12px rgba(0, 0, 0, 0.15);
116
+ z-index: 1000;
117
+ max-width: 350px;
118
+ line-height: 1.35;
119
+ white-space: pre-line;
120
+ color: var(--text-color);
121
+ transform: translate(-9999px, -9999px);
122
+ }
123
+
124
+ @media (max-width: 768px) {
125
+ .d3-sampling-metrics .sample-label {
126
+ font-size: 10px;
127
+ }
128
+
129
+ .d3-sampling-metrics .sample-answer {
130
+ font-size: 9px;
131
+ }
132
+
133
+ .d3-sampling-metrics .metric-label {
134
+ font-size: 11px;
135
+ }
136
+
137
+ .d3-sampling-metrics .metric-description {
138
+ font-size: 9px;
139
+ }
140
+
141
+ .d3-sampling-metrics .metric-result {
142
+ font-size: 14px;
143
+ }
144
+ }
145
+ </style>
146
+
147
+ <script>
148
+ (() => {
149
+ const ensureD3 = (cb) => {
150
+ if (window.d3 && typeof window.d3.select === 'function') return cb();
151
+ let s = document.getElementById('d3-cdn-script');
152
+ if (!s) {
153
+ s = document.createElement('script');
154
+ s.id = 'd3-cdn-script';
155
+ s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
156
+ document.head.appendChild(s);
157
+ }
158
+ const onReady = () => {
159
+ if (window.d3 && typeof window.d3.select === 'function') cb();
160
+ };
161
+ s.addEventListener('load', onReady, { once: true });
162
+ if (window.d3) onReady();
163
+ };
164
+
165
+ const bootstrap = () => {
166
+ const scriptEl = document.currentScript;
167
+ let container = scriptEl ? scriptEl.previousElementSibling : null;
168
+ if (!(container && container.classList && container.classList.contains('d3-sampling-metrics'))) {
169
+ const candidates = Array.from(document.querySelectorAll('.d3-sampling-metrics'))
170
+ .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
171
+ container = candidates[candidates.length - 1] || null;
172
+ }
173
+
174
+ if (!container) return;
175
+
176
+ if (container.dataset) {
177
+ if (container.dataset.mounted === 'true') return;
178
+ container.dataset.mounted = 'true';
179
+ }
180
+
181
+ container.style.position = container.style.position || 'relative';
182
+
183
+ // Tooltip
184
+ let tip = container.querySelector('.d3-tooltip');
185
+ let tipInner;
186
+ if (!tip) {
187
+ tip = document.createElement('div');
188
+ tip.className = 'd3-tooltip';
189
+ tipInner = document.createElement('div');
190
+ tipInner.className = 'd3-tooltip__inner';
191
+ tipInner.style.textAlign = 'left';
192
+ tip.appendChild(tipInner);
193
+ container.appendChild(tip);
194
+ } else {
195
+ tipInner = tip.querySelector('.d3-tooltip__inner') || tip;
196
+ }
197
+
198
+ // Get colors from ColorPalettes or fallback
199
+ const getColors = () => {
200
+ if (window.ColorPalettes && typeof window.ColorPalettes.getColors === 'function') {
201
+ const cat = window.ColorPalettes.getColors('categorical', 5);
202
+ return {
203
+ correct: cat[2], // Green-ish
204
+ incorrect: cat[0], // Red-ish
205
+ metric: cat[4]
206
+ };
207
+ }
208
+ // Fallback to CSS variable based colors
209
+ const primaryColor = getComputedStyle(document.documentElement).getPropertyValue('--primary-color').trim() || '#6D4AFF';
210
+ return {
211
+ correct: '#4CAF50',
212
+ incorrect: '#F44336',
213
+ metric: primaryColor
214
+ };
215
+ };
216
+
217
+ // Example: Math problem "What is 15 + 27?"
218
+ // Correct answer: 42
219
+ // Samples with different answers
220
+ const samples = [
221
+ { id: 1, answer: '42', correct: true },
222
+ { id: 2, answer: '42', correct: true },
223
+ { id: 3, answer: '43', correct: false },
224
+ { id: 4, answer: '42', correct: true },
225
+ { id: 5, answer: '41', correct: false }
226
+ ];
227
+
228
+ const metrics = [
229
+ {
230
+ id: 'pass@1',
231
+ label: 'pass@1',
232
+ description: 'At least 1 correct',
233
+ result: '✓',
234
+ explanation: 'At least 1 of 5 samples is correct (we have 3 correct samples)',
235
+ usedSamples: [1]
236
+ },
237
+ {
238
+ id: 'pass@3',
239
+ label: 'pass@3',
240
+ description: 'At least 3 correct',
241
+ result: '✓',
242
+ explanation: 'At least 3 of 5 samples are correct (exactly 3 correct)',
243
+ usedSamples: [1, 2, 4]
244
+ },
245
+ {
246
+ id: 'maj@5',
247
+ label: 'maj@5',
248
+ description: 'Most frequent answer',
249
+ result: '42',
250
+ explanation: 'Most common answer: 42 appears 3 times vs 43 (1x) and 41 (1x)',
251
+ usedSamples: [1, 2, 3, 4, 5]
252
+ },
253
+ {
254
+ id: 'avg@5',
255
+ label: 'avg@5',
256
+ description: 'Average score',
257
+ result: '0.60',
258
+ explanation: 'Average correctness: 3 correct / 5 total = 0.60',
259
+ usedSamples: [1, 2, 3, 4, 5]
260
+ }
261
+ ];
262
+
263
+ const svg = d3.select(container).append('svg');
264
+ const g = svg.append('g');
265
+
266
+ // Arrow marker
267
+ svg.append('defs').append('marker')
268
+ .attr('id', 'arrow-sampling')
269
+ .attr('viewBox', '0 -5 10 10')
270
+ .attr('refX', 8)
271
+ .attr('refY', 0)
272
+ .attr('markerWidth', 5)
273
+ .attr('markerHeight', 5)
274
+ .attr('orient', 'auto')
275
+ .append('path')
276
+ .attr('d', 'M0,-5L10,0L0,5')
277
+ .attr('class', 'marker');
278
+
279
+ let width = 800;
280
+ let height = 500;
281
+
282
+ function render() {
283
+ width = container.clientWidth || 800;
284
+ height = Math.max(350, Math.round(width * 0.42));
285
+
286
+ svg.attr('width', width).attr('height', height);
287
+
288
+ const margin = { top: 60, right: 20, bottom: 20, left: 20 };
289
+ const innerWidth = width - margin.left - margin.right;
290
+ const innerHeight = height - margin.top - margin.bottom;
291
+
292
+ g.attr('transform', `translate(${margin.left},${margin.top})`);
293
+
294
+ // Clear previous content
295
+ g.selectAll('*').remove();
296
+
297
+ const colors = getColors();
298
+
299
+ // Question at the top
300
+ g.append('text')
301
+ .attr('class', 'question-text')
302
+ .attr('x', innerWidth / 2)
303
+ .attr('y', -35)
304
+ .attr('text-anchor', 'middle')
305
+ .text('Question: What is 15 + 27?');
306
+
307
+ g.append('text')
308
+ .attr('x', innerWidth / 2)
309
+ .attr('y', -18)
310
+ .attr('text-anchor', 'middle')
311
+ .attr('font-size', '11px')
312
+ .attr('fill', 'var(--muted-color)')
313
+ .text('(Correct answer: 42)');
314
+
315
+ // Layout
316
+ const sampleBoxWidth = Math.min(80, innerWidth * 0.12);
317
+ const sampleBoxHeight = 60;
318
+ const metricBoxWidth = Math.min(140, innerWidth * 0.22);
319
+ const metricBoxHeight = 75;
320
+
321
+ // Position samples in a row
322
+ const samplesY = 20;
323
+ const sampleSpacing = (innerWidth - sampleBoxWidth * samples.length) / (samples.length + 1);
324
+
325
+ const sampleNodes = samples.map((d, i) => ({
326
+ ...d,
327
+ x: sampleSpacing + i * (sampleBoxWidth + sampleSpacing),
328
+ y: samplesY,
329
+ width: sampleBoxWidth,
330
+ height: sampleBoxHeight
331
+ }));
332
+
333
+ // Position metrics below
334
+ const metricsY = samplesY + sampleBoxHeight + 60;
335
+ const metricSpacing = (innerWidth - metricBoxWidth * metrics.length) / (metrics.length + 1);
336
+
337
+ const metricNodes = metrics.map((d, i) => ({
338
+ ...d,
339
+ x: metricSpacing + i * (metricBoxWidth + metricSpacing),
340
+ y: metricsY,
341
+ width: metricBoxWidth,
342
+ height: metricBoxHeight
343
+ }));
344
+
345
+ // Section titles
346
+ g.append('text')
347
+ .attr('class', 'section-title')
348
+ .attr('x', innerWidth / 2)
349
+ .attr('y', samplesY - 10)
350
+ .attr('text-anchor', 'middle')
351
+ .text('5 SAMPLED GENERATIONS');
352
+
353
+ g.append('text')
354
+ .attr('class', 'section-title')
355
+ .attr('x', innerWidth / 2)
356
+ .attr('y', metricsY - 10)
357
+ .attr('text-anchor', 'middle')
358
+ .text('SAMPLING METRICS');
359
+
360
+ // Draw connection lines from samples to metrics
361
+ const linkGroup = g.append('g').attr('class', 'links');
362
+
363
+ metricNodes.forEach(metric => {
364
+ metric.usedSamples.forEach(sampleId => {
365
+ const sample = sampleNodes.find(s => s.id === sampleId);
366
+ if (sample) {
367
+ const sx = sample.x + sample.width / 2;
368
+ const sy = sample.y + sample.height;
369
+ const tx = metric.x + metric.width / 2;
370
+ const ty = metric.y;
371
+
372
+ linkGroup.append('line')
373
+ .attr('class', 'link-line')
374
+ .attr('x1', sx)
375
+ .attr('y1', sy)
376
+ .attr('x2', tx)
377
+ .attr('y2', ty)
378
+ .attr('stroke', colors.metric);
379
+ }
380
+ });
381
+ });
382
+
383
+ // Draw sample boxes
384
+ const sampleGroup = g.append('g').attr('class', 'samples');
385
+
386
+ const sampleBoxes = sampleGroup.selectAll('.sample')
387
+ .data(sampleNodes)
388
+ .join('g')
389
+ .attr('class', 'sample')
390
+ .attr('transform', d => `translate(${d.x},${d.y})`);
391
+
392
+ sampleBoxes.append('rect')
393
+ .attr('class', 'sample-box')
394
+ .attr('width', d => d.width)
395
+ .attr('height', d => d.height)
396
+ .attr('rx', 6)
397
+ .attr('fill', d => d.correct ? colors.correct : colors.incorrect)
398
+ .attr('fill-opacity', 0.3)
399
+ .attr('stroke', d => d.correct ? colors.correct : colors.incorrect)
400
+ .style('cursor', 'pointer')
401
+ .on('mouseenter', function(event, d) {
402
+ const status = d.correct ? 'Correct ✓' : 'Incorrect ✗';
403
+ tipInner.textContent = `Sample ${d.id}: "${d.answer}"\n${status}`;
404
+ tip.style.opacity = '1';
405
+ const [mx, my] = d3.pointer(event, container);
406
+ tip.style.transform = `translate(${mx + 10}px, ${my + 10}px)`;
407
+ })
408
+ .on('mouseleave', function() {
409
+ tip.style.opacity = '0';
410
+ tip.style.transform = 'translate(-9999px, -9999px)';
411
+ });
412
+
413
+ sampleBoxes.append('text')
414
+ .attr('class', 'sample-label')
415
+ .attr('x', d => d.width / 2)
416
+ .attr('y', 18)
417
+ .attr('text-anchor', 'middle')
418
+ .text(d => `#${d.id}`);
419
+
420
+ sampleBoxes.append('text')
421
+ .attr('class', 'sample-answer')
422
+ .attr('x', d => d.width / 2)
423
+ .attr('y', 35)
424
+ .attr('text-anchor', 'middle')
425
+ .attr('font-size', '14px')
426
+ .attr('font-weight', '700')
427
+ .text(d => d.answer);
428
+
429
+ sampleBoxes.append('text')
430
+ .attr('class', 'sample-label')
431
+ .attr('x', d => d.width / 2)
432
+ .attr('y', 50)
433
+ .attr('text-anchor', 'middle')
434
+ .attr('font-size', '10px')
435
+ .text(d => d.correct ? '✓' : '✗');
436
+
437
+ // Draw metric boxes
438
+ const metricGroup = g.append('g').attr('class', 'metrics');
439
+
440
+ const metricBoxes = metricGroup.selectAll('.metric')
441
+ .data(metricNodes)
442
+ .join('g')
443
+ .attr('class', 'metric')
444
+ .attr('transform', d => `translate(${d.x},${d.y})`);
445
+
446
+ metricBoxes.append('rect')
447
+ .attr('class', 'metric-box')
448
+ .attr('width', d => d.width)
449
+ .attr('height', d => d.height)
450
+ .attr('rx', 8)
451
+ .attr('fill', colors.metric)
452
+ .attr('fill-opacity', 0.35)
453
+ .attr('stroke', colors.metric)
454
+ .style('cursor', 'pointer')
455
+ .on('mouseenter', function(event, d) {
456
+ tipInner.textContent = d.explanation;
457
+ tip.style.opacity = '1';
458
+ const [mx, my] = d3.pointer(event, container);
459
+ tip.style.transform = `translate(${mx + 10}px, ${my + 10}px)`;
460
+ })
461
+ .on('mouseleave', function() {
462
+ tip.style.opacity = '0';
463
+ tip.style.transform = 'translate(-9999px, -9999px)';
464
+ });
465
+
466
+ metricBoxes.append('text')
467
+ .attr('class', 'metric-label')
468
+ .attr('x', d => d.width / 2)
469
+ .attr('y', 18)
470
+ .attr('text-anchor', 'middle')
471
+ .text(d => d.label);
472
+
473
+ metricBoxes.append('text')
474
+ .attr('class', 'metric-description')
475
+ .attr('x', d => d.width / 2)
476
+ .attr('y', 32)
477
+ .attr('text-anchor', 'middle')
478
+ .text(d => d.description);
479
+
480
+ metricBoxes.append('text')
481
+ .attr('class', 'metric-result')
482
+ .attr('x', d => d.width / 2)
483
+ .attr('y', 56)
484
+ .attr('text-anchor', 'middle')
485
+ .attr('fill', colors.metric)
486
+ .text(d => d.result);
487
+ }
488
+
489
+ render();
490
+
491
+ // Responsive handling
492
+ if (window.ResizeObserver) {
493
+ const ro = new ResizeObserver(() => render());
494
+ ro.observe(container);
495
+ } else {
496
+ window.addEventListener('resize', render);
497
+ }
498
+ };
499
+
500
+ if (document.readyState === 'loading') {
501
+ document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
502
+ } else {
503
+ ensureD3(bootstrap);
504
+ }
505
+ })();
506
+ </script>