Clémentine committed on
Commit
bb4414b
·
1 Parent(s): 0322f30
app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx CHANGED
@@ -4,6 +4,7 @@ title: "Designing your automatic evaluation"
4
 
5
  import Note from "../../../components/Note.astro";
6
  import Sidenote from "../../../components/Sidenote.astro";
 
7
  import UsingHumanAnnotators from "../human-evaluation/using-human-annotators.mdx";
8
 
9
  ### Dataset
@@ -114,10 +115,13 @@ When there is a ground truth, however, you can use automatic metrics, let's see
114
  #### Metrics
115
  Most ways to automatically compare a string of text to a reference are match-based.
116
 
117
- The easiest but least flexible match based metrics are **exact matches** of token sequences. <Sidenote> Be aware that "exact match" is used as a catch all name, and also includes "fuzzy matches" of strings: compared with normalization, on subsets of tokens (prefix only for ex), etc </Sidenote>. While simple and unambiguous, they provide no partial credit - a prediction that's correct except for one word scores the same as one that's completely wrong.
118
 
119
- The translation and summarisation fields have introduced automatic metrics which compare similarity through overlap of n-grams in sequences. **BLEU** (Bilingual Evaluation Understudy) measures n-gram overlap with reference translations and remains widely used despite having a length bias toward shorter translations and correlating poorly with humans at the sentence level (it notably won't work well for predictions which are semantically equivalent but written in a different fashion than the reference). **ROUGE** does a similar thing but focuses more on recall-oriented n-gram overlap.
120
- Lastly, you'll also find model-based metrics using embedding distances for similarity like **BLEURT** (it uses BERT-based learned representations trained on human judgments from WMT, providing better semantic understanding than n-gram methods, but requiring a model download and task-specific fine-tuning for optimal performance).
 
 
 
121
 
122
  Once you have an accuracy score per sample, you can **aggregate** it across your whole set in several ways. In general, people average their results, but you can do more complex things depending on your needs. (Some metrics already come with an aggregation, like CorpusBLEU).
123
 
 
4
 
5
  import Note from "../../../components/Note.astro";
6
  import Sidenote from "../../../components/Sidenote.astro";
7
+ import HtmlEmbed from "../../../components/HtmlEmbed.astro";
8
  import UsingHumanAnnotators from "../human-evaluation/using-human-annotators.mdx";
9
 
10
  ### Dataset
 
115
  #### Metrics
116
  Most ways to automatically compare a string of text to a reference are match-based.
117
 
118
+ The easiest but least flexible match-based metrics are **exact matches** of token sequences. While simple and unambiguous, they provide no partial credit: a prediction that's correct except for one word scores the same as one that's completely wrong. <Sidenote> Be aware that "exact match" is used as a catch-all name, and also includes "fuzzy matches" of strings: comparisons after normalization, on subsets of tokens (prefix only, for example), etc. </Sidenote>
119
 
120
+ The translation and summarisation fields have introduced automatic metrics which compare similarity through overlap of n-grams in sequences. **BLEU** (Bilingual Evaluation Understudy) measures n-gram overlap with reference translations and remains widely used despite having a length bias toward shorter translations and correlating poorly with human judgments at the sentence level (it notably won't work well for predictions which are semantically equivalent but written in a different fashion than the reference). **ROUGE** does a similar thing but focuses more on recall-oriented n-gram overlap. A simpler relative of these is **TER** (translation error rate), which counts the number of edits required to go from a prediction to the correct reference (similar to an edit distance).
121
+ Lastly, you'll also find model-based metrics using embedding distances for similarity like **BLEURT** (it uses BERT-based learned representations trained on human judgments from WMT, providing better semantic understanding than n-gram methods, but requiring a model download and task-specific fine-tuning for optimal performance).
122
+ I'm introducing the best-known metrics here, but all of them have variations and extensions, such as CorpusBLEU, GLEU, MAUVE, and METEOR, to cite a few.
123
+
124
+ <HtmlEmbed src="d3-text-metrics.html" frameless />
125
 
126
  Once you have an accuracy score per sample, you can **aggregate** it across your whole set in several ways. In general, people average their results, but you can do more complex things depending on your needs. (Some metrics already come with an aggregation, like CorpusBLEU).
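To make the match-based metrics in this hunk concrete, here is a minimal sketch of exact match, n-gram precision (the building block of BLEU), a ROUGE-1 style score, and mean aggregation. It is an illustration under stated assumptions, not how any evaluation library implements them: whitespace tokenization, lowercasing, and every function name below are hypothetical choices made for this example.

```javascript
// Assumption: lowercase + whitespace tokenization. Real metrics use their own tokenizers.
const tokenize = (text) => text.toLowerCase().split(/\s+/).filter(Boolean);

// Exact match after normalization (one of the "fuzzy" variants): 1 or 0, no partial credit.
const exactMatch = (prediction, reference) =>
  tokenize(prediction).join(' ') === tokenize(reference).join(' ') ? 1 : 0;

// Count the n-grams of a token list into a Map of n-gram -> count.
const ngramCounts = (tokens, n) => {
  const counts = new Map();
  for (let i = 0; i + n <= tokens.length; i++) {
    const gram = tokens.slice(i, i + n).join(' ');
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
};

// Sum of all counts in a Map.
const totalCount = (counts) => [...counts.values()].reduce((a, b) => a + b, 0);

// Clipped overlap: each predicted n-gram is credited at most as often as it occurs in the reference.
const clippedOverlap = (predCounts, refCounts) => {
  let matches = 0;
  for (const [gram, count] of predCounts) {
    matches += Math.min(count, refCounts.get(gram) || 0);
  }
  return matches;
};

// Modified n-gram precision: matched n-grams / n-grams in the prediction.
const ngramPrecision = (prediction, reference, n) => {
  const pred = ngramCounts(tokenize(prediction), n);
  const ref = ngramCounts(tokenize(reference), n);
  const predTotal = totalCount(pred);
  return predTotal === 0 ? 0 : clippedOverlap(pred, ref) / predTotal;
};

// ROUGE-1 style unigram precision, recall and F1.
const rouge1 = (prediction, reference) => {
  const pred = ngramCounts(tokenize(prediction), 1);
  const ref = ngramCounts(tokenize(reference), 1);
  const matches = clippedOverlap(pred, ref);
  const precision = totalCount(pred) === 0 ? 0 : matches / totalCount(pred);
  const recall = totalCount(ref) === 0 ? 0 : matches / totalCount(ref);
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
};

// Simplest aggregation: average the per-sample scores (0 for an empty set, by convention here).
const meanScore = (scores) =>
  scores.length === 0 ? 0 : scores.reduce((a, b) => a + b, 0) / scores.length;

console.log(exactMatch('Evaluation is an amazing topic', 'Evaluation is amazing'));
console.log(rouge1('Evaluation is an amazing topic', 'Evaluation is amazing'));
```

A full BLEU score additionally combines several n-gram orders through a geometric mean and applies a brevity penalty to predictions shorter than the reference; both are left out of this sketch.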
127
 
app/src/content/chapters/general-knowledge/2025-evaluations-for-useful-models.mdx CHANGED
@@ -136,7 +136,7 @@ Competitive bluffing games like [Poker](https://arxiv.org/html/2501.08328v1) (20
136
 
137
  What's also very neat about these is that they have a single and unambiguous pass/fail metric: did the LLM win the game or not? At the moment, if I were to use these to evaluate models I would probably look at TextQuests for abilities and Town of Salem for safety.
138
 
139
- Beyond testing capabilities in controlled environments, there's one type of evaluation that's inherently impossible to game: predicting the future. (Ok it's a tangent but I find these super fun and they could be relevant!)
140
 
141
  #### Forecasters
142
  In the last year, a new category of impossible-to-contaminate tasks emerged: forecasting. (I guess technically forecasting on the stock markets can be gamed through market manipulation, but hopefully we're not there yet in terms of financial incentives to mess up evals.) These tasks should require reasoning across sources to answer questions about events that have not yet occurred, but it's uncertain whether these benchmarks are discriminative enough to have strong value, and they likely reinforce the "slot machine success" vibe of LLMs. (Is the performance on some events close to random because they are impossible to predict or because models are bad at it? In the other direction, if models are able to predict the event correctly, is the question too easy or too formulaic?)
 
136
 
137
  What's also very neat about these is that they have a single and unambiguous pass/fail metric: did the LLM win the game or not? At the moment, if I were to use these to evaluate models I would probably look at TextQuests for abilities and Town of Salem for safety.
138
 
139
+ Beyond testing capabilities in controlled environments, people have explored the ultimate ungameable task: predicting the future.
140
 
141
  #### Forecasters
142
  In the last year, a new category of impossible-to-contaminate tasks emerged: forecasting. (I guess technically forecasting on the stock markets can be gamed through market manipulation, but hopefully we're not there yet in terms of financial incentives to mess up evals.) These tasks should require reasoning across sources to answer questions about events that have not yet occurred, but it's uncertain whether these benchmarks are discriminative enough to have strong value, and they likely reinforce the "slot machine success" vibe of LLMs. (Is the performance on some events close to random because they are impossible to predict or because models are bad at it? In the other direction, if models are able to predict the event correctly, is the question too easy or too formulaic?)
app/src/content/embeds/d3-binary-metrics.html ADDED
@@ -0,0 +1,400 @@
1
+ <div class="d3-binary-metrics"></div>
2
+
3
+ <style>
4
+ .d3-binary-metrics {
5
+ font-family: var(--default-font-family);
6
+ background: transparent;
7
+ border: none;
8
+ border-radius: 0;
9
+ padding: var(--spacing-4) 0;
10
+ width: 100%;
11
+ margin: 0 auto;
12
+ }
13
+
14
+ .d3-binary-metrics .metrics-container {
15
+ display: flex;
16
+ flex-direction: column;
17
+ gap: var(--spacing-4);
18
+ }
19
+
20
+ .d3-binary-metrics .confusion-matrix {
21
+ display: grid;
22
+ grid-template-columns: 100px 1fr 1fr;
23
+ grid-template-rows: 100px 1fr 1fr;
24
+ gap: 2px;
25
+ max-width: 400px;
26
+ margin: 0 auto;
27
+ }
28
+
29
+ .d3-binary-metrics .matrix-label {
30
+ display: flex;
31
+ align-items: center;
32
+ justify-content: center;
33
+ font-size: 14px;
34
+ font-weight: 600;
35
+ color: var(--text-color);
36
+ }
37
+
38
+ .d3-binary-metrics .matrix-header-row {
39
+ grid-column: 1;
40
+ grid-row: 1;
41
+ }
42
+
43
+ .d3-binary-metrics .matrix-header-col {
44
+ grid-row: 1;
45
+ grid-column: 1;
46
+ }
47
+
48
+ .d3-binary-metrics .predicted-label {
49
+ grid-column: 2 / 4;
50
+ grid-row: 1;
51
+ font-size: 13px;
52
+ font-weight: 700;
53
+ color: var(--primary-color);
54
+ text-transform: uppercase;
55
+ letter-spacing: 0.05em;
56
+ }
57
+
58
+ .d3-binary-metrics .actual-label {
59
+ grid-column: 1;
60
+ grid-row: 2 / 4;
61
+ writing-mode: vertical-rl;
62
+ transform: rotate(180deg);
63
+ font-size: 13px;
64
+ font-weight: 700;
65
+ color: var(--primary-color);
66
+ text-transform: uppercase;
67
+ letter-spacing: 0.05em;
68
+ }
69
+
70
+ .d3-binary-metrics .matrix-pos-label {
71
+ grid-column: 2;
72
+ grid-row: 1;
73
+ font-size: 12px;
74
+ padding-bottom: 10px;
75
+ }
76
+
77
+ .d3-binary-metrics .matrix-neg-label {
78
+ grid-column: 3;
79
+ grid-row: 1;
80
+ font-size: 12px;
81
+ padding-bottom: 10px;
82
+ }
83
+
84
+ .d3-binary-metrics .matrix-pos-label-row {
85
+ grid-column: 1;
86
+ grid-row: 2;
87
+ font-size: 12px;
88
+ padding-right: 10px;
89
+ }
90
+
91
+ .d3-binary-metrics .matrix-neg-label-row {
92
+ grid-column: 1;
93
+ grid-row: 3;
94
+ font-size: 12px;
95
+ padding-right: 10px;
96
+ }
97
+
98
+ .d3-binary-metrics .matrix-cell {
99
+ display: flex;
100
+ flex-direction: column;
101
+ align-items: center;
102
+ justify-content: center;
103
+ padding: var(--spacing-3);
104
+ border-radius: 8px;
105
+ min-height: 100px;
106
+ border: 2px solid;
107
+ transition: all 0.3s ease;
108
+ }
109
+
110
+ .d3-binary-metrics .matrix-cell:hover {
111
+ transform: scale(1.05);
112
+ box-shadow: 0 4px 12px rgba(0, 0, 0, 0.15);
113
+ }
114
+
115
+ .d3-binary-metrics .cell-tp {
116
+ grid-column: 2;
117
+ grid-row: 2;
118
+ background: oklch(from var(--primary-color) calc(l + 0.35) calc(c * 0.8) h / 0.3);
119
+ border-color: oklch(from var(--primary-color) calc(l + 0.1) c h / 0.7);
120
+ }
121
+
122
+ .d3-binary-metrics .cell-fp {
123
+ grid-column: 2; /* predicted: correct */
124
+ grid-row: 3; /* actual: incorrect */
125
+ background: oklch(from #ff6b6b calc(l + 0.35) c h / 0.25);
126
+ border-color: oklch(from #ff6b6b calc(l + 0.1) c h / 0.6);
127
+ }
128
+
129
+ .d3-binary-metrics .cell-fn {
130
+ grid-column: 3; /* predicted: incorrect */
131
+ grid-row: 2; /* actual: correct */
132
+ background: oklch(from #ffa500 calc(l + 0.35) c h / 0.25);
133
+ border-color: oklch(from #ffa500 calc(l + 0.1) c h / 0.6);
134
+ }
135
+
136
+ .d3-binary-metrics .cell-tn {
137
+ grid-column: 3;
138
+ grid-row: 3;
139
+ background: oklch(from var(--primary-color) calc(l + 0.35) calc(c * 0.8) h / 0.3);
140
+ border-color: oklch(from var(--primary-color) calc(l + 0.1) c h / 0.7);
141
+ }
142
+
143
+ [data-theme="dark"] .d3-binary-metrics .cell-tp,
144
+ [data-theme="dark"] .d3-binary-metrics .cell-tn {
145
+ background: oklch(from var(--primary-color) calc(l + 0.25) calc(c * 0.8) h / 0.25);
146
+ border-color: oklch(from var(--primary-color) calc(l + 0.05) c h / 0.75);
147
+ }
148
+
149
+ [data-theme="dark"] .d3-binary-metrics .cell-fp {
150
+ background: oklch(from #ff6b6b calc(l + 0.25) c h / 0.2);
151
+ border-color: oklch(from #ff6b6b calc(l + 0.05) c h / 0.65);
152
+ }
153
+
154
+ [data-theme="dark"] .d3-binary-metrics .cell-fn {
155
+ background: oklch(from #ffa500 calc(l + 0.25) c h / 0.2);
156
+ border-color: oklch(from #ffa500 calc(l + 0.05) c h / 0.65);
157
+ }
158
+
159
+ .d3-binary-metrics .cell-label {
160
+ font-size: 11px;
161
+ font-weight: 700;
162
+ color: var(--text-color);
163
+ text-transform: uppercase;
164
+ letter-spacing: 0.05em;
165
+ margin-bottom: var(--spacing-1);
166
+ }
167
+
168
+ .d3-binary-metrics .cell-value {
169
+ font-size: 32px;
170
+ font-weight: 700;
171
+ color: var(--text-color);
172
+ }
173
+
174
+ .d3-binary-metrics .cell-description {
175
+ font-size: 10px;
176
+ color: var(--muted-color);
177
+ text-align: center;
178
+ margin-top: var(--spacing-1);
179
+ }
180
+
181
+ .d3-binary-metrics .metrics-grid {
182
+ display: grid;
183
+ grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
184
+ gap: var(--spacing-3);
185
+ margin-top: var(--spacing-4);
186
+ }
187
+
188
+ .d3-binary-metrics .metric-card {
189
+ background: oklch(from var(--primary-color) calc(l + 0.42) c h / 0.25);
190
+ border: 1px solid oklch(from var(--primary-color) calc(l + 0.2) c h / 0.5);
191
+ border-radius: 12px;
192
+ padding: var(--spacing-4);
193
+ display: flex;
194
+ flex-direction: column;
195
+ gap: var(--spacing-2);
196
+ }
197
+
198
+ [data-theme="dark"] .d3-binary-metrics .metric-card {
199
+ background: oklch(from var(--primary-color) calc(l + 0.32) c h / 0.2);
200
+ border-color: oklch(from var(--primary-color) calc(l + 0.15) c h / 0.55);
201
+ }
202
+
203
+ .d3-binary-metrics .metric-name {
204
+ font-size: 15px;
205
+ font-weight: 700;
206
+ color: var(--primary-color);
207
+ }
208
+
209
+ [data-theme="dark"] .d3-binary-metrics .metric-name {
210
+ color: oklch(from var(--primary-color) calc(l + 0.05) calc(c * 1.1) h);
211
+ }
212
+
213
+ .d3-binary-metrics .metric-formula {
214
+ font-size: 13px;
215
+ color: var(--text-color);
216
+ font-family: monospace;
217
+ background: var(--surface-bg);
218
+ padding: var(--spacing-2);
219
+ border-radius: 6px;
220
+ border: 1px solid var(--border-color);
221
+ }
222
+
223
+ .d3-binary-metrics .metric-value {
224
+ font-size: 24px;
225
+ font-weight: 700;
226
+ color: var(--primary-color);
227
+ text-align: center;
228
+ }
229
+
230
+ .d3-binary-metrics .metric-interpretation {
231
+ font-size: 12px;
232
+ color: var(--muted-color);
233
+ line-height: 1.4;
234
+ }
235
+
236
+ .d3-binary-metrics .example-title {
237
+ font-size: 16px;
238
+ font-weight: 700;
239
+ color: var(--primary-color);
240
+ text-align: center;
241
+ margin-bottom: var(--spacing-3);
242
+ }
243
+
244
+ .d3-binary-metrics .example-description {
245
+ font-size: 13px;
246
+ color: var(--text-color);
247
+ text-align: center;
248
+ font-style: italic;
249
+ margin-bottom: var(--spacing-4);
250
+ }
251
+
252
+ @media (max-width: 768px) {
253
+ .d3-binary-metrics .confusion-matrix {
254
+ max-width: 100%;
255
+ grid-template-columns: 80px 1fr 1fr;
256
+ grid-template-rows: 80px 1fr 1fr;
257
+ }
258
+
259
+ .d3-binary-metrics .matrix-cell {
260
+ min-height: 80px;
261
+ padding: var(--spacing-2);
262
+ }
263
+
264
+ .d3-binary-metrics .cell-value {
265
+ font-size: 24px;
266
+ }
267
+
268
+ .d3-binary-metrics .metrics-grid {
269
+ grid-template-columns: 1fr;
270
+ }
271
+ }
272
+ </style>
273
+
274
+ <script>
275
+ (() => {
276
+ const bootstrap = () => {
277
+ const scriptEl = document.currentScript;
278
+ let container = scriptEl ? scriptEl.previousElementSibling : null;
279
+ if (!(container && container.classList && container.classList.contains('d3-binary-metrics'))) {
280
+ const candidates = Array.from(document.querySelectorAll('.d3-binary-metrics'))
281
+ .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
282
+ container = candidates[candidates.length - 1] || null;
283
+ }
284
+
285
+ if (!container) return;
286
+
287
+ if (container.dataset) {
288
+ if (container.dataset.mounted === 'true') return;
289
+ container.dataset.mounted = 'true';
290
+ }
291
+
292
+ // Example: Question answering - checking if answer is correct
293
+ const TP = 45; // Correctly identified as correct answer
294
+ const FP = 8; // Incorrect answer marked as correct
295
+ const FN = 5; // Correct answer marked as incorrect
296
+ const TN = 42; // Correctly identified as incorrect answer
297
+
298
+ // Calculate metrics
299
+ const precision = TP / (TP + FP);
300
+ const recall = TP / (TP + FN);
301
+ const f1 = 2 * (precision * recall) / (precision + recall);
302
+
303
+ // MCC calculation
304
+ const numerator = (TP * TN) - (FP * FN);
305
+ const denominator = Math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN));
306
+ const mcc = numerator / denominator;
307
+
308
+ container.innerHTML = `
309
+ <div class="metrics-container">
310
+ <div class="example-title">Binary Classification Metrics Example</div>
311
+ <div class="example-description">
312
+ Question Answering: 100 model predictions evaluated (50 correct, 50 incorrect)
313
+ </div>
314
+
315
+ <div class="confusion-matrix">
316
+ <div class="matrix-label predicted-label">Predicted</div>
317
+ <div class="matrix-label actual-label">Actual</div>
318
+
319
+ <div class="matrix-label matrix-pos-label">Correct</div>
320
+ <div class="matrix-label matrix-neg-label">Incorrect</div>
321
+ <div class="matrix-label matrix-pos-label-row">Correct</div>
322
+ <div class="matrix-label matrix-neg-label-row">Incorrect</div>
323
+
324
+ <div class="matrix-cell cell-tp">
325
+ <div class="cell-label">True Positive</div>
326
+ <div class="cell-value">${TP}</div>
327
+ <div class="cell-description">Correct answer identified as correct</div>
328
+ </div>
329
+
330
+ <div class="matrix-cell cell-fp">
331
+ <div class="cell-label">False Positive</div>
332
+ <div class="cell-value">${FP}</div>
333
+ <div class="cell-description">Incorrect answer marked as correct</div>
334
+ </div>
335
+
336
+ <div class="matrix-cell cell-fn">
337
+ <div class="cell-label">False Negative</div>
338
+ <div class="cell-value">${FN}</div>
339
+ <div class="cell-description">Correct answer marked as incorrect</div>
340
+ </div>
341
+
342
+ <div class="matrix-cell cell-tn">
343
+ <div class="cell-label">True Negative</div>
344
+ <div class="cell-value">${TN}</div>
345
+ <div class="cell-description">Incorrect answer identified as incorrect</div>
346
+ </div>
347
+ </div>
348
+
349
+ <div class="metrics-grid">
350
+ <div class="metric-card">
351
+ <div class="metric-name">Precision</div>
352
+ <div class="metric-formula">TP / (TP + FP)</div>
353
+ <div class="metric-value">${precision.toFixed(3)}</div>
354
+ <div class="metric-interpretation">
355
+ ${(precision * 100).toFixed(1)}% of answers marked correct are actually correct.
356
+ Critical when false positives (wrong answers accepted) are costly.
357
+ </div>
358
+ </div>
359
+
360
+ <div class="metric-card">
361
+ <div class="metric-name">Recall</div>
362
+ <div class="metric-formula">TP / (TP + FN)</div>
363
+ <div class="metric-value">${recall.toFixed(3)}</div>
364
+ <div class="metric-interpretation">
365
+ ${(recall * 100).toFixed(1)}% of actually correct answers were identified.
366
+ Critical when missing positives (rejecting correct answers) is costly.
367
+ </div>
368
+ </div>
369
+
370
+ <div class="metric-card">
371
+ <div class="metric-name">F1 Score</div>
372
+ <div class="metric-formula">2 × (P × R) / (P + R)</div>
373
+ <div class="metric-value">${f1.toFixed(3)}</div>
374
+ <div class="metric-interpretation">
375
+ Harmonic mean of precision and recall.
376
+ Balances both metrics, good for imbalanced data.
377
+ </div>
378
+ </div>
379
+
380
+ <div class="metric-card">
381
+ <div class="metric-name">MCC</div>
382
+ <div class="metric-formula">(TP×TN - FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))</div>
383
+ <div class="metric-value">${mcc.toFixed(3)}</div>
384
+ <div class="metric-interpretation">
385
+ Matthews Correlation Coefficient ranges from -1 to +1.
386
+ Works well with imbalanced datasets.
387
+ </div>
388
+ </div>
389
+ </div>
390
+ </div>
391
+ `;
392
+ };
393
+
394
+ if (document.readyState === 'loading') {
395
+ document.addEventListener('DOMContentLoaded', bootstrap, { once: true });
396
+ } else {
397
+ bootstrap();
398
+ }
399
+ })();
400
+ </script>
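The embed above hard-codes its four confusion-matrix counts, so none of its denominators can be zero. As a hedged sketch of how the same formulas could be wrapped for arbitrary counts, the helper below guards each division. The names `safeDivide` and `binaryMetrics` are hypothetical, and returning 0 for an undefined ratio is a convention assumed for this example, not something the commit does.

```javascript
// Assumed convention: an undefined ratio (zero denominator) is reported as 0.
const safeDivide = (numerator, denominator) =>
  denominator === 0 ? 0 : numerator / denominator;

// Precision, recall, F1 and MCC from raw confusion-matrix counts.
const binaryMetrics = ({ tp, fp, fn, tn }) => {
  const precision = safeDivide(tp, tp + fp); // share of predicted positives that are right
  const recall = safeDivide(tp, tp + fn);    // share of actual positives that are found
  const f1 = safeDivide(2 * precision * recall, precision + recall);
  const mcc = safeDivide(
    tp * tn - fp * fn,
    Math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  );
  return { precision, recall, f1, mcc };
};

// Same counts as the embed: 45 TP, 8 FP, 5 FN, 42 TN.
const example = binaryMetrics({ tp: 45, fp: 8, fn: 5, tn: 42 });
console.log(
  example.precision.toFixed(3),
  example.recall.toFixed(3),
  example.f1.toFixed(3),
  example.mcc.toFixed(3)
); // prints: 0.849 0.900 0.874 0.741
```

With these counts F1 can be cross-checked by hand as 2·TP / (2·TP + FP + FN) = 90 / 103.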
app/src/content/embeds/d3-metrics-comparison.html ADDED
@@ -0,0 +1,572 @@
1
+ <div class="d3-metrics-comparison"></div>
2
+
3
+ <style>
4
+ .d3-metrics-comparison {
5
+ font-family: var(--default-font-family);
6
+ background: transparent;
7
+ border: none;
8
+ border-radius: 0;
9
+ padding: var(--spacing-4) 0;
10
+ width: 100%;
11
+ margin: 0 auto;
12
+ position: relative;
13
+ }
14
+
15
+ .d3-metrics-comparison svg {
16
+ width: 100%;
17
+ height: auto;
18
+ display: block;
19
+ }
20
+
21
+ .d3-metrics-comparison .node-rect {
22
+ stroke-width: 2;
23
+ transition: all 0.3s ease;
24
+ }
25
+
26
+ .d3-metrics-comparison .node-rect:hover {
27
+ filter: brightness(1.1);
28
+ stroke-width: 3;
29
+ }
30
+
31
+ .d3-metrics-comparison .input-node {
32
+ fill: oklch(from var(--primary-color) calc(l + 0.42) c h / 0.35);
33
+ stroke: oklch(from var(--primary-color) calc(l + 0.1) c h / 0.7);
34
+ }
35
+
36
+ .d3-metrics-comparison .method-node {
37
+ fill: oklch(from var(--primary-color) calc(l + 0.38) c h / 0.45);
38
+ stroke: var(--primary-color);
39
+ }
40
+
41
+ .d3-metrics-comparison .score-node {
42
+ fill: oklch(from var(--primary-color) calc(l + 0.35) c h / 0.55);
43
+ stroke: oklch(from var(--primary-color) calc(l - 0.05) calc(c * 1.2) h);
44
+ }
45
+
46
+ [data-theme="dark"] .d3-metrics-comparison .input-node {
47
+ fill: oklch(from var(--primary-color) calc(l + 0.32) c h / 0.3);
48
+ stroke: oklch(from var(--primary-color) calc(l + 0.05) c h / 0.75);
49
+ }
50
+
51
+ [data-theme="dark"] .d3-metrics-comparison .method-node {
52
+ fill: oklch(from var(--primary-color) calc(l + 0.28) c h / 0.4);
53
+ stroke: oklch(from var(--primary-color) calc(l + 0.05) calc(c * 1.1) h);
54
+ }
55
+
56
+ [data-theme="dark"] .d3-metrics-comparison .score-node {
57
+ fill: oklch(from var(--primary-color) calc(l + 0.25) c h / 0.5);
58
+ stroke: oklch(from var(--primary-color) calc(l) calc(c * 1.3) h);
59
+ }
60
+
61
+ .d3-metrics-comparison .node-label {
62
+ fill: var(--text-color);
63
+ font-size: 13px;
64
+ font-weight: 600;
65
+ pointer-events: none;
66
+ user-select: none;
67
+ }
68
+
69
+ .d3-metrics-comparison .node-sublabel {
70
+ fill: var(--muted-color);
71
+ font-size: 10px;
72
+ font-weight: 500;
73
+ pointer-events: none;
74
+ user-select: none;
75
+ }
76
+
77
+ .d3-metrics-comparison .node-example {
78
+ fill: var(--text-color);
79
+ font-size: 10px;
80
+ font-weight: 500;
81
+ font-style: italic;
82
+ pointer-events: none;
83
+ user-select: none;
84
+ }
85
+
86
+ .d3-metrics-comparison .link-path {
87
+ fill: none;
88
+ stroke: oklch(from var(--primary-color) l c h / 0.4);
89
+ stroke-width: 2;
90
+ transition: all 0.3s ease;
91
+ }
92
+
93
+ [data-theme="dark"] .d3-metrics-comparison .link-path {
94
+ stroke: oklch(from var(--primary-color) l c h / 0.5);
95
+ }
96
+
97
+ .d3-metrics-comparison .link-path:hover {
98
+ stroke: var(--primary-color);
99
+ stroke-width: 3;
100
+ }
101
+
102
+ .d3-metrics-comparison .link-label {
103
+ fill: var(--text-color);
104
+ font-size: 10px;
105
+ font-weight: 600;
106
+ pointer-events: none;
107
+ user-select: none;
108
+ }
109
+
110
+ .d3-metrics-comparison .score-badge {
111
+ fill: var(--primary-color);
112
+ font-size: 14px;
113
+ font-weight: 700;
114
+ pointer-events: none;
115
+ user-select: none;
116
+ }
117
+
118
+ .d3-metrics-comparison .score-badge-bg {
119
+ fill: var(--surface-bg);
120
+ stroke: var(--primary-color);
121
+ stroke-width: 2;
122
+ }
123
+
124
+ .d3-metrics-comparison .section-title {
125
+ fill: var(--primary-color);
126
+ font-size: 12px;
127
+ font-weight: 700;
128
+ text-transform: uppercase;
129
+ letter-spacing: 0.05em;
130
+ }
131
+
132
+ [data-theme="dark"] .d3-metrics-comparison .section-title {
133
+ fill: oklch(from var(--primary-color) calc(l + 0.1) calc(c * 1.2) h);
134
+ }
135
+
136
+ .d3-metrics-comparison .marker {
137
+ fill: oklch(from var(--primary-color) l c h / 0.6);
138
+ }
139
+
140
+ .d3-metrics-comparison .tooltip {
141
+ position: absolute;
142
+ background: var(--surface-bg);
143
+ border: 1px solid var(--border-color);
144
+ border-radius: 8px;
145
+ padding: 10px 14px;
146
+ font-size: 12px;
147
+ pointer-events: none;
148
+ opacity: 0;
149
+ transition: opacity 0.2s ease;
150
+ box-shadow: 0 4px 12px rgba(0, 0, 0, 0.15);
151
+ z-index: 1000;
152
+ max-width: 350px;
153
+ line-height: 1.5;
154
+ white-space: pre-line;
155
+ color: var(--text-color);
156
+ }
157
+
158
+ .d3-metrics-comparison .tooltip.visible {
159
+ opacity: 1;
160
+ }
161
+
162
+ @media (max-width: 768px) {
163
+ .d3-metrics-comparison .node-label {
164
+ font-size: 11px;
165
+ }
166
+
167
+ .d3-metrics-comparison .node-sublabel {
168
+ font-size: 9px;
169
+ }
170
+
171
+ .d3-metrics-comparison .node-example {
172
+ font-size: 9px;
173
+ }
174
+
175
+ .d3-metrics-comparison .link-label {
176
+ font-size: 9px;
177
+ }
178
+
179
+ .d3-metrics-comparison .score-badge {
180
+ font-size: 12px;
181
+ }
182
+ }
183
+ </style>
184
+
185
+ <script>
186
+ (() => {
187
+ const ensureD3 = (cb) => {
188
+ if (window.d3 && typeof window.d3.select === 'function') return cb();
189
+ let s = document.getElementById('d3-cdn-script');
190
+ if (!s) {
191
+ s = document.createElement('script');
192
+ s.id = 'd3-cdn-script';
193
+ s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
194
+ document.head.appendChild(s);
195
+ }
196
+ const onReady = () => {
197
+ if (window.d3 && typeof window.d3.select === 'function') cb();
198
+ };
199
+ s.addEventListener('load', onReady, { once: true });
200
+ if (window.d3) onReady();
201
+ };
202
+
203
+ const bootstrap = () => {
204
+ const scriptEl = document.currentScript;
205
+ let container = scriptEl ? scriptEl.previousElementSibling : null;
206
+ if (!(container && container.classList && container.classList.contains('d3-metrics-comparison'))) {
207
+ const candidates = Array.from(document.querySelectorAll('.d3-metrics-comparison'))
208
+ .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
209
+ container = candidates[candidates.length - 1] || null;
210
+ }
211
+
212
+ if (!container) return;
213
+
214
+ if (container.dataset) {
215
+ if (container.dataset.mounted === 'true') return;
216
+ container.dataset.mounted = 'true';
217
+ }
218
+
219
+ container.style.position = 'relative';
220
+
221
+ // Tooltip
222
+ const tooltip = document.createElement('div');
223
+ tooltip.className = 'tooltip';
224
+ container.appendChild(tooltip);
225
+
226
+ // Data structure: inputs -> methods -> scores
227
+ const data = {
228
+ inputs: [
229
+ {
230
+ id: 'prediction',
231
+ label: 'Prediction',
232
+ sublabel: '(model output)',
233
+ example: '"Evaluation is an amazing topic"'
234
+ },
235
+ {
236
+ id: 'reference',
237
+ label: 'Reference',
238
+ sublabel: '(ground truth)',
239
+ example: '"Evaluation is amazing"'
240
+ }
241
+ ],
242
+ methods: [
243
+ {
244
+ id: 'exact',
245
+ label: 'Exact Match',
246
+ sublabel: 'token sequences',
247
+ score: '0',
248
+ description: 'Strings don\'t match exactly: the prediction contains the extra words "an" and "topic"',
249
+ scoreType: 'binary'
250
+ },
251
+ {
252
+ id: 'bleu',
253
+ label: 'BLEU',
254
+ sublabel: 'n-gram overlap',
255
+ score: '0.13',
256
+ description: 'Actual BLEU computation:\n• BLEU-1 (unigrams): 0.60 (3/5 match)\n• BLEU-2 (cumulative): 0.39 (bigrams: 1/4 match)\n• BLEU-3 (cumulative): 0.17 (trigrams: 0/3 match, nonzero only with smoothing)\n• Final BLEU (geometric mean): 0.13\n• No brevity penalty applies (prediction is longer than reference)',
257
+ scoreType: 'continuous'
258
+ },
259
+ {
260
+ id: 'rouge',
261
+ label: 'ROUGE',
262
+ sublabel: 'recall-oriented',
263
+ score: '0.75',
264
+ description: 'ROUGE-1 (unigram) scores:\n• Recall: 3/3 = 100% (all reference words found in prediction)\n• Precision: 3/5 = 60% (prediction words in reference)\n• F1 score: 0.75\nReference: ["evaluation", "is", "amazing"]',
265
+ scoreType: 'continuous'
266
+ },
267
+ {
268
+ id: 'bleurt',
269
+ label: 'BLEURT',
270
+ sublabel: 'semantic similarity',
271
+ score: '0.85',
272
+ description: 'High semantic similarity—both express positive sentiment about evaluation',
273
+ scoreType: 'continuous'
274
+ }
275
+ ],
276
+ scores: [
277
+ {
278
+ id: 'binary',
279
+ label: 'Binary Score',
280
+ sublabel: 'correct/incorrect'
281
+ },
282
+ {
283
+ id: 'continuous',
284
+ label: 'Continuous Score',
285
+ sublabel: '0.0 to 1.0'
286
+ }
287
+ ]
288
+ };
289
+
290
+ const svg = d3.select(container).append('svg');
291
+ const g = svg.append('g');
292
+
293
+ // Arrow marker
294
+ svg.append('defs').append('marker')
295
+ .attr('id', 'arrowhead')
296
+ .attr('viewBox', '0 -5 10 10')
297
+ .attr('refX', 8)
298
+ .attr('refY', 0)
299
+ .attr('markerWidth', 6)
300
+ .attr('markerHeight', 6)
301
+ .attr('orient', 'auto')
302
+ .append('path')
303
+ .attr('d', 'M0,-5L10,0L0,5')
304
+ .attr('class', 'marker');
305
+
306
+ let width = 800;
307
+ let height = 500;
308
+
309
+ function wrapText(text, maxWidth) {
310
+ const words = text.split(' ');
311
+ const lines = [];
312
+ let currentLine = words[0];
313
+
314
+ for (let i = 1; i < words.length; i++) {
315
+ const word = words[i];
316
+ const testLine = currentLine + ' ' + word;
317
+ if (testLine.length * 6 < maxWidth) {
318
+ currentLine = testLine;
319
+ } else {
320
+ lines.push(currentLine);
321
+ currentLine = word;
322
+ }
323
+ }
324
+ lines.push(currentLine);
325
+ return lines;
326
+ }
327
+
328
+ function render() {
329
+ width = container.clientWidth || 800;
330
+ height = Math.max(500, Math.round(width * 0.7));
331
+
332
+ svg.attr('width', width).attr('height', height);
333
+
334
+ const margin = { top: 40, right: 20, bottom: 20, left: 20 };
335
+ const innerWidth = width - margin.left - margin.right;
336
+ const innerHeight = height - margin.top - margin.bottom;
337
+
338
+ g.attr('transform', `translate(${margin.left},${margin.top})`);
339
+
340
+ // Clear previous content
341
+ g.selectAll('*').remove();
342
+
343
+ // Column positions with increased horizontal spacing
344
+ const nodeWidth = Math.min(150, innerWidth * 0.2);
345
+ const nodeHeight = 85;
346
+ const gapBetweenColumns = Math.max(80, innerWidth * 0.15);
347
+
348
+ // Calculate column centers with larger gaps
349
+ const col1X = nodeWidth / 2 + 20;
350
+ const col2X = col1X + nodeWidth / 2 + gapBetweenColumns + nodeWidth / 2;
351
+ const col3X = col2X + nodeWidth / 2 + gapBetweenColumns + nodeWidth / 2;
352
+
353
+ // Section titles
354
+ g.selectAll('.section-title')
355
+ .data([
356
+ { x: col1X, label: 'INPUTS' },
357
+ { x: col2X, label: 'COMPARISON METHODS' },
358
+ { x: col3X, label: 'SCORES' }
359
+ ])
360
+ .join('text')
361
+ .attr('class', 'section-title')
362
+ .attr('x', d => d.x)
363
+ .attr('y', -15)
364
+ .attr('text-anchor', 'middle')
365
+ .text(d => d.label);
366
+
367
+ // Calculate positions
368
+ const inputY = innerHeight * 0.25;
369
+ const methodStartY = 40;
370
+ const methodSpacing = (innerHeight - methodStartY - nodeHeight) / Math.max(1, data.methods.length - 1);
371
+
372
+ // Position score nodes to align with specific methods
373
+ // Binary score aligns with Exact Match (index 0)
374
+ // Continuous score aligns with ROUGE (index 2)
375
+ const exactMatchY = methodStartY + 0 * methodSpacing;
376
+ const rougeY = methodStartY + 2 * methodSpacing;
377
+
378
+ // Position nodes
379
+ const inputNodes = data.inputs.map((d, i) => ({
380
+ ...d,
381
+ x: col1X - nodeWidth / 2,
382
+ y: inputY + i * (nodeHeight + 30),
383
+ width: nodeWidth,
384
+ height: nodeHeight,
385
+ type: 'input'
386
+ }));
387
+
388
+ const methodNodes = data.methods.map((d, i) => ({
389
+ ...d,
390
+ x: col2X - nodeWidth / 2,
391
+ y: methodStartY + i * methodSpacing,
392
+ width: nodeWidth,
393
+ height: nodeHeight,
394
+ type: 'method'
395
+ }));
396
+
397
+ const scoreNodes = data.scores.map((d, i) => {
398
+ // Binary score aligns with Exact Match, Continuous with ROUGE
399
+ const yPos = d.id === 'binary' ? exactMatchY : rougeY;
400
+ return {
401
+ ...d,
402
+ x: col3X - nodeWidth / 2,
403
+ y: yPos,
404
+ width: nodeWidth,
405
+ height: nodeHeight,
406
+ type: 'score'
407
+ };
408
+ });
409
+
410
+ const allNodes = [...inputNodes, ...methodNodes, ...scoreNodes];
411
+
412
+ // Create links: inputs -> methods -> scores
413
+ const links = [];
414
+
415
+ // Each input connects to all methods
416
+ inputNodes.forEach(input => {
417
+ methodNodes.forEach(method => {
418
+ links.push({
419
+ source: input,
420
+ target: method,
421
+ type: 'input-method'
422
+ });
423
+ });
424
+ });
425
+
426
+ // Each method connects to appropriate score type
427
+ methodNodes.forEach(method => {
428
+ const targetScore = scoreNodes.find(s => s.id === method.scoreType);
429
+ if (targetScore) {
430
+ links.push({
431
+ source: method,
432
+ target: targetScore,
433
+ type: 'method-score',
434
+ score: method.score
435
+ });
436
+ }
437
+ });
438
+
439
+ // Draw links
440
+ const linkGroup = g.append('g').attr('class', 'links');
441
+
442
+ linkGroup.selectAll('.link-path')
443
+ .data(links)
444
+ .join('path')
445
+ .attr('class', 'link-path')
446
+ .attr('d', d => {
447
+ const sx = d.source.x + d.source.width;
448
+ const sy = d.source.y + d.source.height / 2;
449
+ const tx = d.target.x;
450
+ const ty = d.target.y + d.target.height / 2;
451
+ const mx = (sx + tx) / 2;
452
+ return `M ${sx} ${sy} C ${mx} ${sy}, ${mx} ${ty}, ${tx} ${ty}`;
453
+ })
454
+ .attr('marker-end', 'url(#arrowhead)');
455
+
456
+ // Add score badges on method->score links
457
+ const scoreBadges = linkGroup.selectAll('.score-badge-group')
458
+ .data(links.filter(d => d.type === 'method-score'))
459
+ .join('g')
460
+ .attr('class', 'score-badge-group')
461
+ .attr('transform', d => {
462
+ const sx = d.source.x + d.source.width;
463
+ const sy = d.source.y + d.source.height / 2;
464
+ const tx = d.target.x;
465
+ const ty = d.target.y + d.target.height / 2;
466
+ const mx = (sx + tx) / 2;
467
+ const my = (sy + ty) / 2;
468
+ return `translate(${mx}, ${my})`;
469
+ });
470
+
471
+ scoreBadges.append('rect')
472
+ .attr('class', 'score-badge-bg')
473
+ .attr('x', -20)
474
+ .attr('y', -12)
475
+ .attr('width', 40)
476
+ .attr('height', 24)
477
+ .attr('rx', 6);
478
+
479
+ scoreBadges.append('text')
480
+ .attr('class', 'score-badge')
481
+ .attr('text-anchor', 'middle')
482
+ .attr('dominant-baseline', 'middle')
483
+ .text(d => d.score);
484
+
485
+ // Draw nodes
486
+ const nodeGroup = g.append('g').attr('class', 'nodes');
487
+
488
+ const nodes = nodeGroup.selectAll('.node')
489
+ .data(allNodes)
490
+ .join('g')
491
+ .attr('class', 'node')
492
+ .attr('transform', d => `translate(${d.x},${d.y})`)
493
+ .style('cursor', 'pointer');
494
+
495
+ nodes.append('rect')
496
+ .attr('class', d => `node-rect ${d.type}-node`)
497
+ .attr('width', d => d.width)
498
+ .attr('height', d => d.height)
499
+ .attr('rx', 8)
500
+ .on('mouseenter', function(event, d) {
501
+ if (d.description) {
502
+ tooltip.textContent = d.description;
503
+ tooltip.classList.add('visible');
504
+ const rect = container.getBoundingClientRect();
505
+ tooltip.style.left = (event.clientX - rect.left + 10) + 'px';
506
+ tooltip.style.top = (event.clientY - rect.top + 10) + 'px';
507
+ }
508
+ })
509
+ .on('mouseleave', function() {
510
+ tooltip.classList.remove('visible');
511
+ });
512
+
513
+ nodes.append('text')
514
+ .attr('class', 'node-label')
515
+ .attr('x', d => d.width / 2)
516
+ .attr('y', 18)
517
+ .attr('text-anchor', 'middle')
518
+ .text(d => d.label);
519
+
520
+ nodes.append('text')
521
+ .attr('class', 'node-sublabel')
522
+ .attr('x', d => d.width / 2)
523
+ .attr('y', 32)
524
+ .attr('text-anchor', 'middle')
525
+ .text(d => d.sublabel);
526
+
527
+ // Add example text to input nodes
528
+ nodes.filter(d => d.type === 'input' && d.example)
529
+ .each(function(d) {
530
+ const node = d3.select(this);
531
+ const lines = wrapText(d.example, d.width - 16);
532
+ lines.forEach((line, i) => {
533
+ node.append('text')
534
+ .attr('class', 'node-example')
535
+ .attr('x', d.width / 2)
536
+ .attr('y', 48 + i * 12)
537
+ .attr('text-anchor', 'middle')
538
+ .text(line);
539
+ });
540
+ });
541
+
542
+ // Score is shown on the arrows, not in the method nodes
543
+
544
+ // Add aggregation info to score nodes
545
+ nodes.filter(d => d.type === 'score' && d.aggregations)
546
+ .append('text')
547
+ .attr('class', 'node-sublabel')
548
+ .attr('x', d => d.width / 2)
549
+ .attr('y', d => d.height - 12)
550
+ .attr('text-anchor', 'middle')
551
+ .attr('font-size', '9px')
552
+ .text(d => `${d.aggregations.slice(0, 2).join(', ')}...`);
553
+ }
554
+
555
+ render();
556
+
557
+ // Responsive handling
558
+ if (window.ResizeObserver) {
559
+ const ro = new ResizeObserver(() => render());
560
+ ro.observe(container);
561
+ } else {
562
+ window.addEventListener('resize', render);
563
+ }
564
+ };
565
+
566
+ if (document.readyState === 'loading') {
567
+ document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
568
+ } else {
569
+ ensureD3(bootstrap);
570
+ }
571
+ })();
572
+ </script>
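The link-drawing step in the embed above can be read in isolation. The sketch below is an illustration rather than part of the commit: `linkPath` is a hypothetical helper name, and it assumes node objects carrying `x`, `y`, `width`, and `height` as the embed's nodes do. It builds the same cubic Bézier path string, running from the right edge of the source node to the left edge of the target node.

```javascript
// Hypothetical helper (not part of the commit): returns the SVG path string
// for a cubic Bezier curve from the source's right edge to the target's left edge.
function linkPath(source, target) {
  const sx = source.x + source.width;       // start: right edge of the source
  const sy = source.y + source.height / 2;  // start: vertical center of the source
  const tx = target.x;                      // end: left edge of the target
  const ty = target.y + target.height / 2;  // end: vertical center of the target
  const mx = (sx + tx) / 2;                 // both control points sit at the horizontal midpoint
  return `M ${sx} ${sy} C ${mx} ${sy}, ${mx} ${ty}, ${tx} ${ty}`;
}

const examplePath = linkPath(
  { x: 0, y: 0, width: 100, height: 50 },
  { x: 200, y: 100, width: 100, height: 50 }
);
console.log(examplePath); // M 100 25 C 150 25, 150 125, 200 125
```

Because both control points share the midpoint x-coordinate, the curve leaves the source and enters the target horizontally, which keeps the arrowheads level.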
app/src/content/embeds/d3-precision-recall.html ADDED
@@ -0,0 +1,348 @@
1
+ <div class="d3-precision-recall"></div>
2
+
3
+ <style>
4
+ .d3-precision-recall {
5
+ font-family: var(--default-font-family);
6
+ background: transparent;
7
+ border: none;
8
+ border-radius: 0;
9
+ padding: var(--spacing-4) 0;
10
+ width: 100%;
11
+ margin: 0 auto;
12
+ }
13
+
14
+ .d3-precision-recall svg {
15
+ width: 100%;
16
+ height: auto;
17
+ display: block;
18
+ }
19
+
20
+ .d3-precision-recall .circle {
21
+ fill: none;
22
+ stroke-width: 3;
23
+ opacity: 0.8;
24
+ }
25
+
26
+ .d3-precision-recall .circle-predicted {
27
+ stroke: #4A90E2;
28
+ fill: #4A90E2;
29
+ fill-opacity: 0.15;
30
+ }
31
+
32
+ .d3-precision-recall .circle-relevant {
33
+ stroke: #F5A623;
34
+ fill: #F5A623;
35
+ fill-opacity: 0.15;
36
+ }
37
+
38
+ .d3-precision-recall .intersection {
39
+ fill: #7ED321;
40
+ fill-opacity: 0.3;
41
+ }
42
+
43
+ [data-theme="dark"] .d3-precision-recall .circle-predicted {
44
+ stroke: #5DA9FF;
45
+ fill: #5DA9FF;
46
+ }
47
+
48
+ [data-theme="dark"] .d3-precision-recall .circle-relevant {
49
+ stroke: #FFB84D;
50
+ fill: #FFB84D;
51
+ }
52
+
53
+ [data-theme="dark"] .d3-precision-recall .intersection {
54
+ fill: #94E842;
55
+ }
56
+
57
+ .d3-precision-recall .label {
58
+ font-size: 14px;
59
+ font-weight: 600;
60
+ fill: var(--text-color);
61
+ }
62
+
63
+ .d3-precision-recall .count-label {
64
+ font-size: 13px;
65
+ font-weight: 500;
66
+ fill: var(--text-color);
67
+ }
68
+
69
+ .d3-precision-recall .formula-text {
70
+ font-size: 12px;
71
+ fill: var(--text-color);
72
+ }
73
+
74
+ .d3-precision-recall .formula-box {
75
+ fill: var(--surface-bg);
76
+ stroke: var(--border-color);
77
+ stroke-width: 1;
78
+ }
79
+
80
+ .d3-precision-recall .section-title {
81
+ font-size: 16px;
82
+ font-weight: 700;
83
+ fill: var(--primary-color);
84
+ text-anchor: middle;
85
+ }
86
+
87
+ .d3-precision-recall .legend-text {
88
+ font-size: 11px;
89
+ fill: var(--text-color);
90
+ }
91
+
92
+ .d3-precision-recall .legend-rect {
93
+ stroke-width: 1.5;
94
+ }
95
+ </style>
96
+
97
+ <script>
98
+ (() => {
99
+ const ensureD3 = (cb) => {
100
+ if (window.d3 && typeof window.d3.select === 'function') return cb();
101
+ let s = document.getElementById('d3-cdn-script');
102
+ if (!s) {
103
+ s = document.createElement('script');
104
+ s.id = 'd3-cdn-script';
105
+ s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
106
+ document.head.appendChild(s);
107
+ }
108
+ const onReady = () => {
109
+ if (window.d3 && typeof window.d3.select === 'function') cb();
110
+ };
111
+ s.addEventListener('load', onReady, { once: true });
112
+ if (window.d3) onReady();
113
+ };
114
+
115
+ const bootstrap = () => {
116
+ const scriptEl = document.currentScript;
117
+ let container = scriptEl ? scriptEl.previousElementSibling : null;
118
+ if (!(container && container.classList && container.classList.contains('d3-precision-recall'))) {
119
+ const candidates = Array.from(document.querySelectorAll('.d3-precision-recall'))
120
+ .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
121
+ container = candidates[candidates.length - 1] || null;
122
+ }
123
+
124
+ if (!container) return;
125
+
126
+ if (container.dataset) {
127
+ if (container.dataset.mounted === 'true') return;
128
+ container.dataset.mounted = 'true';
129
+ }
130
+
131
+ const svg = d3.select(container).append('svg');
132
+ const g = svg.append('g');
133
+
134
+ let width = 800;
135
+ let height = 500;
136
+
137
+ function render() {
138
+ width = container.clientWidth || 800;
139
+ height = Math.max(400, Math.round(width * 0.5));
140
+
141
+ svg.attr('width', width).attr('height', height);
142
+
143
+ const margin = { top: 40, right: 40, bottom: 80, left: 40 };
144
+ const innerWidth = width - margin.left - margin.right;
145
+ const innerHeight = height - margin.top - margin.bottom;
146
+
147
+ g.attr('transform', `translate(${margin.left},${margin.top})`);
148
+ g.selectAll('*').remove();
149
+
150
+ // Example: Question answering with exact match
151
+ const TP = 45; // True Positives (correct answers identified)
152
+ const FP = 8; // False Positives (incorrect marked as correct)
153
+ const FN = 5; // False Negatives (correct marked as incorrect)
154
+
155
+ const totalPredicted = TP + FP; // All predicted as correct
156
+ const totalRelevant = TP + FN; // All actually correct
157
+
158
+ const precision = TP / totalPredicted;
159
+ const recall = TP / totalRelevant;
160
+
161
+ // Circle parameters
162
+ const radius = Math.min(innerWidth, innerHeight) * 0.25;
163
+ const overlapOffset = radius * 0.6;
164
+
165
+ const predictedX = innerWidth * 0.35;
166
+ const relevantX = predictedX + overlapOffset;
167
+ const centerY = innerHeight * 0.4;
168
+
169
+ // Title
170
+ g.append('text')
171
+ .attr('class', 'section-title')
172
+ .attr('x', innerWidth / 2)
173
+ .attr('y', -10)
174
+ .text('Precision and Recall Visualization');
175
+
176
+ // Draw circles
177
+ g.append('circle')
178
+ .attr('class', 'circle circle-predicted')
179
+ .attr('cx', predictedX)
180
+ .attr('cy', centerY)
181
+ .attr('r', radius);
182
+
183
+ g.append('circle')
184
+ .attr('class', 'circle circle-relevant')
185
+ .attr('cx', relevantX)
186
+ .attr('cy', centerY)
187
+ .attr('r', radius);
188
+
189
+ // Calculate intersection area (approximate)
190
+ const intersectionX = (predictedX + relevantX) / 2;
191
+
192
+ // Draw intersection highlight
193
+ const clipId = 'clip-intersection-' + Math.random().toString(36).slice(2, 11);
194
+ const defs = g.append('defs');
195
+ const clipPath = defs.append('clipPath').attr('id', clipId);
196
+
197
+ clipPath.append('circle')
198
+ .attr('cx', predictedX)
199
+ .attr('cy', centerY)
200
+ .attr('r', radius);
201
+
202
+ g.append('circle')
203
+ .attr('class', 'intersection')
204
+ .attr('cx', relevantX)
205
+ .attr('cy', centerY)
206
+ .attr('r', radius)
207
+ .attr('clip-path', `url(#${clipId})`);
208
+
209
+ // Labels for circles
210
+ g.append('text')
211
+ .attr('class', 'label')
212
+ .attr('x', predictedX - radius * 0.7)
213
+ .attr('y', centerY - radius - 15)
214
+ .attr('text-anchor', 'middle')
215
+ .text('Predicted Correct');
216
+
217
+ g.append('text')
218
+ .attr('class', 'label')
219
+ .attr('x', relevantX + radius * 0.7)
220
+ .attr('y', centerY - radius - 15)
221
+ .attr('text-anchor', 'middle')
222
+ .text('Actually Correct');
223
+
224
+ // Count labels inside circles
225
+ // Left part (FP)
226
+ g.append('text')
227
+ .attr('class', 'count-label')
228
+ .attr('x', predictedX - radius * 0.5)
229
+ .attr('y', centerY)
230
+ .attr('text-anchor', 'middle')
231
+ .style('fill', '#4A90E2')
232
+ .text(`FP: ${FP}`);
233
+
234
+ // Intersection (TP)
235
+ g.append('text')
236
+ .attr('class', 'count-label')
237
+ .attr('x', intersectionX)
238
+ .attr('y', centerY)
239
+ .attr('text-anchor', 'middle')
240
+ .style('fill', '#7ED321')
241
+ .style('font-weight', '700')
242
+ .text(`TP: ${TP}`);
243
+
244
+ // Right part (FN)
245
+ g.append('text')
246
+ .attr('class', 'count-label')
247
+ .attr('x', relevantX + radius * 0.5)
248
+ .attr('y', centerY)
249
+ .attr('text-anchor', 'middle')
250
+ .style('fill', '#F5A623')
251
+ .text(`FN: ${FN}`);
252
+
253
+ // Formula boxes at bottom
254
+ const formulaY = centerY + radius + 60;
255
+ const boxWidth = Math.min(200, innerWidth * 0.35);
256
+ const boxHeight = 80;
257
+ const boxGap = 40;
258
+
259
+ const precisionX = innerWidth * 0.3 - boxWidth / 2;
260
+ const recallX = innerWidth * 0.7 - boxWidth / 2;
261
+
262
+ // Precision box
263
+ g.append('rect')
264
+ .attr('class', 'formula-box')
265
+ .attr('x', precisionX)
266
+ .attr('y', formulaY - boxHeight / 2)
267
+ .attr('width', boxWidth)
268
+ .attr('height', boxHeight)
269
+ .attr('rx', 8);
270
+
271
+ g.append('text')
272
+ .attr('class', 'label')
273
+ .attr('x', precisionX + boxWidth / 2)
274
+ .attr('y', formulaY - boxHeight / 2 + 20)
275
+ .attr('text-anchor', 'middle')
276
+ .style('fill', '#4A90E2')
277
+ .text('Precision');
278
+
279
+ g.append('text')
280
+ .attr('class', 'formula-text')
281
+ .attr('x', precisionX + boxWidth / 2)
282
+ .attr('y', formulaY - boxHeight / 2 + 40)
283
+ .attr('text-anchor', 'middle')
284
+ .text(`TP / (TP + FP) = ${TP} / ${totalPredicted}`);
285
+
286
+ g.append('text')
287
+ .attr('class', 'formula-text')
288
+ .attr('x', precisionX + boxWidth / 2)
289
+ .attr('y', formulaY - boxHeight / 2 + 60)
290
+ .attr('text-anchor', 'middle')
291
+ .style('font-weight', '700')
292
+ .style('font-size', '16px')
293
+ .style('fill', '#4A90E2')
294
+ .text(`= ${(precision * 100).toFixed(1)}%`);
295
+
296
+ // Recall box
297
+ g.append('rect')
298
+ .attr('class', 'formula-box')
299
+ .attr('x', recallX)
300
+ .attr('y', formulaY - boxHeight / 2)
301
+ .attr('width', boxWidth)
302
+ .attr('height', boxHeight)
303
+ .attr('rx', 8);
304
+
305
+ g.append('text')
306
+ .attr('class', 'label')
307
+ .attr('x', recallX + boxWidth / 2)
308
+ .attr('y', formulaY - boxHeight / 2 + 20)
309
+ .attr('text-anchor', 'middle')
310
+ .style('fill', '#F5A623')
311
+ .text('Recall');
312
+
313
+ g.append('text')
314
+ .attr('class', 'formula-text')
315
+ .attr('x', recallX + boxWidth / 2)
316
+ .attr('y', formulaY - boxHeight / 2 + 40)
317
+ .attr('text-anchor', 'middle')
318
+ .text(`TP / (TP + FN) = ${TP} / ${totalRelevant}`);
319
+
320
+ g.append('text')
321
+ .attr('class', 'formula-text')
322
+ .attr('x', recallX + boxWidth / 2)
323
+ .attr('y', formulaY - boxHeight / 2 + 60)
324
+ .attr('text-anchor', 'middle')
325
+ .style('font-weight', '700')
326
+ .style('font-size', '16px')
327
+ .style('fill', '#F5A623')
328
+ .text(`= ${(recall * 100).toFixed(1)}%`);
329
+ }
330
+
331
+ render();
332
+
333
+ // Responsive handling
334
+ if (window.ResizeObserver) {
335
+ const ro = new ResizeObserver(() => render());
336
+ ro.observe(container);
337
+ } else {
338
+ window.addEventListener('resize', render);
339
+ }
340
+ };
341
+
342
+ if (document.readyState === 'loading') {
343
+ document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
344
+ } else {
345
+ ensureD3(bootstrap);
346
+ }
347
+ })();
348
+ </script>
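The numbers shown by the precision/recall embed above can be checked by hand. The sketch below is illustrative only: `precisionRecall` is a hypothetical helper name, the counts (TP = 45, FP = 8, FN = 5) are the ones hard-coded in the embed, and the F1 score is an addition that the embed itself does not display.

```javascript
// Hypothetical helper (not part of the commit): computes precision, recall,
// and F1 from raw counts, returning 0 where a denominator would be 0.
function precisionRecall(tp, fp, fn) {
  const predicted = tp + fp; // everything the system marked as correct
  const relevant = tp + fn;  // everything that is actually correct
  const precision = predicted > 0 ? tp / predicted : 0;
  const recall = relevant > 0 ? tp / relevant : 0;
  const f1 = precision + recall > 0
    ? (2 * precision * recall) / (precision + recall)
    : 0;
  return { precision, recall, f1 };
}

const result = precisionRecall(45, 8, 5);
console.log((result.precision * 100).toFixed(1) + '%'); // 84.9%
console.log((result.recall * 100).toFixed(1) + '%');    // 90.0%
console.log(result.f1.toFixed(3));                      // 0.874
```

Precision answers "of everything marked correct, how much really was?", while recall answers "of everything that really was correct, how much was found?". F1 is their harmonic mean, so it only stays high when both are high.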
app/src/content/embeds/d3-text-metrics.html ADDED
@@ -0,0 +1,501 @@
1
+ <div class="d3-text-metrics"></div>
2
+
3
+ <style>
4
+ .d3-text-metrics {
5
+ font-family: var(--default-font-family);
6
+ background: transparent;
7
+ padding: 0;
8
+ width: 100%;
9
+ position: relative;
10
+ }
11
+
12
+ .d3-text-metrics .example-text {
13
+ font-size: 12px;
14
+ line-height: 1.8;
15
+ color: var(--text-color);
16
+ font-family: monospace;
17
+ margin: 8px 0;
18
+ padding: 10px 12px;
19
+ background: var(--surface-bg);
20
+ border: 1px solid var(--border-color);
21
+ border-radius: 6px;
22
+ }
23
+
24
+ .d3-text-metrics .label {
25
+ font-size: 10px;
26
+ font-weight: 700;
27
+ color: var(--muted-color);
28
+ margin-right: 8px;
29
+ }
30
+
31
+ .d3-text-metrics .metrics-grid {
32
+ display: grid;
33
+ grid-template-columns: repeat(3, 1fr);
34
+ gap: 12px;
35
+ margin: 16px 0;
36
+ }
37
+
38
+ .d3-text-metrics .metric-box {
39
+ padding: 12px;
40
+ background: var(--surface-bg);
41
+ border: 1px solid var(--border-color);
42
+ border-radius: 8px;
43
+ transition: border-color 0.2s;
44
+ }
45
+
46
+ .d3-text-metrics .metric-box:hover {
47
+ border-color: var(--primary-color);
48
+ }
49
+
50
+ .d3-text-metrics .metric-name {
51
+ font-size: 13px;
52
+ font-weight: 600;
53
+ color: var(--text-color);
54
+ margin-bottom: 6px;
55
+ }
56
+
57
+ .d3-text-metrics .metric-score {
58
+ font-size: 22px;
59
+ font-weight: 700;
60
+ color: var(--primary-color);
61
+ margin-bottom: 4px;
62
+ }
63
+
64
+ .d3-text-metrics .metric-detail {
65
+ font-size: 11px;
66
+ color: var(--muted-color);
67
+ line-height: 1.4;
68
+ }
69
+
70
+ .d3-text-metrics .visualization {
71
+ margin-top: 8px;
72
+ padding: 8px;
73
+ background: oklch(from var(--primary-color) calc(l + 0.45) c h / 0.06);
74
+ border-radius: 4px;
75
+ font-size: 10px;
76
+ }
77
+
78
+ [data-theme="dark"] .d3-text-metrics .visualization {
79
+ background: oklch(from var(--primary-color) calc(l + 0.20) c h / 0.1);
80
+ }
81
+
82
+ .d3-text-metrics .token {
83
+ display: inline-block;
84
+ padding: 2px 5px;
85
+ margin: 2px;
86
+ border-radius: 3px;
87
+ font-size: 10px;
88
+ background: var(--surface-bg);
89
+ border: 1px solid var(--border-color);
90
+ }
91
+
92
+ .d3-text-metrics .token.match {
93
+ background: oklch(from var(--primary-color) calc(l + 0.35) c h / 0.35);
94
+ border-color: var(--primary-color);
95
+ font-weight: 600;
96
+ }
97
+
98
+ [data-theme="dark"] .d3-text-metrics .token.match {
99
+ background: oklch(from var(--primary-color) calc(l + 0.25) c h / 0.4);
100
+ }
101
+
102
+ .d3-text-metrics .controls {
103
+ display: flex;
104
+ justify-content: center;
105
+ margin-bottom: 16px;
106
+ }
107
+
108
+ .d3-text-metrics select {
109
+ font-size: 12px;
110
+ padding: 6px 24px 6px 10px;
111
+ border: 1px solid var(--border-color);
112
+ border-radius: 6px;
113
+ background: var(--surface-bg);
114
+ color: var(--text-color);
115
+ cursor: pointer;
116
+ appearance: none;
117
+ background-image: url("data:image/svg+xml;charset=UTF-8,%3csvg xmlns='http://www.w3.org/2000/svg' viewBox='0 0 24 24' fill='none' stroke='currentColor' stroke-width='2' stroke-linecap='round' stroke-linejoin='round'%3e%3cpolyline points='6 9 12 15 18 9'%3e%3c/polyline%3e%3c/svg%3e");
118
+ background-repeat: no-repeat;
119
+ background-position: right 6px center;
120
+ background-size: 12px;
121
+ }
122
+
123
+ @media (max-width: 768px) {
124
+ .d3-text-metrics .metrics-grid {
125
+ grid-template-columns: 1fr;
126
+ }
127
+ }
128
+ </style>
129
+
130
+ <script>
131
+ (() => {
132
+ const bootstrap = () => {
133
+ const scriptEl = document.currentScript;
134
+ let container = scriptEl ? scriptEl.previousElementSibling : null;
135
+ if (!(container && container.classList && container.classList.contains('d3-text-metrics'))) {
136
+ const candidates = Array.from(document.querySelectorAll('.d3-text-metrics'))
137
+ .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
138
+ container = candidates[candidates.length - 1] || null;
139
+ }
140
+
141
+ if (!container) return;
142
+ if (container.dataset) {
143
+ if (container.dataset.mounted === 'true') return;
144
+ container.dataset.mounted = 'true';
145
+ }
146
+
147
+ // Single example: Cat Evaluator
148
+ const reference = "My cat loves doing model evaluation and testing benchmarks";
149
+ const prediction = "My cat enjoys model evaluation and testing models";
150
+
151
+ const tokenize = (text) => text.toLowerCase().trim().split(/\s+/);
152
+
153
+ const getNgrams = (tokens, n) => {
154
+ const ngrams = [];
155
+ for (let i = 0; i <= tokens.length - n; i++) {
156
+ ngrams.push(tokens.slice(i, i + n));
157
+ }
158
+ return ngrams;
159
+ };
160
+
161
+ const computeExactMatch = (pred, ref) => {
162
+ return pred.toLowerCase().trim() === ref.toLowerCase().trim() ? 1.0 : 0.0;
163
+ };
164
+
165
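+ // Simplified BLEU for illustration: 1- to 3-gram clipped precision, no brevity penalty,
+ // and zero precisions are dropped from the geometric mean (standard BLEU uses up to 4-grams,
+ // applies a brevity penalty, and scores 0 when any n-gram precision is 0).
+ const computeBleu = (pred, ref) => {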
166
+ const predTokens = tokenize(pred);
167
+ const refTokens = tokenize(ref);
168
+ if (predTokens.length === 0) return { score: 0, details: [] };
169
+
170
+ const details = [];
171
+ const precisions = [];
172
+
173
+ for (let n = 1; n <= 3; n++) {
174
+ const predNgrams = getNgrams(predTokens, n);
175
+ const refNgrams = getNgrams(refTokens, n);
176
+ if (predNgrams.length === 0) {
177
+ precisions.push(0);
178
+ continue;
179
+ }
180
+
181
+ const refCounts = {};
182
+ refNgrams.forEach(ng => {
183
+ const key = ng.join(' ');
184
+ refCounts[key] = (refCounts[key] || 0) + 1;
185
+ });
186
+
187
+ let matches = 0;
188
+ const matchedNgrams = [];
189
+ const predCounts = {};
190
+
191
+ predNgrams.forEach(ng => {
192
+ const key = ng.join(' ');
193
+ predCounts[key] = (predCounts[key] || 0) + 1;
194
+ if (refCounts[key] && predCounts[key] <= refCounts[key]) {
195
+ matches++;
196
+ if (!matchedNgrams.includes(key)) matchedNgrams.push(key);
197
+ }
198
+ });
199
+
200
+ const precision = matches / predNgrams.length;
201
+ precisions.push(precision);
202
+ details.push({ n, matches, total: predNgrams.length, matchedNgrams });
203
+ }
204
+
205
+ const validPrecisions = precisions.filter(p => p > 0);
206
+ const score = validPrecisions.length > 0
207
+ ? Math.exp(validPrecisions.reduce((sum, p) => sum + Math.log(p), 0) / validPrecisions.length)
208
+ : 0;
209
+
210
+ return { score, details };
211
+ };
212
+
213
+ const computeRouge1 = (pred, ref) => {
214
+ const predTokens = tokenize(pred);
215
+ const refTokens = tokenize(ref);
216
+
217
+ const predCounts = {};
218
+ const refCounts = {};
219
+ predTokens.forEach(t => predCounts[t] = (predCounts[t] || 0) + 1);
220
+ refTokens.forEach(t => refCounts[t] = (refCounts[t] || 0) + 1);
221
+
222
+ let overlap = 0;
223
+ const matchedTokens = [];
224
+ Object.keys(refCounts).forEach(token => {
225
+ if (predCounts[token]) {
226
+ overlap += Math.min(predCounts[token], refCounts[token]);
227
+ matchedTokens.push(token);
228
+ }
229
+ });
230
+
231
+ const recall = refTokens.length > 0 ? overlap / refTokens.length : 0;
232
+ const precision = predTokens.length > 0 ? overlap / predTokens.length : 0;
233
+ const f1 = (precision + recall) > 0 ? 2 * precision * recall / (precision + recall) : 0;
234
+
235
+ return { score: f1, recall, precision, matchedTokens };
236
+ };
237
+
238
+ const computeRouge2 = (pred, ref) => {
239
+ const predTokens = tokenize(pred);
240
+ const refTokens = tokenize(ref);
241
+
242
+ const predBigrams = getNgrams(predTokens, 2);
243
+ const refBigrams = getNgrams(refTokens, 2);
244
+
245
+ if (refBigrams.length === 0) {
246
+ return { score: 0, recall: 0, precision: 0, matchedBigrams: [] };
247
+ }
248
+
249
+ const predCounts = {};
250
+ const refCounts = {};
251
+ predBigrams.forEach(bg => {
252
+ const key = bg.join(' ');
253
+ predCounts[key] = (predCounts[key] || 0) + 1;
254
+ });
255
+ refBigrams.forEach(bg => {
256
+ const key = bg.join(' ');
257
+ refCounts[key] = (refCounts[key] || 0) + 1;
258
+ });
259
+
260
+ let overlap = 0;
261
+ const matchedBigrams = [];
262
+ Object.keys(refCounts).forEach(bigram => {
263
+ if (predCounts[bigram]) {
264
+ overlap += Math.min(predCounts[bigram], refCounts[bigram]);
265
+ matchedBigrams.push(bigram);
266
+ }
267
+ });
268
+
269
+ const recall = refBigrams.length > 0 ? overlap / refBigrams.length : 0;
270
+ const precision = predBigrams.length > 0 ? overlap / predBigrams.length : 0;
271
+ const f1 = (precision + recall) > 0 ? 2 * precision * recall / (precision + recall) : 0;
272
+
273
+ return { score: f1, recall, precision, matchedBigrams };
274
+ };
275
+
276
+ const computeEditDistanceWithOps = (s1, s2) => {
277
+ const m = s1.length;
278
+ const n = s2.length;
279
+
280
+ // Create DP table
281
+ const dp = Array(m + 1).fill(null).map(() => Array(n + 1).fill(0));
282
+
283
+ // Initialize
284
+ for (let i = 0; i <= m; i++) dp[i][0] = i;
285
+ for (let j = 0; j <= n; j++) dp[0][j] = j;
286
+
287
+ // Fill DP table
288
+ for (let i = 1; i <= m; i++) {
289
+ for (let j = 1; j <= n; j++) {
290
+ if (s1[i - 1] === s2[j - 1]) {
291
+ dp[i][j] = dp[i - 1][j - 1];
292
+ } else {
293
+ dp[i][j] = 1 + Math.min(
294
+ dp[i - 1][j], // delete
295
+ dp[i][j - 1], // insert
296
+ dp[i - 1][j - 1] // substitute
297
+ );
298
+ }
299
+ }
300
+ }
301
+
302
+ // Traceback to find operations
303
+ const operations = [];
304
+ let i = m, j = n;
305
+
306
+ while (i > 0 || j > 0) {
307
+ if (i === 0) {
308
+ operations.unshift({ type: 'insert', value: s2[j - 1], pos: j });
309
+ j--;
310
+ } else if (j === 0) {
311
+ operations.unshift({ type: 'delete', value: s1[i - 1], pos: i });
312
+ i--;
313
+ } else if (s1[i - 1] === s2[j - 1]) {
314
+ i--;
315
+ j--;
316
+ } else {
317
+ const deleteCost = dp[i - 1][j];
318
+ const insertCost = dp[i][j - 1];
319
+ const substituteCost = dp[i - 1][j - 1];
320
+
321
+ if (substituteCost <= deleteCost && substituteCost <= insertCost) {
322
+ operations.unshift({ type: 'substitute', from: s1[i - 1], to: s2[j - 1], pos: i });
323
+ i--;
324
+ j--;
325
+ } else if (deleteCost <= insertCost) {
326
+ operations.unshift({ type: 'delete', value: s1[i - 1], pos: i });
327
+ i--;
328
+ } else {
329
+ operations.unshift({ type: 'insert', value: s2[j - 1], pos: j });
330
+ j--;
331
+ }
332
+ }
333
+ }
334
+
335
+ return { distance: dp[m][n], operations };
336
+ };
337
+
338
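+ // Simplified TER for illustration: word-level Levenshtein distance divided by reference length.
+ // Block shifts, which the full TER metric also counts as edits, are not modeled here.
+ const computeTer = (pred, ref) => {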
339
+ const predTokens = tokenize(pred);
340
+ const refTokens = tokenize(ref);
341
+ const result = computeEditDistanceWithOps(predTokens, refTokens);
342
+ const score = refTokens.length > 0 ? result.distance / refTokens.length : 1.0;
343
+ return {
344
+ score,
345
+ edits: result.distance,
346
+ refLength: refTokens.length,
347
+ operations: result.operations
348
+ };
349
+ };
350
+
351
+ const computeBleurtMock = (pred, ref) => {
352
+ const predTokens = new Set(tokenize(pred));
353
+ const refTokens = new Set(tokenize(ref));
354
+ const intersection = new Set([...predTokens].filter(t => refTokens.has(t)));
355
+ const union = new Set([...predTokens, ...refTokens]);
356
+ const jaccard = union.size > 0 ? intersection.size / union.size : 0;
357
+ return { score: jaccard * 1.5 - 0.5, jaccard };
358
+ };
359
+
360
+ const render = () => {
361
+ const exactMatch = computeExactMatch(prediction, reference);
362
+ const bleu = computeBleu(prediction, reference);
363
+ const rouge1 = computeRouge1(prediction, reference);
364
+ const rouge2 = computeRouge2(prediction, reference);
365
+ const ter = computeTer(prediction, reference);
366
+ const bleurt = computeBleurtMock(prediction, reference);
367
+
368
+ container.innerHTML = `
369
+ <div class="example-text">
370
+ <span class="label">REF:</span>${reference}
371
+ </div>
372
+ <div class="example-text">
373
+ <span class="label">PRED:</span>${prediction}
374
+ </div>
375
+
376
+ <div class="metrics-grid">
377
+ <!-- Row 1: Exact Match, TER, BLEURT -->
378
+ <div class="metric-box">
379
+ <div class="metric-name">Exact Match</div>
380
+ <div class="metric-score">${exactMatch.toFixed(1)}</div>
381
+ <div class="metric-detail">Binary: 1 or 0</div>
382
+ <div class="visualization">
383
+ <div style="margin: 4px 0; font-size: 14px;">
384
+ ${exactMatch === 1 ? '✓ Strings are identical' : '✗ Strings differ'}
385
+ </div>
386
+ <div style="margin-top: 6px; font-size: 9px; color: var(--muted-color);">
387
+ Most strict metric - no partial credit
388
+ </div>
389
+ </div>
390
+ </div>
391
+
392
+ <div class="metric-box">
393
+ <div class="metric-name">Translation Error Rate</div>
394
+ <div class="metric-score">${ter.score.toFixed(3)}</div>
395
+ <div class="metric-detail">Edit distance normalized</div>
396
+ <div class="visualization">
397
+ <div style="margin: 4px 0;">
398
+ <strong>${ter.edits}</strong> edits / <strong>${ter.refLength}</strong> words = <strong>${ter.score.toFixed(3)}</strong>
399
+ </div>
400
+ ${ter.operations.length > 0 ? `
401
+ <div style="margin-top: 8px; font-size: 10px;">
402
+ <div style="margin-bottom: 4px; color: var(--muted-color);">Edit operations:</div>
403
+ ${ter.operations.map((op, idx) => {
404
+ if (op.type === 'substitute') {
405
+ return `<div style="margin: 2px 0;">• Replace "<strong>${op.from}</strong>" → "<strong>${op.to}</strong>"</div>`;
406
+ } else if (op.type === 'delete') {
407
+ return `<div style="margin: 2px 0;">• Delete "<strong>${op.value}</strong>"</div>`;
408
+ } else if (op.type === 'insert') {
409
+ return `<div style="margin: 2px 0;">• Insert "<strong>${op.value}</strong>"</div>`;
410
+ }
411
+ }).join('')}
412
+ </div>
413
+ ` : ''}
414
+ <div style="margin-top: 6px; font-size: 9px; color: var(--muted-color);">
415
+ Lower is better (0 = identical)
416
+ </div>
417
+ </div>
418
+ </div>
419
+
420
+ <div class="metric-box">
421
+ <div class="metric-name">BLEURT</div>
422
+ <div class="metric-score">${bleurt.score.toFixed(3)}</div>
423
+ <div class="metric-detail">Semantic similarity</div>
424
+ <div class="visualization">
425
+ <div style="margin-top: 6px; font-size: 9px; color: var(--muted-color); font-style: italic;">
426
+ Note: Real BLEURT is a BERT-based model fine-tuned on human judgments. This demo shows a mock score derived from Jaccard similarity.
427
+ </div>
428
+ </div>
429
+ </div>
430
+
431
+ <!-- Row 2: BLEU, ROUGE-1, ROUGE-2 -->
432
+ <div class="metric-box">
433
+ <div class="metric-name">BLEU</div>
434
+ <div class="metric-score">${bleu.score.toFixed(3)}</div>
435
+ <div class="metric-detail">N-gram precision-based</div>
436
+ <div class="visualization">
437
+ ${bleu.details.map(d => `
438
+ <div style="margin: 4px 0;">
439
+ <strong>${d.n}-gram:</strong> ${d.matches}/${d.total} (${d.total > 0 ? (d.matches / d.total * 100).toFixed(0) : '0'}%)
440
+ </div>
441
+ <div style="margin: 2px 0;">
442
+ ${d.matchedNgrams.slice(0, 3).map(ng => `<span class="token match">${ng}</span>`).join('')}
443
+ ${d.matchedNgrams.length > 3 ? `<span style="color: var(--muted-color); font-size: 10px;">+${d.matchedNgrams.length - 3} more</span>` : ''}
444
+ </div>
445
+ `).join('')}
446
+ </div>
447
+ </div>
448
+
449
+ <div class="metric-box">
450
+ <div class="metric-name">ROUGE-1</div>
451
+ <div class="metric-score">${rouge1.score.toFixed(3)}</div>
452
+ <div class="metric-detail">Unigram-based F1</div>
453
+ <div class="visualization">
454
+ <div style="margin: 4px 0;">
455
+ <strong>Recall:</strong> ${(rouge1.recall * 100).toFixed(0)}% | <strong>Precision:</strong> ${(rouge1.precision * 100).toFixed(0)}%
456
+ </div>
457
+ <div style="margin-top: 6px; font-size: 9px; color: var(--muted-color);">
458
+ Matched unigrams:
459
+ </div>
460
+ ${rouge1.matchedTokens.length > 0 ? `
461
+ <div style="margin: 2px 0;">
462
+ ${rouge1.matchedTokens.slice(0, 5).map(t => `<span class="token match">${t}</span>`).join('')}
463
+ ${rouge1.matchedTokens.length > 5 ? `<span style="color: var(--muted-color); font-size: 10px;">+${rouge1.matchedTokens.length - 5} more</span>` : ''}
464
+ </div>
465
+ ` : ''}
466
+ </div>
467
+ </div>
468
+
469
+ <div class="metric-box">
470
+ <div class="metric-name">ROUGE-2</div>
471
+ <div class="metric-score">${rouge2.score.toFixed(3)}</div>
472
+ <div class="metric-detail">Bigram-based F1</div>
473
+ <div class="visualization">
474
+ <div style="margin: 4px 0;">
475
+ <strong>Recall:</strong> ${(rouge2.recall * 100).toFixed(0)}% | <strong>Precision:</strong> ${(rouge2.precision * 100).toFixed(0)}%
476
+ </div>
477
+ <div style="margin-top: 6px; font-size: 9px; color: var(--muted-color);">
478
+ Matched bigrams:
479
+ </div>
480
+ ${rouge2.matchedBigrams.length > 0 ? `
481
+ <div style="margin: 2px 0;">
482
+ ${rouge2.matchedBigrams.slice(0, 3).map(bg => `<span class="token match">${bg}</span>`).join('')}
483
+ ${rouge2.matchedBigrams.length > 3 ? `<span style="color: var(--muted-color); font-size: 10px;">+${rouge2.matchedBigrams.length - 3} more</span>` : ''}
484
+ </div>
485
+ ` : '<div style="margin: 2px 0; font-size: 10px; color: var(--muted-color);">No bigram matches</div>'}
486
+ </div>
487
+ </div>
488
+ </div>
489
+ `;
490
+ };
491
+
492
+ render();
493
+ };
494
+
495
+ if (document.readyState === 'loading') {
496
+ document.addEventListener('DOMContentLoaded', bootstrap, { once: true });
497
+ } else {
498
+ bootstrap();
499
+ }
500
+ })();
501
+ </script>