evaluation-guidebook

Clémentine committed 5 days ago

Commit

bb4414b

1 Parent(s): 0322f30

figures

Files changed (6)

app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx +7 -3
app/src/content/chapters/general-knowledge/2025-evaluations-for-useful-models.mdx +1 -1
app/src/content/embeds/d3-binary-metrics.html +400 -0
app/src/content/embeds/d3-metrics-comparison.html +572 -0
app/src/content/embeds/d3-precision-recall.html +348 -0
app/src/content/embeds/d3-text-metrics.html +501 -0

app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx CHANGED

@@ -4,6 +4,7 @@ title: "Designing your automatic evaluation"
 import Note from "../../../components/Note.astro";
 import Sidenote from "../../../components/Sidenote.astro";
 import UsingHumanAnnotators from "../human-evaluation/using-human-annotators.mdx";
 ### Dataset
@@ -114,10 +115,13 @@ When there is a ground truth, however, you can use automatic metrics, let's see
 #### Metrics
 Most ways to automatically compare a string of text to a reference are match-based.
-The easiest but least flexible match based metrics are **exact matches** of token sequences. <Sidenote> Be aware that "exact match" is used as a catch all name, and also includes "fuzzy matches" of strings: compared with normalization, on subsets of tokens (prefix only for ex), etc </Sidenote>. While simple and unambiguous, they provide no partial credit - a prediction that's correct except for one word scores the same as one that's completely wrong.
-The translation and summarisation fields have introduced automatic metrics which compare similarity through overlap of n-grams in sequences. **BLEU** (Bilingual Evaluation Understudy) measures n-gram overlap with reference translations and remains widely used despite having a length bias toward shorter translations and correlating poorly with humans at the sentence level (it notably won't work well for predictions which are semantically equivalent but written in a different fashion than the reference). **ROUGE** does a similar thing but focuses more on recall-oriented n-gram overlap.
-Lastly, you'll also find model-based metrics using embedding distances for similarity like **BLEURT** (it uses BERT-based learned representations trained on human judgments from WMT, providing better semantic understanding than n-gram methods, but requiring a model download and task-specific fine-tuning for optimal performance).
 Once you have an accuracy score per sample, you can **aggregate** it across your whole set in several ways. In general, people average their results, but you can do more complex things depending on your needs. (Some metrics already come with an aggregation, like CorpusBLEU).

 import Note from "../../../components/Note.astro";
 import Sidenote from "../../../components/Sidenote.astro";
+import HtmlEmbed from "../../../components/HtmlEmbed.astro";
 import UsingHumanAnnotators from "../human-evaluation/using-human-annotators.mdx";
 ### Dataset
 #### Metrics
 Most ways to automatically compare a string of text to a reference are match-based.
+The easiest but least flexible match-based metrics are **exact matches** of token sequences. While simple and unambiguous, they give no partial credit: a prediction that's correct except for one word scores the same as one that's completely wrong. <Sidenote> Be aware that "exact match" is used as a catch-all name that also covers "fuzzy matches" of strings: comparison after normalization, on subsets of tokens (prefix only, for example), etc. </Sidenote>
+The translation and summarisation fields have introduced automatic metrics which compare similarity through the overlap of n-grams in sequences. **BLEU** (Bilingual Evaluation Understudy) measures n-gram overlap with reference translations and remains widely used despite having a length bias toward shorter translations and correlating poorly with human judgments at the sentence level (notably, it won't work well for predictions that are semantically equivalent to the reference but phrased differently). **ROUGE** does something similar but focuses on recall-oriented n-gram overlap. A simpler relative of these is **TER** (translation error rate), the number of edits required to turn a prediction into the correct reference (similar to an edit distance).
+Lastly, you'll also find model-based metrics using embedding distances for similarity like **BLEURT** (it uses BERT-based learned representations trained on human judgments from WMT, providing better semantic understanding than n-gram methods, but requiring a model download and task-specific fine-tuning for optimal performance).
+I'm only introducing the best-known metrics here; all of them have variations and extensions, including CorpusBLEU, GLEU, MAUVE, and METEOR, to cite a few.
+<HtmlEmbed src="d3-text-metrics.html" frameless />
 Once you have an accuracy score per sample, you can **aggregate** it across your whole set in several ways. In general, people average their results, but you can do more complex things depending on your needs. (Some metrics already come with an aggregation, like CorpusBLEU).
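
A minimal sketch of the match-based metrics described in this hunk, assuming whitespace tokenization and lowercasing (both are simplifications; real BLEU, ROUGE, and TER implementations use their own tokenizers, multiple references, smoothing, and, for TER, shift operations). The helper names `tokenize`, `ngramScores`, `exactMatch`, and `wordEditDistance` are hypothetical and not part of the guidebook's code. The prediction/reference pair is the one used in the `d3-metrics-comparison.html` embed of this commit.

```javascript
// Hypothetical helpers for illustration only; not the guidebook's implementation.
const tokenize = (text) => text.toLowerCase().split(/\s+/).filter(Boolean);

// Count every n-gram of size n in a token list.
const ngramCounts = (tokens, n) => {
  const counts = new Map();
  for (let i = 0; i + n <= tokens.length; i++) {
    const gram = tokens.slice(i, i + n).join(' ');
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
};

// Clipped overlap: an n-gram is credited at most as often as it occurs in the other text.
const clippedOverlap = (a, b) => {
  let total = 0;
  for (const [gram, count] of a) total += Math.min(count, b.get(gram) || 0);
  return total;
};

const totalCount = (counts) => [...counts.values()].reduce((acc, c) => acc + c, 0);

// Precision is the BLEU-style view, recall the ROUGE-style view of the same overlap.
const ngramScores = (prediction, reference, n) => {
  const pred = ngramCounts(tokenize(prediction), n);
  const ref = ngramCounts(tokenize(reference), n);
  const matched = clippedOverlap(pred, ref);
  const precision = totalCount(pred) ? matched / totalCount(pred) : 0;
  const recall = totalCount(ref) ? matched / totalCount(ref) : 0;
  const f1 = precision + recall ? (2 * precision * recall) / (precision + recall) : 0;
  return { precision, recall, f1 };
};

// "Exact match" after normalization (case and whitespace), i.e. the fuzzy variant.
const exactMatch = (prediction, reference) =>
  tokenize(prediction).join(' ') === tokenize(reference).join(' ') ? 1 : 0;

// Word-level Levenshtein distance: a simplified stand-in for TER (no shift operation).
const wordEditDistance = (a, b) => {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
};

const prediction = 'Evaluation is an amazing topic';
const reference = 'Evaluation is amazing';

console.log('exact match:', exactMatch(prediction, reference));
console.log('unigrams:', ngramScores(prediction, reference, 1));
console.log('bigrams:', ngramScores(prediction, reference, 2));
console.log('word edits:', wordEditDistance(tokenize(prediction), tokenize(reference)));
```

On this pair, exact match gives no credit while the unigram scores do: 3 of the 5 predicted words appear in the reference, and all 3 reference words are recovered. That is the partial-credit behaviour the paragraph contrasts with exact matching.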

app/src/content/chapters/general-knowledge/2025-evaluations-for-useful-models.mdx CHANGED

@@ -136,7 +136,7 @@ Competitive bluffing games like [Poker](https://arxiv.org/html/2501.08328v1) (20
 What's also very neat about these is that they have a single and unambiguous pass/fail metric: did the LLM win the game or not? At the moment, if I were to use these to evaluate models I would probably look at TextQuests for abilities and Town of Salem for safety.
-Beyond testing capabilities in controlled environments, there's one type of evaluation that's inherently impossible to game: predicting the future. (Ok it's a tangent but I find these super fun and they could be relevant!)
 #### Forecasters
 In the last year, a new category of impossible-to-contaminate tasks emerged: forecasting. (I guess technically forecasting on the stock markets can be cheated through some manipulation, but hopefully we're not there yet in terms of financial incentives to mess up evals.) These tasks should require combining reasoning across sources to answer questions about events that have not yet occurred, but it's uncertain whether these benchmarks are discriminative enough to have strong value, and they likely reinforce the "slot machine success" vibe of LLMs. (Is the performance on some events close to random because they are impossible to predict or because models are bad at it? In the other direction, if models are able to predict the event correctly, is the question too easy or too formulaic?)

 What's also very neat about these is that they have a single and unambiguous pass/fail metric: did the LLM win the game or not? At the moment, if I were to use these to evaluate models I would probably look at TextQuests for abilities and Town of Salem for safety.
+Beyond testing capabilities in controlled environments, people have explored the ultimate ungameable task: predicting the future.
 #### Forecasters
 In the last year, a new category of impossible-to-contaminate tasks emerged: forecasting. (I guess technically forecasting on the stock markets can be cheated through some manipulation, but hopefully we're not there yet in terms of financial incentives to mess up evals.) These tasks should require combining reasoning across sources to answer questions about events that have not yet occurred, but it's uncertain whether these benchmarks are discriminative enough to have strong value, and they likely reinforce the "slot machine success" vibe of LLMs. (Is the performance on some events close to random because they are impossible to predict or because models are bad at it? In the other direction, if models are able to predict the event correctly, is the question too easy or too formulaic?)

app/src/content/embeds/d3-binary-metrics.html ADDED

	@@ -0,0 +1,400 @@

+<div class="d3-binary-metrics"></div>
+<style>
+  .d3-binary-metrics {
+    font-family: var(--default-font-family);
+    background: transparent;
+    border: none;
+    border-radius: 0;
+    padding: var(--spacing-4) 0;
+    width: 100%;
+    margin: 0 auto;
+  }
+  .d3-binary-metrics .metrics-container {
+    display: flex;
+    flex-direction: column;
+    gap: var(--spacing-4);
+  }
+  .d3-binary-metrics .confusion-matrix {
+    display: grid;
+    grid-template-columns: 100px 1fr 1fr;
+    grid-template-rows: 100px 1fr 1fr;
+    gap: 2px;
+    max-width: 400px;
+    margin: 0 auto;
+  }
+  .d3-binary-metrics .matrix-label {
+    display: flex;
+    align-items: center;
+    justify-content: center;
+    font-size: 14px;
+    font-weight: 600;
+    color: var(--text-color);
+  }
+  .d3-binary-metrics .matrix-header-row {
+    grid-column: 1;
+    grid-row: 1;
+  }
+  .d3-binary-metrics .matrix-header-col {
+    grid-row: 1;
+    grid-column: 1;
+  }
+  .d3-binary-metrics .predicted-label {
+    grid-column: 2 / 4;
+    grid-row: 1;
+    font-size: 13px;
+    font-weight: 700;
+    color: var(--primary-color);
+    text-transform: uppercase;
+    letter-spacing: 0.05em;
+  }
+  .d3-binary-metrics .actual-label {
+    grid-column: 1;
+    grid-row: 2 / 4;
+    writing-mode: vertical-rl;
+    transform: rotate(180deg);
+    font-size: 13px;
+    font-weight: 700;
+    color: var(--primary-color);
+    text-transform: uppercase;
+    letter-spacing: 0.05em;
+  }
+  .d3-binary-metrics .matrix-pos-label {
+    grid-column: 2;
+    grid-row: 1;
+    font-size: 12px;
+    padding-bottom: 10px;
+  }
+  .d3-binary-metrics .matrix-neg-label {
+    grid-column: 3;
+    grid-row: 1;
+    font-size: 12px;
+    padding-bottom: 10px;
+  }
+  .d3-binary-metrics .matrix-pos-label-row {
+    grid-column: 1;
+    grid-row: 2;
+    font-size: 12px;
+    padding-right: 10px;
+  }
+  .d3-binary-metrics .matrix-neg-label-row {
+    grid-column: 1;
+    grid-row: 3;
+    font-size: 12px;
+    padding-right: 10px;
+  }
+  .d3-binary-metrics .matrix-cell {
+    display: flex;
+    flex-direction: column;
+    align-items: center;
+    justify-content: center;
+    padding: var(--spacing-3);
+    border-radius: 8px;
+    min-height: 100px;
+    border: 2px solid;
+    transition: all 0.3s ease;
+  }
+  .d3-binary-metrics .matrix-cell:hover {
+    transform: scale(1.05);
+    box-shadow: 0 4px 12px rgba(0, 0, 0, 0.15);
+  }
+  .d3-binary-metrics .cell-tp {
+    grid-column: 2;
+    grid-row: 2;
+    background: oklch(from var(--primary-color) calc(l + 0.35) calc(c * 0.8) h / 0.3);
+    border-color: oklch(from var(--primary-color) calc(l + 0.1) c h / 0.7);
+  }
+  .d3-binary-metrics .cell-fp {
+    grid-column: 2;
+    grid-row: 3;
+    background: oklch(from #ff6b6b calc(l + 0.35) c h / 0.25);
+    border-color: oklch(from #ff6b6b calc(l + 0.1) c h / 0.6);
+  }
+  .d3-binary-metrics .cell-fn {
+    grid-column: 3;
+    grid-row: 2;
+    background: oklch(from #ffa500 calc(l + 0.35) c h / 0.25);
+    border-color: oklch(from #ffa500 calc(l + 0.1) c h / 0.6);
+  }
+  .d3-binary-metrics .cell-tn {
+    grid-column: 3;
+    grid-row: 3;
+    background: oklch(from var(--primary-color) calc(l + 0.35) calc(c * 0.8) h / 0.3);
+    border-color: oklch(from var(--primary-color) calc(l + 0.1) c h / 0.7);
+  }
+  [data-theme="dark"] .d3-binary-metrics .cell-tp,
+  [data-theme="dark"] .d3-binary-metrics .cell-tn {
+    background: oklch(from var(--primary-color) calc(l + 0.25) calc(c * 0.8) h / 0.25);
+    border-color: oklch(from var(--primary-color) calc(l + 0.05) c h / 0.75);
+  }
+  [data-theme="dark"] .d3-binary-metrics .cell-fp {
+    background: oklch(from #ff6b6b calc(l + 0.25) c h / 0.2);
+    border-color: oklch(from #ff6b6b calc(l + 0.05) c h / 0.65);
+  }
+  [data-theme="dark"] .d3-binary-metrics .cell-fn {
+    background: oklch(from #ffa500 calc(l + 0.25) c h / 0.2);
+    border-color: oklch(from #ffa500 calc(l + 0.05) c h / 0.65);
+  }
+  .d3-binary-metrics .cell-label {
+    font-size: 11px;
+    font-weight: 700;
+    color: var(--text-color);
+    text-transform: uppercase;
+    letter-spacing: 0.05em;
+    margin-bottom: var(--spacing-1);
+  }
+  .d3-binary-metrics .cell-value {
+    font-size: 32px;
+    font-weight: 700;
+    color: var(--text-color);
+  }
+  .d3-binary-metrics .cell-description {
+    font-size: 10px;
+    color: var(--muted-color);
+    text-align: center;
+    margin-top: var(--spacing-1);
+  }
+  .d3-binary-metrics .metrics-grid {
+    display: grid;
+    grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
+    gap: var(--spacing-3);
+    margin-top: var(--spacing-4);
+  }
+  .d3-binary-metrics .metric-card {
+    background: oklch(from var(--primary-color) calc(l + 0.42) c h / 0.25);
+    border: 1px solid oklch(from var(--primary-color) calc(l + 0.2) c h / 0.5);
+    border-radius: 12px;
+    padding: var(--spacing-4);
+    display: flex;
+    flex-direction: column;
+    gap: var(--spacing-2);
+  }
+  [data-theme="dark"] .d3-binary-metrics .metric-card {
+    background: oklch(from var(--primary-color) calc(l + 0.32) c h / 0.2);
+    border-color: oklch(from var(--primary-color) calc(l + 0.15) c h / 0.55);
+  }
+  .d3-binary-metrics .metric-name {
+    font-size: 15px;
+    font-weight: 700;
+    color: var(--primary-color);
+  }
+  [data-theme="dark"] .d3-binary-metrics .metric-name {
+    color: oklch(from var(--primary-color) calc(l + 0.05) calc(c * 1.1) h);
+  }
+  .d3-binary-metrics .metric-formula {
+    font-size: 13px;
+    color: var(--text-color);
+    font-family: monospace;
+    background: var(--surface-bg);
+    padding: var(--spacing-2);
+    border-radius: 6px;
+    border: 1px solid var(--border-color);
+  }
+  .d3-binary-metrics .metric-value {
+    font-size: 24px;
+    font-weight: 700;
+    color: var(--primary-color);
+    text-align: center;
+  }
+  .d3-binary-metrics .metric-interpretation {
+    font-size: 12px;
+    color: var(--muted-color);
+    line-height: 1.4;
+  }
+  .d3-binary-metrics .example-title {
+    font-size: 16px;
+    font-weight: 700;
+    color: var(--primary-color);
+    text-align: center;
+    margin-bottom: var(--spacing-3);
+  }
+  .d3-binary-metrics .example-description {
+    font-size: 13px;
+    color: var(--text-color);
+    text-align: center;
+    font-style: italic;
+    margin-bottom: var(--spacing-4);
+  }
+  @media (max-width: 768px) {
+    .d3-binary-metrics .confusion-matrix {
+      max-width: 100%;
+      grid-template-columns: 80px 1fr 1fr;
+      grid-template-rows: 80px 1fr 1fr;
+    }
+    .d3-binary-metrics .matrix-cell {
+      min-height: 80px;
+      padding: var(--spacing-2);
+    }
+    .d3-binary-metrics .cell-value {
+      font-size: 24px;
+    }
+    .d3-binary-metrics .metrics-grid {
+      grid-template-columns: 1fr;
+    }
+  }
+</style>
+<script>
+  (() => {
+    const bootstrap = () => {
+      const scriptEl = document.currentScript;
+      let container = scriptEl ? scriptEl.previousElementSibling : null;
+      if (!(container && container.classList && container.classList.contains('d3-binary-metrics'))) {
+        const candidates = Array.from(document.querySelectorAll('.d3-binary-metrics'))
+          .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
+        container = candidates[candidates.length - 1] || null;
+      }
+      if (!container) return;
+      if (container.dataset) {
+        if (container.dataset.mounted === 'true') return;
+        container.dataset.mounted = 'true';
+      }
+      // Example: Question answering - checking if answer is correct
+      const TP = 45;  // Correctly identified as correct answer
+      const FP = 8;   // Incorrect answer marked as correct
+      const FN = 5;   // Correct answer marked as incorrect
+      const TN = 42;  // Correctly identified as incorrect answer
+      // Calculate metrics
+      const precision = TP / (TP + FP);
+      const recall = TP / (TP + FN);
+      const f1 = 2 * (precision * recall) / (precision + recall);
+      // MCC calculation
+      const numerator = (TP * TN) - (FP * FN);
+      const denominator = Math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN));
+      const mcc = numerator / denominator;
+      container.innerHTML = `
+        <div class="metrics-container">
+          <div class="example-title">Binary Classification Metrics Example</div>
+          <div class="example-description">
+            Question Answering: 100 model predictions evaluated (50 correct, 50 incorrect)
+          </div>
+          <div class="confusion-matrix">
+            <div class="matrix-label predicted-label">Predicted</div>
+            <div class="matrix-label actual-label">Actual</div>
+            <div class="matrix-label matrix-pos-label">Correct</div>
+            <div class="matrix-label matrix-neg-label">Incorrect</div>
+            <div class="matrix-label matrix-pos-label-row">Correct</div>
+            <div class="matrix-label matrix-neg-label-row">Incorrect</div>
+            <div class="matrix-cell cell-tp">
+              <div class="cell-label">True Positive</div>
+              <div class="cell-value">${TP}</div>
+              <div class="cell-description">Correct answer identified as correct</div>
+            </div>
+            <div class="matrix-cell cell-fp">
+              <div class="cell-label">False Positive</div>
+              <div class="cell-value">${FP}</div>
+              <div class="cell-description">Incorrect answer marked as correct</div>
+            </div>
+            <div class="matrix-cell cell-fn">
+              <div class="cell-label">False Negative</div>
+              <div class="cell-value">${FN}</div>
+              <div class="cell-description">Correct answer marked as incorrect</div>
+            </div>
+            <div class="matrix-cell cell-tn">
+              <div class="cell-label">True Negative</div>
+              <div class="cell-value">${TN}</div>
+              <div class="cell-description">Incorrect answer identified as incorrect</div>
+            </div>
+          </div>
+          <div class="metrics-grid">
+            <div class="metric-card">
+              <div class="metric-name">Precision</div>
+              <div class="metric-formula">TP / (TP + FP)</div>
+              <div class="metric-value">${precision.toFixed(3)}</div>
+              <div class="metric-interpretation">
+                ${(precision * 100).toFixed(1)}% of answers marked correct are actually correct.
+                Critical when false positives (wrong answers accepted) are costly.
+              </div>
+            </div>
+            <div class="metric-card">
+              <div class="metric-name">Recall</div>
+              <div class="metric-formula">TP / (TP + FN)</div>
+              <div class="metric-value">${recall.toFixed(3)}</div>
+              <div class="metric-interpretation">
+                ${(recall * 100).toFixed(1)}% of actually correct answers were identified.
+                Critical when missing positives (rejecting correct answers) is costly.
+              </div>
+            </div>
+            <div class="metric-card">
+              <div class="metric-name">F1 Score</div>
+              <div class="metric-formula">2 × (P × R) / (P + R)</div>
+              <div class="metric-value">${f1.toFixed(3)}</div>
+              <div class="metric-interpretation">
+                Harmonic mean of precision and recall.
+                Balances both metrics, good for imbalanced data.
+              </div>
+            </div>
+            <div class="metric-card">
+              <div class="metric-name">MCC</div>
+              <div class="metric-formula">(TP×TN - FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))</div>
+              <div class="metric-value">${mcc.toFixed(3)}</div>
+              <div class="metric-interpretation">
+                Matthews Correlation Coefficient ranges from -1 to +1.
+                Works well with imbalanced datasets.
+              </div>
+            </div>
+          </div>
+        </div>
+      `;
+    };
+    if (document.readyState === 'loading') {
+      document.addEventListener('DOMContentLoaded', bootstrap, { once: true });
+    } else {
+      bootstrap();
+    }
+  })();
+</script>
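
The metric formulas in this embed can be checked outside the browser. The sketch below is a hypothetical standalone helper (`binaryMetrics` is not part of the embed) that recomputes precision, recall, F1, and MCC from the same counts, and additionally reports accuracy. It assumes that an empty denominator should yield 0 rather than `NaN`, which is one convention among several; the embed itself does not guard against that case.

```javascript
// Hypothetical helper: derives the metrics shown in the embed from raw confusion-matrix counts.
const binaryMetrics = ({ tp, fp, fn, tn }) => {
  // Assumption: return 0 instead of NaN when a denominator is empty.
  const safeDiv = (num, den) => (den === 0 ? 0 : num / den);

  const precision = safeDiv(tp, tp + fp); // share of predicted positives that are right
  const recall = safeDiv(tp, tp + fn); // share of actual positives that were found
  const f1 = safeDiv(2 * precision * recall, precision + recall); // harmonic mean
  const accuracy = safeDiv(tp + tn, tp + fp + fn + tn);
  // Matthews Correlation Coefficient, in [-1, 1].
  const mcc = safeDiv(
    tp * tn - fp * fn,
    Math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  );

  return { precision, recall, f1, accuracy, mcc };
};

// Same counts as the embed: 50 actually correct answers, 50 actually incorrect.
const example = binaryMetrics({ tp: 45, fp: 8, fn: 5, tn: 42 });
for (const [name, value] of Object.entries(example)) {
  console.log(`${name}: ${value.toFixed(3)}`);
}
```

With these counts, precision is 45/53, recall is 45/50, and F1 simplifies to 90/103, so the three cards in the embed can be cross-checked by hand.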

app/src/content/embeds/d3-metrics-comparison.html ADDED

	@@ -0,0 +1,572 @@

+<div class="d3-metrics-comparison"></div>
+<style>
+  .d3-metrics-comparison {
+    font-family: var(--default-font-family);
+    background: transparent;
+    border: none;
+    border-radius: 0;
+    padding: var(--spacing-4) 0;
+    width: 100%;
+    margin: 0 auto;
+    position: relative;
+  }
+  .d3-metrics-comparison svg {
+    width: 100%;
+    height: auto;
+    display: block;
+  }
+  .d3-metrics-comparison .node-rect {
+    stroke-width: 2;
+    transition: all 0.3s ease;
+  }
+  .d3-metrics-comparison .node-rect:hover {
+    filter: brightness(1.1);
+    stroke-width: 3;
+  }
+  .d3-metrics-comparison .input-node {
+    fill: oklch(from var(--primary-color) calc(l + 0.42) c h / 0.35);
+    stroke: oklch(from var(--primary-color) calc(l + 0.1) c h / 0.7);
+  }
+  .d3-metrics-comparison .method-node {
+    fill: oklch(from var(--primary-color) calc(l + 0.38) c h / 0.45);
+    stroke: var(--primary-color);
+  }
+  .d3-metrics-comparison .score-node {
+    fill: oklch(from var(--primary-color) calc(l + 0.35) c h / 0.55);
+    stroke: oklch(from var(--primary-color) calc(l - 0.05) calc(c * 1.2) h);
+  }
+  [data-theme="dark"] .d3-metrics-comparison .input-node {
+    fill: oklch(from var(--primary-color) calc(l + 0.32) c h / 0.3);
+    stroke: oklch(from var(--primary-color) calc(l + 0.05) c h / 0.75);
+  }
+  [data-theme="dark"] .d3-metrics-comparison .method-node {
+    fill: oklch(from var(--primary-color) calc(l + 0.28) c h / 0.4);
+    stroke: oklch(from var(--primary-color) calc(l + 0.05) calc(c * 1.1) h);
+  }
+  [data-theme="dark"] .d3-metrics-comparison .score-node {
+    fill: oklch(from var(--primary-color) calc(l + 0.25) c h / 0.5);
+    stroke: oklch(from var(--primary-color) calc(l) calc(c * 1.3) h);
+  }
+  .d3-metrics-comparison .node-label {
+    fill: var(--text-color);
+    font-size: 13px;
+    font-weight: 600;
+    pointer-events: none;
+    user-select: none;
+  }
+  .d3-metrics-comparison .node-sublabel {
+    fill: var(--muted-color);
+    font-size: 10px;
+    font-weight: 500;
+    pointer-events: none;
+    user-select: none;
+  }
+  .d3-metrics-comparison .node-example {
+    fill: var(--text-color);
+    font-size: 10px;
+    font-weight: 500;
+    font-style: italic;
+    pointer-events: none;
+    user-select: none;
+  }
+  .d3-metrics-comparison .link-path {
+    fill: none;
+    stroke: oklch(from var(--primary-color) l c h / 0.4);
+    stroke-width: 2;
+    transition: all 0.3s ease;
+  }
+  [data-theme="dark"] .d3-metrics-comparison .link-path {
+    stroke: oklch(from var(--primary-color) l c h / 0.5);
+  }
+  .d3-metrics-comparison .link-path:hover {
+    stroke: var(--primary-color);
+    stroke-width: 3;
+  }
+  .d3-metrics-comparison .link-label {
+    fill: var(--text-color);
+    font-size: 10px;
+    font-weight: 600;
+    pointer-events: none;
+    user-select: none;
+  }
+  .d3-metrics-comparison .score-badge {
+    fill: var(--primary-color);
+    font-size: 14px;
+    font-weight: 700;
+    pointer-events: none;
+    user-select: none;
+  }
+  .d3-metrics-comparison .score-badge-bg {
+    fill: var(--surface-bg);
+    stroke: var(--primary-color);
+    stroke-width: 2;
+  }
+  .d3-metrics-comparison .section-title {
+    fill: var(--primary-color);
+    font-size: 12px;
+    font-weight: 700;
+    text-transform: uppercase;
+    letter-spacing: 0.05em;
+  }
+  [data-theme="dark"] .d3-metrics-comparison .section-title {
+    fill: oklch(from var(--primary-color) calc(l + 0.1) calc(c * 1.2) h);
+  }
+  .d3-metrics-comparison .marker {
+    fill: oklch(from var(--primary-color) l c h / 0.6);
+  }
+  .d3-metrics-comparison .tooltip {
+    position: absolute;
+    background: var(--surface-bg);
+    border: 1px solid var(--border-color);
+    border-radius: 8px;
+    padding: 10px 14px;
+    font-size: 12px;
+    pointer-events: none;
+    opacity: 0;
+    transition: opacity 0.2s ease;
+    box-shadow: 0 4px 12px rgba(0, 0, 0, 0.15);
+    z-index: 1000;
+    max-width: 350px;
+    line-height: 1.5;
+    white-space: pre-line;
+    color: var(--text-color);
+  }
+  .d3-metrics-comparison .tooltip.visible {
+    opacity: 1;
+  }
+  @media (max-width: 768px) {
+    .d3-metrics-comparison .node-label {
+      font-size: 11px;
+    }
+    .d3-metrics-comparison .node-sublabel {
+      font-size: 9px;
+    }
+    .d3-metrics-comparison .node-example {
+      font-size: 9px;
+    }
+    .d3-metrics-comparison .link-label {
+      font-size: 9px;
+    }
+    .d3-metrics-comparison .score-badge {
+      font-size: 12px;
+    }
+  }
+</style>
+<script>
+  (() => {
+    const ensureD3 = (cb) => {
+      if (window.d3 && typeof window.d3.select === 'function') return cb();
+      let s = document.getElementById('d3-cdn-script');
+      if (!s) {
+        s = document.createElement('script');
+        s.id = 'd3-cdn-script';
+        s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
+        document.head.appendChild(s);
+      }
+      const onReady = () => {
+        if (window.d3 && typeof window.d3.select === 'function') cb();
+      };
+      s.addEventListener('load', onReady, { once: true });
+      if (window.d3) onReady();
+    };
+    const bootstrap = () => {
+      const scriptEl = document.currentScript;
+      let container = scriptEl ? scriptEl.previousElementSibling : null;
+      if (!(container && container.classList && container.classList.contains('d3-metrics-comparison'))) {
+        const candidates = Array.from(document.querySelectorAll('.d3-metrics-comparison'))
+          .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
+        container = candidates[candidates.length - 1] || null;
+      }
+      if (!container) return;
+      if (container.dataset) {
+        if (container.dataset.mounted === 'true') return;
+        container.dataset.mounted = 'true';
+      }
+      container.style.position = 'relative';
+      // Tooltip
+      const tooltip = document.createElement('div');
+      tooltip.className = 'tooltip';
+      container.appendChild(tooltip);
+      // Data structure: inputs -> methods -> scores
+      const data = {
+        inputs: [
+          {
+            id: 'prediction',
+            label: 'Prediction',
+            sublabel: '(model output)',
+            example: '"Evaluation is an amazing topic"'
+          },
+          {
+            id: 'reference',
+            label: 'Reference',
+            sublabel: '(ground truth)',
+            example: '"Evaluation is amazing"'
+          }
+        ],
+        methods: [
+          {
+            id: 'exact',
+            label: 'Exact Match',
+            sublabel: 'token sequences',
+            score: '0',
+            description: 'Strings don\'t match exactly—missing words "an" and "topic"',
+            scoreType: 'binary'
+          },
+          {
+            id: 'bleu',
+            label: 'BLEU',
+            sublabel: 'n-gram overlap',
+            score: '0.13',
+            description: 'Actual BLEU computation:\n• BLEU-1 (unigrams): 0.60 (3/5 match)\n• BLEU-2 (bigrams): 0.39 (1/4 match)\n• BLEU-3 (trigrams): 0.17 (0/3 match)\n• Final BLEU (geometric mean): 0.13\n• Brevity penalty does not apply here (prediction is longer than reference)',
+            scoreType: 'continuous'
+          },
+          {
+            id: 'rouge',
+            label: 'ROUGE',
+            sublabel: 'recall-oriented',
+            score: '0.75',
+            description: 'ROUGE-1 (unigram) scores:\n• Recall: 3/3 = 100% (all reference words found in prediction)\n• Precision: 3/5 = 60% (prediction words in reference)\n• F1 score: 0.75\nReference: ["evaluation", "is", "amazing"]',
+            scoreType: 'continuous'
+          },
+          {
+            id: 'bleurt',
+            label: 'BLEURT',
+            sublabel: 'semantic similarity',
+            score: '0.85',
+            description: 'High semantic similarity—both express positive sentiment about evaluation',
+            scoreType: 'continuous'
+          }
+        ],
+        scores: [
+          {
+            id: 'binary',
+            label: 'Binary Score',
+            sublabel: 'correct/incorrect'
+          },
+          {
+            id: 'continuous',
+            label: 'Continuous Score',
+            sublabel: '0.0 to 1.0'
+          }
+        ]
+      };
+      const svg = d3.select(container).append('svg');
+      const g = svg.append('g');
+      // Arrow marker
+      svg.append('defs').append('marker')
+        .attr('id', 'arrowhead')
+        .attr('viewBox', '0 -5 10 10')
+        .attr('refX', 8)
+        .attr('refY', 0)
+        .attr('markerWidth', 6)
+        .attr('markerHeight', 6)
+        .attr('orient', 'auto')
+        .append('path')
+        .attr('d', 'M0,-5L10,0L0,5')
+        .attr('class', 'marker');
+      let width = 800;
+      let height = 500;
+      function wrapText(text, maxWidth) {
+        const words = text.split(' ');
+        const lines = [];
+        let currentLine = words[0];
+        for (let i = 1; i < words.length; i++) {
+          const word = words[i];
+          const testLine = currentLine + ' ' + word;
+          if (testLine.length * 6 < maxWidth) {
+            currentLine = testLine;
+          } else {
+            lines.push(currentLine);
+            currentLine = word;
+          }
+        }
+        lines.push(currentLine);
+        return lines;
+      }
+      function render() {
+        width = container.clientWidth || 800;
+        height = Math.max(500, Math.round(width * 0.7));
+        svg.attr('width', width).attr('height', height);
+        const margin = { top: 40, right: 20, bottom: 20, left: 20 };
+        const innerWidth = width - margin.left - margin.right;
+        const innerHeight = height - margin.top - margin.bottom;
+        g.attr('transform', `translate(${margin.left},${margin.top})`);
+        // Clear previous content
+        g.selectAll('*').remove();
+        // Column positions with increased horizontal spacing
+        const nodeWidth = Math.min(150, innerWidth * 0.2);
+        const nodeHeight = 85;
+        const gapBetweenColumns = Math.max(80, innerWidth * 0.15);
+        // Calculate column centers with larger gaps
+        const col1X = nodeWidth / 2 + 20;
+        const col2X = col1X + nodeWidth / 2 + gapBetweenColumns + nodeWidth / 2;
+        const col3X = col2X + nodeWidth / 2 + gapBetweenColumns + nodeWidth / 2;
+        // Section titles
+        g.selectAll('.section-title')
+          .data([
+            { x: col1X, label: 'INPUTS' },
+            { x: col2X, label: 'COMPARISON METHODS' },
+            { x: col3X, label: 'SCORES' }
+          ])
+          .join('text')
+          .attr('class', 'section-title')
+          .attr('x', d => d.x)
+          .attr('y', -15)
+          .attr('text-anchor', 'middle')
+          .text(d => d.label);
+        // Calculate positions
+        const inputY = innerHeight * 0.25;
+        const methodStartY = 40;
+        const methodSpacing = (innerHeight - methodStartY - nodeHeight) / (data.methods.length - 1);
+        // Position score nodes to align with specific methods
+        // Binary score aligns with Exact Match (index 0)
+        // Continuous score aligns with ROUGE (index 2)
+        const exactMatchY = methodStartY + 0 * methodSpacing;
+        const rougeY = methodStartY + 2 * methodSpacing;
+        // Position nodes
+        const inputNodes = data.inputs.map((d, i) => ({
+          ...d,
+          x: col1X - nodeWidth / 2,
+          y: inputY + i * (nodeHeight + 30),
+          width: nodeWidth,
+          height: nodeHeight,
+          type: 'input'
+        }));
+        const methodNodes = data.methods.map((d, i) => ({
+          ...d,
+          x: col2X - nodeWidth / 2,
+          y: methodStartY + i * methodSpacing,
+          width: nodeWidth,
+          height: nodeHeight,
+          type: 'method'
+        }));
+        const scoreNodes = data.scores.map((d, i) => {
+          // Binary score aligns with Exact Match, Continuous with ROUGE
+          const yPos = d.id === 'binary' ? exactMatchY : rougeY;
+          return {
+            ...d,
+            x: col3X - nodeWidth / 2,
+            y: yPos,
+            width: nodeWidth,
+            height: nodeHeight,
+            type: 'score'
+          };
+        });
+        const allNodes = [...inputNodes, ...methodNodes, ...scoreNodes];
+        // Create links: inputs -> methods -> scores
+        const links = [];
+        // Each input connects to all methods
+        inputNodes.forEach(input => {
+          methodNodes.forEach(method => {
+            links.push({
+              source: input,
+              target: method,
+              type: 'input-method'
+            });
+          });
+        });
+        // Each method connects to appropriate score type
+        methodNodes.forEach(method => {
+          const targetScore = scoreNodes.find(s => s.id === method.scoreType);
+          if (targetScore) {
+            links.push({
+              source: method,
+              target: targetScore,
+              type: 'method-score',
+              score: method.score
+            });
+          }
+        });
+        // Draw links
+        const linkGroup = g.append('g').attr('class', 'links');
+        linkGroup.selectAll('.link-path')
+          .data(links)
+          .join('path')
+          .attr('class', 'link-path')
+          .attr('d', d => {
+            const sx = d.source.x + d.source.width;
+            const sy = d.source.y + d.source.height / 2;
+            const tx = d.target.x;
+            const ty = d.target.y + d.target.height / 2;
+            const mx = (sx + tx) / 2;
+            return `M ${sx} ${sy} C ${mx} ${sy}, ${mx} ${ty}, ${tx} ${ty}`;
+          })
+          .attr('marker-end', 'url(#arrowhead)');
+        // Add score badges on method->score links
+        const scoreBadges = linkGroup.selectAll('.score-badge-group')
+          .data(links.filter(d => d.type === 'method-score'))
+          .join('g')
+          .attr('class', 'score-badge-group')
+          .attr('transform', d => {
+            const sx = d.source.x + d.source.width;
+            const sy = d.source.y + d.source.height / 2;
+            const tx = d.target.x;
+            const ty = d.target.y + d.target.height / 2;
+            const mx = (sx + tx) / 2;
+            const my = (sy + ty) / 2;
+            return `translate(${mx}, ${my})`;
+          });
+        scoreBadges.append('rect')
+          .attr('class', 'score-badge-bg')
+          .attr('x', -20)
+          .attr('y', -12)
+          .attr('width', 40)
+          .attr('height', 24)
+          .attr('rx', 6);
+        scoreBadges.append('text')
+          .attr('class', 'score-badge')
+          .attr('text-anchor', 'middle')
+          .attr('dominant-baseline', 'middle')
+          .text(d => d.score);
+        // Draw nodes
+        const nodeGroup = g.append('g').attr('class', 'nodes');
+        const nodes = nodeGroup.selectAll('.node')
+          .data(allNodes)
+          .join('g')
+          .attr('class', 'node')
+          .attr('transform', d => `translate(${d.x},${d.y})`)
+          .style('cursor', 'pointer');
+        nodes.append('rect')
+          .attr('class', d => `node-rect ${d.type}-node`)
+          .attr('width', d => d.width)
+          .attr('height', d => d.height)
+          .attr('rx', 8)
+          .on('mouseenter', function(event, d) {
+            if (d.description) {
+              tooltip.textContent = d.description;
+              tooltip.classList.add('visible');
+              const rect = container.getBoundingClientRect();
+              tooltip.style.left = (event.clientX - rect.left + 10) + 'px';
+              tooltip.style.top = (event.clientY - rect.top + 10) + 'px';
+            }
+          })
+          .on('mouseleave', function() {
+            tooltip.classList.remove('visible');
+          });
+        nodes.append('text')
+          .attr('class', 'node-label')
+          .attr('x', d => d.width / 2)
+          .attr('y', 18)
+          .attr('text-anchor', 'middle')
+          .text(d => d.label);
+        nodes.append('text')
+          .attr('class', 'node-sublabel')
+          .attr('x', d => d.width / 2)
+          .attr('y', 32)
+          .attr('text-anchor', 'middle')
+          .text(d => d.sublabel);
+        // Add example text to input nodes
+        nodes.filter(d => d.type === 'input' && d.example)
+          .each(function(d) {
+            const node = d3.select(this);
+            const lines = wrapText(d.example, d.width - 16);
+            lines.forEach((line, i) => {
+              node.append('text')
+                .attr('class', 'node-example')
+                .attr('x', d.width / 2)
+                .attr('y', 48 + i * 12)
+                .attr('text-anchor', 'middle')
+                .text(line);
+            });
+          });
+        // Score is shown on the arrows, not in the method nodes
+        // Add aggregation info to score nodes
+        nodes.filter(d => d.type === 'score' && d.aggregations)
+          .append('text')
+          .attr('class', 'node-sublabel')
+          .attr('x', d => d.width / 2)
+          .attr('y', d => d.height - 12)
+          .attr('text-anchor', 'middle')
+          .attr('font-size', '9px')
+          .text(d => d.aggregations.slice(0, 2).join(', ') + (d.aggregations.length > 2 ? '...' : ''));
+      }
+      render();
+      // Responsive handling
+      if (window.ResizeObserver) {
+        const ro = new ResizeObserver(() => render());
+        ro.observe(container);
+      } else {
+        window.addEventListener('resize', render);
+      }
+    };
+    if (document.readyState === 'loading') {
+      document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
+    } else {
+      ensureD3(bootstrap);
+    }
+  })();
+</script>
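
Note on the link geometry in the embed above: each connector is a single cubic Bézier segment that leaves the right edge of the source node and enters the left edge of the target, with both control points at the horizontal midpoint so the curve starts and ends horizontally. A minimal sketch of that computation as a standalone function follows; `linkPath` and the sample node objects are hypothetical names used only for illustration and are not part of the commit.

```javascript
// Hypothetical helper mirroring the 'd' accessor used for '.link-path'.
// Nodes are plain objects { x, y, width, height }, with (x, y) the top-left corner.
function linkPath(source, target) {
  const sx = source.x + source.width;      // right edge of the source node
  const sy = source.y + source.height / 2; // vertical center of the source node
  const tx = target.x;                     // left edge of the target node
  const ty = target.y + target.height / 2; // vertical center of the target node
  const mx = (sx + tx) / 2;                // both control points share this x
  return `M ${sx} ${sy} C ${mx} ${sy}, ${mx} ${ty}, ${tx} ${ty}`;
}

// Illustrative coordinates only (not taken from the rendered figure).
const input = { x: 20, y: 100, width: 150, height: 85 };
const method = { x: 250, y: 40, width: 150, height: 85 };
console.log(linkPath(input, method));
// M 170 142.5 C 210 142.5, 210 82.5, 250 82.5
```

Keeping the control points at the midpoint means the curve shape depends only on the two endpoints, so the score badge placed at `((sx + tx) / 2, (sy + ty) / 2)` always sits on the curve.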

app/src/content/embeds/d3-precision-recall.html ADDED

	@@ -0,0 +1,348 @@

+<div class="d3-precision-recall"></div>
+<style>
+  .d3-precision-recall {
+    font-family: var(--default-font-family);
+    background: transparent;
+    border: none;
+    border-radius: 0;
+    padding: var(--spacing-4) 0;
+    width: 100%;
+    margin: 0 auto;
+  }
+  .d3-precision-recall svg {
+    width: 100%;
+    height: auto;
+    display: block;
+  }
+  .d3-precision-recall .circle {
+    fill: none;
+    stroke-width: 3;
+    opacity: 0.8;
+  }
+  .d3-precision-recall .circle-predicted {
+    stroke: #4A90E2;
+    fill: #4A90E2;
+    fill-opacity: 0.15;
+  }
+  .d3-precision-recall .circle-relevant {
+    stroke: #F5A623;
+    fill: #F5A623;
+    fill-opacity: 0.15;
+  }
+  .d3-precision-recall .intersection {
+    fill: #7ED321;
+    fill-opacity: 0.3;
+  }
+  [data-theme="dark"] .d3-precision-recall .circle-predicted {
+    stroke: #5DA9FF;
+    fill: #5DA9FF;
+  }
+  [data-theme="dark"] .d3-precision-recall .circle-relevant {
+    stroke: #FFB84D;
+    fill: #FFB84D;
+  }
+  [data-theme="dark"] .d3-precision-recall .intersection {
+    fill: #94E842;
+  }
+  .d3-precision-recall .label {
+    font-size: 14px;
+    font-weight: 600;
+    fill: var(--text-color);
+  }
+  .d3-precision-recall .count-label {
+    font-size: 13px;
+    font-weight: 500;
+    fill: var(--text-color);
+  }
+  .d3-precision-recall .formula-text {
+    font-size: 12px;
+    fill: var(--text-color);
+  }
+  .d3-precision-recall .formula-box {
+    fill: var(--surface-bg);
+    stroke: var(--border-color);
+    stroke-width: 1;
+  }
+  .d3-precision-recall .section-title {
+    font-size: 16px;
+    font-weight: 700;
+    fill: var(--primary-color);
+    text-anchor: middle;
+  }
+  .d3-precision-recall .legend-text {
+    font-size: 11px;
+    fill: var(--text-color);
+  }
+  .d3-precision-recall .legend-rect {
+    stroke-width: 1.5;
+  }
+</style>
+<script>
+  (() => {
+    const ensureD3 = (cb) => {
+      if (window.d3 && typeof window.d3.select === 'function') return cb();
+      let s = document.getElementById('d3-cdn-script');
+      if (!s) {
+        s = document.createElement('script');
+        s.id = 'd3-cdn-script';
+        s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
+        document.head.appendChild(s);
+      }
+      const onReady = () => {
+        if (window.d3 && typeof window.d3.select === 'function') cb();
+      };
+      s.addEventListener('load', onReady, { once: true });
+      if (window.d3) onReady();
+    };
+    const bootstrap = () => {
+      const scriptEl = document.currentScript;
+      let container = scriptEl ? scriptEl.previousElementSibling : null;
+      if (!(container && container.classList && container.classList.contains('d3-precision-recall'))) {
+        const candidates = Array.from(document.querySelectorAll('.d3-precision-recall'))
+          .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
+        container = candidates[candidates.length - 1] || null;
+      }
+      if (!container) return;
+      if (container.dataset) {
+        if (container.dataset.mounted === 'true') return;
+        container.dataset.mounted = 'true';
+      }
+      const svg = d3.select(container).append('svg');
+      const g = svg.append('g');
+      let width = 800;
+      let height = 500;
+      function render() {
+        width = container.clientWidth || 800;
+        height = Math.max(400, Math.round(width * 0.5));
+        svg.attr('width', width).attr('height', height);
+        const margin = { top: 40, right: 40, bottom: 80, left: 40 };
+        const innerWidth = width - margin.left - margin.right;
+        const innerHeight = height - margin.top - margin.bottom;
+        g.attr('transform', `translate(${margin.left},${margin.top})`);
+        g.selectAll('*').remove();
+        // Example: Question answering with exact match
+        const TP = 45;  // True Positives (correct answers identified)
+        const FP = 8;   // False Positives (incorrect marked as correct)
+        const FN = 5;   // False Negatives (correct marked as incorrect)
+        const totalPredicted = TP + FP;  // All predicted as correct
+        const totalRelevant = TP + FN;   // All actually correct
+        const precision = TP / totalPredicted;
+        const recall = TP / totalRelevant;
+        // Circle parameters
+        const radius = Math.min(innerWidth, innerHeight) * 0.25;
+        const overlapOffset = radius * 0.6;
+        const predictedX = innerWidth * 0.35;
+        const relevantX = predictedX + overlapOffset;
+        const centerY = innerHeight * 0.4;
+        // Title
+        g.append('text')
+          .attr('class', 'section-title')
+          .attr('x', innerWidth / 2)
+          .attr('y', -10)
+          .text('Precision and Recall Visualization');
+        // Draw circles
+        g.append('circle')
+          .attr('class', 'circle circle-predicted')
+          .attr('cx', predictedX)
+          .attr('cy', centerY)
+          .attr('r', radius);
+        g.append('circle')
+          .attr('class', 'circle circle-relevant')
+          .attr('cx', relevantX)
+          .attr('cy', centerY)
+          .attr('r', radius);
+        // Horizontal center of the overlap region, used to position the TP label
+        const intersectionX = (predictedX + relevantX) / 2;
+        // Draw intersection highlight
+        const clipId = 'clip-intersection-' + Math.random().toString(36).slice(2, 11);
+        const defs = g.append('defs');
+        const clipPath = defs.append('clipPath').attr('id', clipId);
+        clipPath.append('circle')
+          .attr('cx', predictedX)
+          .attr('cy', centerY)
+          .attr('r', radius);
+        g.append('circle')
+          .attr('class', 'intersection')
+          .attr('cx', relevantX)
+          .attr('cy', centerY)
+          .attr('r', radius)
+          .attr('clip-path', `url(#${clipId})`);
+        // Labels for circles
+        g.append('text')
+          .attr('class', 'label')
+          .attr('x', predictedX - radius * 0.7)
+          .attr('y', centerY - radius - 15)
+          .attr('text-anchor', 'middle')
+          .text('Predicted Correct');
+        g.append('text')
+          .attr('class', 'label')
+          .attr('x', relevantX + radius * 0.7)
+          .attr('y', centerY - radius - 15)
+          .attr('text-anchor', 'middle')
+          .text('Actually Correct');
+        // Count labels inside circles
+        // Left part (FP)
+        g.append('text')
+          .attr('class', 'count-label')
+          .attr('x', predictedX - radius * 0.5)
+          .attr('y', centerY)
+          .attr('text-anchor', 'middle')
+          .style('fill', '#4A90E2') // style, not attr: the .count-label CSS rule overrides a fill attribute
+          .text(`FP: ${FP}`);
+        // Intersection (TP)
+        g.append('text')
+          .attr('class', 'count-label')
+          .attr('x', intersectionX)
+          .attr('y', centerY)
+          .attr('text-anchor', 'middle')
+          .style('fill', '#7ED321')
+          .style('font-weight', '700')
+          .text(`TP: ${TP}`);
+        // Right part (FN)
+        g.append('text')
+          .attr('class', 'count-label')
+          .attr('x', relevantX + radius * 0.5)
+          .attr('y', centerY)
+          .attr('text-anchor', 'middle')
+          .style('fill', '#F5A623')
+          .text(`FN: ${FN}`);
+        // Formula boxes at bottom
+        const formulaY = centerY + radius + 60;
+        const boxWidth = Math.min(200, innerWidth * 0.35);
+        const boxHeight = 80;
+        const boxGap = 40;
+        const precisionX = innerWidth * 0.3 - boxWidth / 2;
+        const recallX = innerWidth * 0.7 - boxWidth / 2;
+        // Precision box
+        g.append('rect')
+          .attr('class', 'formula-box')
+          .attr('x', precisionX)
+          .attr('y', formulaY - boxHeight / 2)
+          .attr('width', boxWidth)
+          .attr('height', boxHeight)
+          .attr('rx', 8);
+        g.append('text')
+          .attr('class', 'label')
+          .attr('x', precisionX + boxWidth / 2)
+          .attr('y', formulaY - boxHeight / 2 + 20)
+          .attr('text-anchor', 'middle')
+          .style('fill', '#4A90E2') // style, not attr: the .label CSS rule overrides a fill attribute
+          .text('Precision');
+        g.append('text')
+          .attr('class', 'formula-text')
+          .attr('x', precisionX + boxWidth / 2)
+          .attr('y', formulaY - boxHeight / 2 + 40)
+          .attr('text-anchor', 'middle')
+          .text(`TP / (TP + FP) = ${TP} / ${totalPredicted}`);
+        g.append('text')
+          .attr('class', 'formula-text')
+          .attr('x', precisionX + boxWidth / 2)
+          .attr('y', formulaY - boxHeight / 2 + 60)
+          .attr('text-anchor', 'middle')
+          .style('font-weight', '700')
+          .style('font-size', '16px')
+          .style('fill', '#4A90E2')
+          .text(`= ${(precision * 100).toFixed(1)}%`);
+        // Recall box
+        g.append('rect')
+          .attr('class', 'formula-box')
+          .attr('x', recallX)
+          .attr('y', formulaY - boxHeight / 2)
+          .attr('width', boxWidth)
+          .attr('height', boxHeight)
+          .attr('rx', 8);
+        g.append('text')
+          .attr('class', 'label')
+          .attr('x', recallX + boxWidth / 2)
+          .attr('y', formulaY - boxHeight / 2 + 20)
+          .attr('text-anchor', 'middle')
+          .style('fill', '#F5A623')
+          .text('Recall');
+        g.append('text')
+          .attr('class', 'formula-text')
+          .attr('x', recallX + boxWidth / 2)
+          .attr('y', formulaY - boxHeight / 2 + 40)
+          .attr('text-anchor', 'middle')
+          .text(`TP / (TP + FN) = ${TP} / ${totalRelevant}`);
+        g.append('text')
+          .attr('class', 'formula-text')
+          .attr('x', recallX + boxWidth / 2)
+          .attr('y', formulaY - boxHeight / 2 + 60)
+          .attr('text-anchor', 'middle')
+          .style('font-weight', '700')
+          .style('font-size', '16px')
+          .style('fill', '#F5A623')
+          .text(`= ${(recall * 100).toFixed(1)}%`);
+      }
+      render();
+      // Responsive handling
+      if (window.ResizeObserver) {
+        const ro = new ResizeObserver(() => render());
+        ro.observe(container);
+      } else {
+        window.addEventListener('resize', render);
+      }
+    };
+    if (document.readyState === 'loading') {
+      document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
+    } else {
+      ensureD3(bootstrap);
+    }
+  })();
+</script>
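
Note on the numbers in the figure above: with TP = 45, FP = 8 and FN = 5, precision is 45 / 53 and recall is 45 / 50. The sketch below recomputes both outside of D3 and adds their harmonic mean (F1), the same combination the ROUGE helpers in this commit use; `precisionRecall` is a hypothetical helper name and the zero-denominator guards are an assumption that the embed does not need for its fixed counts.

```javascript
// Hypothetical helper: precision, recall and F1 from confusion counts.
function precisionRecall({ tp, fp, fn }) {
  const predicted = tp + fp; // everything the system marked as correct
  const relevant = tp + fn;  // everything that is actually correct
  const precision = predicted > 0 ? tp / predicted : 0;
  const recall = relevant > 0 ? tp / relevant : 0;
  const f1 = precision + recall > 0
    ? (2 * precision * recall) / (precision + recall)
    : 0;
  return { precision, recall, f1 };
}

// Same counts as the embed.
const result = precisionRecall({ tp: 45, fp: 8, fn: 5 });
console.log(`${(result.precision * 100).toFixed(1)}%`); // 84.9%
console.log(`${(result.recall * 100).toFixed(1)}%`);    // 90.0%
console.log(result.f1.toFixed(3));                      // 0.874
```

The two percentages match what the formula boxes render, which gives a quick check that the figure and the arithmetic agree.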

app/src/content/embeds/d3-text-metrics.html ADDED

	@@ -0,0 +1,501 @@

+<div class="d3-text-metrics"></div>
+<style>
+  .d3-text-metrics {
+    font-family: var(--default-font-family);
+    background: transparent;
+    padding: 0;
+    width: 100%;
+    position: relative;
+  }
+  .d3-text-metrics .example-text {
+    font-size: 12px;
+    line-height: 1.8;
+    color: var(--text-color);
+    font-family: monospace;
+    margin: 8px 0;
+    padding: 10px 12px;
+    background: var(--surface-bg);
+    border: 1px solid var(--border-color);
+    border-radius: 6px;
+  }
+  .d3-text-metrics .label {
+    font-size: 10px;
+    font-weight: 700;
+    color: var(--muted-color);
+    margin-right: 8px;
+  }
+  .d3-text-metrics .metrics-grid {
+    display: grid;
+    grid-template-columns: repeat(3, 1fr);
+    gap: 12px;
+    margin: 16px 0;
+  }
+  .d3-text-metrics .metric-box {
+    padding: 12px;
+    background: var(--surface-bg);
+    border: 1px solid var(--border-color);
+    border-radius: 8px;
+    transition: border-color 0.2s;
+  }
+  .d3-text-metrics .metric-box:hover {
+    border-color: var(--primary-color);
+  }
+  .d3-text-metrics .metric-name {
+    font-size: 13px;
+    font-weight: 600;
+    color: var(--text-color);
+    margin-bottom: 6px;
+  }
+  .d3-text-metrics .metric-score {
+    font-size: 22px;
+    font-weight: 700;
+    color: var(--primary-color);
+    margin-bottom: 4px;
+  }
+  .d3-text-metrics .metric-detail {
+    font-size: 11px;
+    color: var(--muted-color);
+    line-height: 1.4;
+  }
+  .d3-text-metrics .visualization {
+    margin-top: 8px;
+    padding: 8px;
+    background: oklch(from var(--primary-color) calc(l + 0.45) c h / 0.06);
+    border-radius: 4px;
+    font-size: 10px;
+  }
+  [data-theme="dark"] .d3-text-metrics .visualization {
+    background: oklch(from var(--primary-color) calc(l + 0.20) c h / 0.1);
+  }
+  .d3-text-metrics .token {
+    display: inline-block;
+    padding: 2px 5px;
+    margin: 2px;
+    border-radius: 3px;
+    font-size: 10px;
+    background: var(--surface-bg);
+    border: 1px solid var(--border-color);
+  }
+  .d3-text-metrics .token.match {
+    background: oklch(from var(--primary-color) calc(l + 0.35) c h / 0.35);
+    border-color: var(--primary-color);
+    font-weight: 600;
+  }
+  [data-theme="dark"] .d3-text-metrics .token.match {
+    background: oklch(from var(--primary-color) calc(l + 0.25) c h / 0.4);
+  }
+  .d3-text-metrics .controls {
+    display: flex;
+    justify-content: center;
+    margin-bottom: 16px;
+  }
+  .d3-text-metrics select {
+    font-size: 12px;
+    padding: 6px 24px 6px 10px;
+    border: 1px solid var(--border-color);
+    border-radius: 6px;
+    background: var(--surface-bg);
+    color: var(--text-color);
+    cursor: pointer;
+    appearance: none;
+    background-image: url("data:image/svg+xml;charset=UTF-8,%3csvg xmlns='http://www.w3.org/2000/svg' viewBox='0 0 24 24' fill='none' stroke='currentColor' stroke-width='2' stroke-linecap='round' stroke-linejoin='round'%3e%3cpolyline points='6 9 12 15 18 9'%3e%3c/polyline%3e%3c/svg%3e");
+    background-repeat: no-repeat;
+    background-position: right 6px center;
+    background-size: 12px;
+  }
+  @media (max-width: 768px) {
+    .d3-text-metrics .metrics-grid {
+      grid-template-columns: 1fr;
+    }
+  }
+</style>
+<script>
+  (() => {
+    const bootstrap = () => {
+      const scriptEl = document.currentScript;
+      let container = scriptEl ? scriptEl.previousElementSibling : null;
+      if (!(container && container.classList && container.classList.contains('d3-text-metrics'))) {
+        const candidates = Array.from(document.querySelectorAll('.d3-text-metrics'))
+          .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
+        container = candidates[candidates.length - 1] || null;
+      }
+      if (!container) return;
+      if (container.dataset) {
+        if (container.dataset.mounted === 'true') return;
+        container.dataset.mounted = 'true';
+      }
+      // Single example: Cat Evaluator
+      const reference = "My cat loves doing model evaluation and testing benchmarks";
+      const prediction = "My cat enjoys model evaluation and testing models";
+      const tokenize = (text) => text.toLowerCase().trim().split(/\s+/).filter(Boolean); // filter drops the empty token that ''.split() returns
+      const getNgrams = (tokens, n) => {
+        const ngrams = [];
+        for (let i = 0; i <= tokens.length - n; i++) {
+          ngrams.push(tokens.slice(i, i + n));
+        }
+        return ngrams;
+      };
+      const computeExactMatch = (pred, ref) => {
+        return pred.toLowerCase().trim() === ref.toLowerCase().trim() ? 1.0 : 0.0;
+      };
+      const computeBleu = (pred, ref) => {
+        const predTokens = tokenize(pred);
+        const refTokens = tokenize(ref);
+        if (predTokens.length === 0) return { score: 0, details: [] };
+        const details = [];
+        const precisions = [];
+        for (let n = 1; n <= 3; n++) { // simplified: up to trigrams (standard BLEU uses up to 4-grams)
+          const predNgrams = getNgrams(predTokens, n);
+          const refNgrams = getNgrams(refTokens, n);
+          if (predNgrams.length === 0) {
+            precisions.push(0);
+            continue;
+          }
+          const refCounts = {};
+          refNgrams.forEach(ng => {
+            const key = ng.join(' ');
+            refCounts[key] = (refCounts[key] || 0) + 1;
+          });
+          let matches = 0;
+          const matchedNgrams = [];
+          const predCounts = {};
+          predNgrams.forEach(ng => {
+            const key = ng.join(' ');
+            predCounts[key] = (predCounts[key] || 0) + 1;
+            if (refCounts[key] && predCounts[key] <= refCounts[key]) {
+              matches++;
+              if (!matchedNgrams.includes(key)) matchedNgrams.push(key);
+            }
+          });
+          const precision = matches / predNgrams.length;
+          precisions.push(precision);
+          details.push({ n, matches, total: predNgrams.length, matchedNgrams });
+        }
+        const validPrecisions = precisions.filter(p => p > 0); // simplified: zero precisions are dropped and no brevity penalty is applied
+        const score = validPrecisions.length > 0
+          ? Math.exp(validPrecisions.reduce((sum, p) => sum + Math.log(p), 0) / validPrecisions.length)
+          : 0;
+        return { score, details };
+      };
+      const computeRouge1 = (pred, ref) => {
+        const predTokens = tokenize(pred);
+        const refTokens = tokenize(ref);
+        const predCounts = {};
+        const refCounts = {};
+        predTokens.forEach(t => predCounts[t] = (predCounts[t] || 0) + 1);
+        refTokens.forEach(t => refCounts[t] = (refCounts[t] || 0) + 1);
+        let overlap = 0;
+        const matchedTokens = [];
+        Object.keys(refCounts).forEach(token => {
+          if (predCounts[token]) {
+            overlap += Math.min(predCounts[token], refCounts[token]);
+            matchedTokens.push(token);
+          }
+        });
+        const recall = refTokens.length > 0 ? overlap / refTokens.length : 0;
+        const precision = predTokens.length > 0 ? overlap / predTokens.length : 0;
+        const f1 = (precision + recall) > 0 ? 2 * precision * recall / (precision + recall) : 0;
+        return { score: f1, recall, precision, matchedTokens };
+      };
+      const computeRouge2 = (pred, ref) => {
+        const predTokens = tokenize(pred);
+        const refTokens = tokenize(ref);
+        const predBigrams = getNgrams(predTokens, 2);
+        const refBigrams = getNgrams(refTokens, 2);
+        if (refBigrams.length === 0) {
+          return { score: 0, recall: 0, precision: 0, matchedBigrams: [] };
+        }
+        const predCounts = {};
+        const refCounts = {};
+        predBigrams.forEach(bg => {
+          const key = bg.join(' ');
+          predCounts[key] = (predCounts[key] || 0) + 1;
+        });
+        refBigrams.forEach(bg => {
+          const key = bg.join(' ');
+          refCounts[key] = (refCounts[key] || 0) + 1;
+        });
+        let overlap = 0;
+        const matchedBigrams = [];
+        Object.keys(refCounts).forEach(bigram => {
+          if (predCounts[bigram]) {
+            overlap += Math.min(predCounts[bigram], refCounts[bigram]);
+            matchedBigrams.push(bigram);
+          }
+        });
+        const recall = refBigrams.length > 0 ? overlap / refBigrams.length : 0;
+        const precision = predBigrams.length > 0 ? overlap / predBigrams.length : 0;
+        const f1 = (precision + recall) > 0 ? 2 * precision * recall / (precision + recall) : 0;
+        return { score: f1, recall, precision, matchedBigrams };
+      };
+      const computeEditDistanceWithOps = (s1, s2) => {
+        const m = s1.length;
+        const n = s2.length;
+        // Create DP table
+        const dp = Array(m + 1).fill(null).map(() => Array(n + 1).fill(0));
+        // Initialize
+        for (let i = 0; i <= m; i++) dp[i][0] = i;
+        for (let j = 0; j <= n; j++) dp[0][j] = j;
+        // Fill DP table
+        for (let i = 1; i <= m; i++) {
+          for (let j = 1; j <= n; j++) {
+            if (s1[i - 1] === s2[j - 1]) {
+              dp[i][j] = dp[i - 1][j - 1];
+            } else {
+              dp[i][j] = 1 + Math.min(
+                dp[i - 1][j],     // delete
+                dp[i][j - 1],     // insert
+                dp[i - 1][j - 1]  // substitute
+              );
+            }
+          }
+        }
+        // Traceback to find operations
+        const operations = [];
+        let i = m, j = n;
+        while (i > 0 || j > 0) {
+          if (i === 0) {
+            operations.unshift({ type: 'insert', value: s2[j - 1], pos: j });
+            j--;
+          } else if (j === 0) {
+            operations.unshift({ type: 'delete', value: s1[i - 1], pos: i });
+            i--;
+          } else if (s1[i - 1] === s2[j - 1]) {
+            i--;
+            j--;
+          } else {
+            const deleteCost = dp[i - 1][j];
+            const insertCost = dp[i][j - 1];
+            const substituteCost = dp[i - 1][j - 1];
+            if (substituteCost <= deleteCost && substituteCost <= insertCost) {
+              operations.unshift({ type: 'substitute', from: s1[i - 1], to: s2[j - 1], pos: i });
+              i--;
+              j--;
+            } else if (deleteCost <= insertCost) {
+              operations.unshift({ type: 'delete', value: s1[i - 1], pos: i });
+              i--;
+            } else {
+              operations.unshift({ type: 'insert', value: s2[j - 1], pos: j });
+              j--;
+            }
+          }
+        }
+        return { distance: dp[m][n], operations };
+      };
+      const computeTer = (pred, ref) => {
+        const predTokens = tokenize(pred);
+        const refTokens = tokenize(ref);
+        const result = computeEditDistanceWithOps(predTokens, refTokens); // word-level Levenshtein; full TER also counts block shifts
+        const score = refTokens.length > 0 ? result.distance / refTokens.length : 1.0;
+        return {
+          score,
+          edits: result.distance,
+          refLength: refTokens.length,
+          operations: result.operations
+        };
+      };
+      const computeBleurtMock = (pred, ref) => {
+        const predTokens = new Set(tokenize(pred));
+        const refTokens = new Set(tokenize(ref));
+        const intersection = new Set([...predTokens].filter(t => refTokens.has(t)));
+        const union = new Set([...predTokens, ...refTokens]);
+        const jaccard = union.size > 0 ? intersection.size / union.size : 0;
+        return { score: jaccard * 1.5 - 0.5, jaccard }; // placeholder: rescales Jaccard from [0, 1] to [-0.5, 1], not a real BLEURT score
+      };
+      const render = () => {
+        const exactMatch = computeExactMatch(prediction, reference);
+        const bleu = computeBleu(prediction, reference);
+        const rouge1 = computeRouge1(prediction, reference);
+        const rouge2 = computeRouge2(prediction, reference);
+        const ter = computeTer(prediction, reference);
+        const bleurt = computeBleurtMock(prediction, reference);
+        container.innerHTML = `
+          <div class="example-text">
+            <span class="label">REF:</span>${reference}
+          </div>
+          <div class="example-text">
+            <span class="label">PRED:</span>${prediction}
+          </div>
+          <div class="metrics-grid">
+            <!-- Row 1: Exact Match, TER, BLEURT -->
+            <div class="metric-box">
+              <div class="metric-name">Exact Match</div>
+              <div class="metric-score">${exactMatch.toFixed(1)}</div>
+              <div class="metric-detail">Binary: 1 or 0</div>
+              <div class="visualization">
+                <div style="margin: 4px 0; font-size: 14px;">
+                  ${exactMatch === 1 ? '✓ Strings are identical' : '✗ Strings differ'}
+                </div>
+                <div style="margin-top: 6px; font-size: 9px; color: var(--muted-color);">
+                  Strictest metric: no partial credit
+                </div>
+              </div>
+            </div>
+            <div class="metric-box">
+              <div class="metric-name">Translation Error Rate</div>
+              <div class="metric-score">${ter.score.toFixed(3)}</div>
+              <div class="metric-detail">Word edit distance / reference length</div>
+              <div class="visualization">
+                <div style="margin: 4px 0;">
+                  <strong>${ter.edits}</strong> edits / <strong>${ter.refLength}</strong> words = <strong>${ter.score.toFixed(3)}</strong>
+                </div>
+                ${ter.operations.length > 0 ? `
+                  <div style="margin-top: 8px; font-size: 10px;">
+                    <div style="margin-bottom: 4px; color: var(--muted-color);">Edit operations:</div>
+                    ${ter.operations.map((op, idx) => {
+                      if (op.type === 'substitute') {
+                        return `<div style="margin: 2px 0;">• Replace "<strong>${op.from}</strong>" → "<strong>${op.to}</strong>"</div>`;
+                      } else if (op.type === 'delete') {
+                        return `<div style="margin: 2px 0;">• Delete "<strong>${op.value}</strong>"</div>`;
+                      } else if (op.type === 'insert') {
+                        return `<div style="margin: 2px 0;">• Insert "<strong>${op.value}</strong>"</div>`;
+                      }
+                    }).join('')}
+                  </div>
+                ` : ''}
+                <div style="margin-top: 6px; font-size: 9px; color: var(--muted-color);">
+                  Lower is better (0 = identical)
+                </div>
+              </div>
+            </div>
+            <div class="metric-box">
+              <div class="metric-name">BLEURT</div>
+              <div class="metric-score">${bleurt.score.toFixed(3)}</div>
+              <div class="metric-detail">Semantic similarity</div>
+              <div class="visualization">
+                <div style="margin-top: 6px; font-size: 9px; color: var(--muted-color); font-style: italic;">
+                  Note: Real BLEURT is a BERT-based model fine-tuned on human ratings. This demo substitutes a rescaled Jaccard token overlap.
+                </div>
+              </div>
+            </div>
+            <!-- Row 2: BLEU, ROUGE-1, ROUGE-2 -->
+            <div class="metric-box">
+              <div class="metric-name">BLEU</div>
+              <div class="metric-score">${bleu.score.toFixed(3)}</div>
+              <div class="metric-detail">N-gram precision-based</div>
+              <div class="visualization">
+                ${bleu.details.map(d => `
+                  <div style="margin: 4px 0;">
+                    <strong>${d.n}-gram:</strong> ${d.matches}/${d.total} (${d.total > 0 ? (d.matches / d.total * 100).toFixed(0) : 0}%)
+                  </div>
+                  <div style="margin: 2px 0;">
+                    ${d.matchedNgrams.slice(0, 3).map(ng => `<span class="token match">${ng}</span>`).join('')}
+                    ${d.matchedNgrams.length > 3 ? `<span style="color: var(--muted-color); font-size: 10px;">+${d.matchedNgrams.length - 3} more</span>` : ''}
+                  </div>
+                `).join('')}
+              </div>
+            </div>
+            <div class="metric-box">
+              <div class="metric-name">ROUGE-1</div>
+              <div class="metric-score">${rouge1.score.toFixed(3)}</div>
+              <div class="metric-detail">Unigram-based F1</div>
+              <div class="visualization">
+                <div style="margin: 4px 0;">
+                  <strong>Recall:</strong> ${(rouge1.recall * 100).toFixed(0)}% | <strong>Precision:</strong> ${(rouge1.precision * 100).toFixed(0)}%
+                </div>
+                <div style="margin-top: 6px; font-size: 9px; color: var(--muted-color);">
+                  Matched unigrams:
+                </div>
+                ${rouge1.matchedTokens.length > 0 ? `
+                  <div style="margin: 2px 0;">
+                    ${rouge1.matchedTokens.slice(0, 5).map(t => `<span class="token match">${t}</span>`).join('')}
+                    ${rouge1.matchedTokens.length > 5 ? `<span style="color: var(--muted-color); font-size: 10px;">+${rouge1.matchedTokens.length - 5} more</span>` : ''}
+                  </div>
+                ` : '<div style="margin: 2px 0; font-size: 10px; color: var(--muted-color);">No unigram matches</div>'}
+              </div>
+            </div>
+            <div class="metric-box">
+              <div class="metric-name">ROUGE-2</div>
+              <div class="metric-score">${rouge2.score.toFixed(3)}</div>
+              <div class="metric-detail">Bigram-based F1</div>
+              <div class="visualization">
+                <div style="margin: 4px 0;">
+                  <strong>Recall:</strong> ${(rouge2.recall * 100).toFixed(0)}% | <strong>Precision:</strong> ${(rouge2.precision * 100).toFixed(0)}%
+                </div>
+                <div style="margin-top: 6px; font-size: 9px; color: var(--muted-color);">
+                  Matched bigrams:
+                </div>
+                ${rouge2.matchedBigrams.length > 0 ? `
+                  <div style="margin: 2px 0;">
+                    ${rouge2.matchedBigrams.slice(0, 3).map(bg => `<span class="token match">${bg}</span>`).join('')}
+                    ${rouge2.matchedBigrams.length > 3 ? `<span style="color: var(--muted-color); font-size: 10px;">+${rouge2.matchedBigrams.length - 3} more</span>` : ''}
+                  </div>
+                ` : '<div style="margin: 2px 0; font-size: 10px; color: var(--muted-color);">No bigram matches</div>'}
+              </div>
+            </div>
+          </div>
+        `;
+      };
+      render();
+    };
+    if (document.readyState === 'loading') {
+      document.addEventListener('DOMContentLoaded', bootstrap, { once: true });
+    } else {
+      bootstrap();
+    }
+  })();
+</script>