Clémentine committed on
Commit
112a899
·
1 Parent(s): 9a4bbbe

fix + more evals + figure

app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx CHANGED
@@ -183,11 +183,13 @@ When models generate outputs, sampling multiple times and aggregating results ca
183
  This is particularly important for complex reasoning tasks where models may arrive at correct answers through different paths.
184
 
185
  Common sampling-based metrics are:
186
- - **pass@k over n**: Given n generated samples, measures whether at least k passes the test.
187
- <Sidenote> You'll find two functions for this metric: computed trivially as: $\text{pass}@k = (c >= k)$, or computed with an unbiased estimator with: $\text{pass}@k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$ where c is the number of correct samples among n total samples. </Sidenote>
188
  - **maj@n** (majority voting): Sample n generations and take the most frequent answer. This helps filter out spurious outputs and works particularly well when the model's correct reasoning path is more consistent than its errors. Commonly used for math and reasoning tasks.
189
- - **cot@n** (chain-of-thought sampling): Sample n reasoning traces and evaluate them. Can be combined with majority voting or a pass@k (sample n reasoning chains, extract final answers, take majority or a threshold).
190
- - **avg@n** (stable average score): Average the scores across n samples. It's a more stable estimator of performance than using "best" or "most common" case.
 
 
191
 
192
  When you use sampling evaluations, make sure to always report all sampling parameters (temperature, top-p, k value) as they significantly affect results.
193
 
@@ -546,7 +548,9 @@ However, some recent [research](https://arxiv.org/abs/2408.02442) seems to show
546
  - [Interleaved generation](https://github.com/guidance-ai/guidance?tab=readme-ov-file#guidance-acceleration), another method to constrain generations for some specific output formats
547
  </Note>
548
 
549
- ### Confidence and score reporting
 
 
550
 
551
  When reporting evaluation results, it's critical to include **confidence intervals** alongside point estimates.
552
 
@@ -554,3 +558,5 @@ These confidence intervals from the raw scores can be obtained from standard dev
554
 
555
  You can also compute these with prompt variations, by asking the same questions in slightly different ways, or re-running on the same samples with different prompt formats.
556
 
 
 
 
183
  This is particularly important for complex reasoning tasks where models may arrive at correct answers through different paths.
184
 
185
  Common sampling-based metrics are:
186
+ - **pass@k over n**: Given n generated samples, measures whether at least k of them pass the test. <Sidenote> You'll find two formulations of this metric: computed trivially as $\text{pass}@k = (c \geq k)$, or computed with an unbiased estimator as $\text{pass}@k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$, where c is the number of correct samples among n total samples. </Sidenote>
187
+
188
  - **maj@n** (majority voting): Sample n generations and take the most frequent answer. This helps filter out spurious outputs and works particularly well when the model's correct reasoning path is more consistent than its errors. Commonly used for math and reasoning tasks.
189
+ - **cot@n** (chain-of-thought sampling): Sample n reasoning traces and evaluate them. Can be combined with majority voting or pass@k (sample n reasoning chains, extract the final answers, then take the majority or apply a threshold).
190
+ - **avg@n** (stable average score): Average the scores across n samples. It's a more stable estimator of performance than using the "best" or "most common" case (see the code sketch after the figure below).
191
+
192
+ <HtmlEmbed src="d3-sampling-metrics.html" title="Sampling metrics comparison" />
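For illustration, here is a minimal sketch of how these metrics could be computed from the per-sample answers shown in the figure above; the helper names and toy data are illustrative, not taken from a specific evaluation library.

```python
import math
from collections import Counter

def pass_at_k_trivial(c: int, k: int) -> bool:
    """Trivial pass@k: did at least k of the n samples pass?"""
    return c >= k

def pass_at_k_unbiased(n: int, c: int, k: int) -> float:
    """Unbiased estimator 1 - C(n-c, k) / C(n, k), with c correct out of n."""
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def maj_at_n(answers: list[str]) -> str:
    """maj@n: most frequent extracted answer across the n samples."""
    return Counter(answers).most_common(1)[0][0]

def avg_at_n(scores: list[float]) -> float:
    """avg@n: mean score across the n samples."""
    return sum(scores) / len(scores)

# Toy data: 5 samples for "15 + 27 = ?", of which 3 are correct
answers = ["42", "42", "43", "42", "41"]
scores = [1.0 if a == "42" else 0.0 for a in answers]
c = int(sum(scores))

print(pass_at_k_trivial(c, k=3))          # True (at least 3 of 5 correct)
print(pass_at_k_unbiased(n=5, c=c, k=1))  # 0.6
print(maj_at_n(answers))                  # "42"
print(avg_at_n(scores))                   # 0.6
```

The trivial form answers a yes/no question about this particular batch of n samples, while the unbiased estimator extrapolates to the expected success rate if you had drawn only k samples.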
193
 
194
  When you use sampling evaluations, make sure to always report all sampling parameters (temperature, top-p, k value) as they significantly affect results.
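In practice, this can be as simple as attaching the full sampling configuration to every reported score; the record below is a sketch, and the field names are purely illustrative.

```python
# Illustrative record to publish alongside a sampling-based score
result = {
    "metric": "maj@8",
    "score": 0.74,
    "n_samples": 8,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 50,
    "seed": 1234,
    "prompt_format": "0-shot, chain-of-thought instruction",
}
```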
195
 
 
548
  - [Interleaved generation](https://github.com/guidance-ai/guidance?tab=readme-ov-file#guidance-acceleration), another method to constrain generations for some specific output formats
549
  </Note>
550
 
551
+ ### The forgotten children of evaluation
552
+
553
+ #### Confidence
554
 
555
  When reporting evaluation results, it's critical to include **confidence intervals** alongside point estimates.
556
 
 
558
 
559
  You can also compute these with prompt variations, by asking the same questions in slightly different ways, or re-running on the same samples with different prompt formats.
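One simple way to obtain such intervals is a percentile bootstrap over per-question scores. The sketch below assumes binary per-question accuracies; the function name and toy numbers are illustrative.

```python
import random

def bootstrap_ci(scores: list[float], n_resamples: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean benchmark score.

    Resamples the per-question scores with replacement and returns the
    (alpha/2, 1 - alpha/2) percentiles of the resampled means.
    """
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Toy example: per-question accuracies on a 200-item benchmark (point estimate 0.65)
scores = [1.0] * 130 + [0.0] * 70
low, high = bootstrap_ci(scores)
print(f"accuracy = 0.65, 95% CI ~ [{low:.2f}, {high:.2f}]")
```

The same resampling can be applied over prompt variations: treat each (question, prompt format) score as a draw and bootstrap over those instead.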
560
 
561
+ #### Cost
562
+
app/src/content/chapters/general-knowledge/2025-evaluations-for-useful-models.mdx CHANGED
@@ -61,6 +61,8 @@ For post training, you want more holistic evaluations, and a couple benchmarks m
61
 
62
  [**SweBench**](https://openreview.net/pdf?id=VTF8yNQM66) (2024) is a better-known and more complete version of this, also using GitHub, but this time testing whether models can solve existing issues, which requires logic understanding, cross-file editing and execution, long-context reasoning, etc.
63
 
 
 
64
  At this time, I would recommend following LiveCodeBench, AiderBench and the higher quality subset of SWE-Bench (SWE-Bench verified), and reading the [METR report](https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/) on actual code assistant usefulness.
65
 
66
  #### Long context
@@ -98,6 +100,7 @@ Lastly, with the creation of MCPs, some benchmarks arose to test MCP oriented to
98
 
99
  [**LiveMCPBench**](https://arxiv.org/abs/2508.01780) (2025) provides a large locally deployable collection of MCP servers to test how good models are at discriminating between tools to accomplish tasks. Best models are already reaching 80%, so we're close to saturation. However, testing whether models can select the proper tools from very long lists is a good use case which will be increasingly important as the web goes MCP.
100
 
 
101
  (By the way, here's a cool [doc](https://www.anthropic.com/engineering/writing-tools-for-agents) on how to write good tools.)
102
 
103
  While testing individual capabilities provides valuable signal, real-world assistant performance comes from how these capabilities combine. A model might excel at reasoning but fail when that reasoning must be integrated with tool calling and long context management simultaneously, so we need evaluations requiring the orchestration of multiple capabilities together.
@@ -134,7 +137,8 @@ The most famous formal evaluation among these is probably [ARC-AGI](https://arcp
134
 
135
  The community and model providers have explored a number of existing games with LLMs. Single-player adventure games/RPGs like [TextQuests](https://huggingface.co/blog/textquests) (2025) or [Pokemon](https://github.com/benchflow-ai/benchflow/tree/main/libs/pokemon-gym) (2024) (on Twitch for [Claude](https://www.twitch.tv/claudeplayspokemon) and [Gemini](https://www.twitch.tv/gemini_plays_pokemon), for example) require very long-range planning to reach objectives, which in turn requires adequate long-context memory management, reasoning, and backtracking abilities. The same abilities are needed for single-player survival games like [Crafter](https://arxiv.org/abs/2109.06780) (2021, Minecraft-inspired). A number of single-player game environments have been integrated into the [Balrog](https://arxiv.org/pdf/2411.13543) (2024) benchmark.
136
 
137
- Competitive bluffing games like [Poker](https://arxiv.org/html/2501.08328v1) (2025) or Mafia variations like [Town of Salem](https://github.com/summersonnn/Town-Of-Salem-with-LLMs) (2025) and Werewolf (2025, [here](https://arxiv.org/abs/2407.13943)/[there](https://werewolf.foaster.ai/)) are very interesting to test logic, reasoning, as well as deception abilities. Claude Opus 4 is for example incapable of winning Town of Salem as a vampire (deceptive role) but does well as a peasant (non deceptive role). Cooperative games like Hanabi can also be used to test adaptability and communication ability in a constrained environment.
 
138
 
139
  What's also very neat about these is that they have a single and unambiguous pass/fail metric: did the LLM win the game or not? At the moment, if I were to use these to evaluate models I would probably look at TextQuests for abilities and Town of Salem for safety.
140
 
@@ -149,16 +153,16 @@ In the last year, a new category of impossible to contaminate tasks emerged: for
149
 
150
  A similar approach is used to generate questions in [Arbitrage](https://arxiv.org/pdf/2412.18544), the core difference being the time horizon: events there should be resolved in 2028.
151
 
152
- In a similar vein, you'll also find arenas where LLMs are provided with money to actively trade on financial markets - these experiments are less likely to give meaningful results, as, because of their costs, they tend to be run once per model only, so you get no statistical significance there.
153
 
154
  <Note title="TLDR" emoji="🎯">
155
  The landscape of evaluation has evolved with the jumps in capabilities, from testing isolated skills to measuring integrated performance in more realistic scenarios.
156
 
157
  As of Nov 2025, I recommend using:
158
 
159
- - **Core capabilities** (for model builders): Old capabilities evals for training, and for post training MATH500/AIME24, GPQA, IFEval, SWE-Bench, a long range eval of your choice like HELMET, TauBench or BFCL if you're targetting tool use
160
  - **Core capabilities** (for comparing models at inference): IFBench, HLE, MathArena, AiderBench and LiveCodeBench, MCP-Universe
161
- - **Long horizon tasks** (for real-world performance): GAIA, DABStep, SciCode, or domain specific evaluations for your use cases
162
  - **Games** (for some extra fun in measuring robustness and adaptability): ARC-AGI3 when it's out, TextQuests, Town of Salem if you're interested in safety, or any other game you like which goes beyond Poker/Chess/Go.
163
 
164
  The field is moving toward evaluations that test capability orchestration rather than isolated skills, closer to actual use. This matches our goal of building models that "work well"—systems that can reliably combine core capabilities and tool use, with good orchestration, to solve actual problems.
 
61
 
62
  [**SweBench**](https://openreview.net/pdf?id=VTF8yNQM66) (2024) is a better-known and more complete version of this, also using GitHub, but this time testing whether models can solve existing issues, which requires logic understanding, cross-file editing and execution, long-context reasoning, etc.
63
 
64
+ [**CodeClash**](https://codeclash.ai/) (2025) is the coding version of an arena, where models write code that competes against other models' code, then edit and iterate.
65
+
66
  At this time, I would recommend following LiveCodeBench, AiderBench and the higher quality subset of SWE-Bench (SWE-Bench verified), and reading the [METR report](https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/) on actual code assistant usefulness.
67
 
68
  #### Long context
 
100
 
101
  [**LiveMCPBench**](https://arxiv.org/abs/2508.01780) (2025) provides a large locally deployable collection of MCP servers to test how good models are at discriminating between tools to accomplish tasks. Best models are already reaching 80%, so we're close to saturation. However, testing whether models can select the proper tools from very long lists is a good use case which will be increasingly important as the web goes MCP.
102
 
103
+
104
  (By the way, here's a cool [doc](https://www.anthropic.com/engineering/writing-tools-for-agents) on how to write good tools.)
105
 
106
  While testing individual capabilities provides valuable signal, real-world assistant performance comes from how these capabilities combine. A model might excel at reasoning but fail when that reasoning must be integrated with tool calling and long context management simultaneously, so we need evaluations requiring the orchestration of multiple capabilities together.
 
137
 
138
  The community and model providers have explored a number of existing games with LLMs. Single-player adventure games/RPGs like [TextQuests](https://huggingface.co/blog/textquests) (2025) or [Pokemon](https://github.com/benchflow-ai/benchflow/tree/main/libs/pokemon-gym) (2024) (on Twitch for [Claude](https://www.twitch.tv/claudeplayspokemon) and [Gemini](https://www.twitch.tv/gemini_plays_pokemon), for example) require very long-range planning to reach objectives, which in turn requires adequate long-context memory management, reasoning, and backtracking abilities. The same abilities are needed for single-player survival games like [Crafter](https://arxiv.org/abs/2109.06780) (2021, Minecraft-inspired). A number of single-player game environments have been integrated into the [Balrog](https://arxiv.org/pdf/2411.13543) (2024) benchmark.
139
 
140
+ Competitive bluffing games like [Poker](https://arxiv.org/html/2501.08328v1) (2025), Mafia variations like [Town of Salem](https://github.com/summersonnn/Town-Of-Salem-with-LLMs) (2025) and Werewolf (2025, [here](https://arxiv.org/abs/2407.13943)/[there](https://werewolf.foaster.ai/)), or [Among Us](antimlabs.com/amongais) are very interesting for testing logic and reasoning, as well as deception abilities. Claude Opus 4, for example, is incapable of winning Town of Salem as a vampire (a deceptive role) but does well as a peasant (a non-deceptive role). Cooperative games like [Hanabi](https://arxiv.org/abs/2510.04980) can also be used to test adaptability and communication ability in a constrained environment.
142
 
143
  What's also very neat about these is that they have a single and unambiguous pass/fail metric: did the LLM win the game or not? At the moment, if I were to use these to evaluate models I would probably look at TextQuests for abilities and Town of Salem for safety.
144
 
 
153
 
154
  A similar approach is used to generate questions in [Arbitrage](https://arxiv.org/pdf/2412.18544), the core difference being the time horizon: events there should be resolved in 2028.
155
 
156
+ In a similar vein, you'll also find arenas where LLMs are provided with money to actively trade on financial markets (like Alpha Arena or Trading Agents). These experiments are less likely to give meaningful results: because of their cost, they tend to be run only once per model, so you get no statistical significance.
157
 
158
  <Note title="TLDR" emoji="🎯">
159
  The landscape of evaluation has evolved with the jumps in capabilities, from testing isolated skills to measuring integrated performance in more realistic scenarios.
160
 
161
  As of Nov 2025, I recommend using:
162
 
163
+ - **Core capabilities** (for model builders): Old capabilities evals for training, and for post-training AIME26 when it comes out, GPQA, IFEval, SWE-Bench, a long-range eval of your choice like HELMET, and TauBench or BFCL if you're targeting tool use
164
  - **Core capabilities** (for comparing models at inference): IFBench, HLE, MathArena, AiderBench and LiveCodeBench, MCP-Universe
165
+ - **Long horizon tasks** (for real-world performance): GAIA2, DABStep, SciCode, or domain-specific evaluations for your use cases
166
  - **Games** (for some extra fun in measuring robustness and adaptability): ARC-AGI3 when it's out, TextQuests, Town of Salem if you're interested in safety, or any other game you like which goes beyond Poker/Chess/Go.
167
 
168
  The field is moving toward evaluations that test capability orchestration rather than isolated skills, closer to actual use. This matches our goal of building models that "work well"—systems that can reliably combine core capabilities and tool use, with good orchestration, to solve actual problems.
app/src/content/embeds/d3-sampling-metrics.html ADDED
@@ -0,0 +1,506 @@
1
+ <div class="d3-sampling-metrics"></div>
2
+
3
+ <style>
4
+ .d3-sampling-metrics {
5
+ font-family: var(--default-font-family);
6
+ background: transparent;
7
+ border: none;
8
+ border-radius: 0;
9
+ padding: var(--spacing-4) 0;
10
+ width: 100%;
11
+ margin: 0 auto;
12
+ position: relative;
13
+ }
14
+
15
+ .d3-sampling-metrics svg {
16
+ width: 100%;
17
+ height: auto;
18
+ display: block;
19
+ }
20
+
21
+ .d3-sampling-metrics .sample-box {
22
+ stroke-width: 2;
23
+ transition: all 0.3s ease;
24
+ }
25
+
26
+ .d3-sampling-metrics .sample-box:hover {
27
+ filter: brightness(1.1);
28
+ stroke-width: 3;
29
+ }
30
+
31
+ .d3-sampling-metrics .metric-box {
32
+ stroke-width: 2;
33
+ transition: all 0.3s ease;
34
+ }
35
+
36
+ .d3-sampling-metrics .metric-box:hover {
37
+ filter: brightness(1.1);
38
+ stroke-width: 3;
39
+ }
40
+
41
+ .d3-sampling-metrics .sample-label {
42
+ fill: var(--text-color);
43
+ font-size: 11px;
44
+ font-weight: 600;
45
+ pointer-events: none;
46
+ user-select: none;
47
+ }
48
+
49
+ .d3-sampling-metrics .sample-answer {
50
+ fill: var(--text-color);
51
+ font-size: 10px;
52
+ font-weight: 500;
53
+ pointer-events: none;
54
+ user-select: none;
55
+ }
56
+
57
+ .d3-sampling-metrics .metric-label {
58
+ fill: var(--text-color);
59
+ font-size: 13px;
60
+ font-weight: 600;
61
+ pointer-events: none;
62
+ user-select: none;
63
+ }
64
+
65
+ .d3-sampling-metrics .metric-description {
66
+ fill: var(--muted-color);
67
+ font-size: 10px;
68
+ font-weight: 500;
69
+ pointer-events: none;
70
+ user-select: none;
71
+ }
72
+
73
+ .d3-sampling-metrics .metric-result {
74
+ font-size: 16px;
75
+ font-weight: 700;
76
+ pointer-events: none;
77
+ user-select: none;
78
+ }
79
+
80
+ .d3-sampling-metrics .section-title {
81
+ fill: var(--text-color);
82
+ font-size: 12px;
83
+ font-weight: 700;
84
+ text-transform: uppercase;
85
+ letter-spacing: 0.05em;
86
+ }
87
+
88
+ .d3-sampling-metrics .question-text {
89
+ fill: var(--text-color);
90
+ font-size: 14px;
91
+ font-weight: 600;
92
+ }
93
+
94
+ .d3-sampling-metrics .link-line {
95
+ fill: none;
96
+ stroke-width: 1.5;
97
+ transition: all 0.3s ease;
98
+ opacity: 0.3;
99
+ }
100
+
101
+ .d3-sampling-metrics .marker {
102
+ opacity: 0.5;
103
+ }
104
+
105
+ .d3-sampling-metrics .d3-tooltip {
106
+ position: absolute;
107
+ background: var(--surface-bg);
108
+ border: 1px solid var(--border-color);
109
+ border-radius: 8px;
110
+ padding: 8px 10px;
111
+ font-size: 12px;
112
+ pointer-events: none;
113
+ opacity: 0;
114
+ transition: opacity 0.12s ease;
115
+ box-shadow: 0 4px 12px rgba(0, 0, 0, 0.15);
116
+ z-index: 1000;
117
+ max-width: 350px;
118
+ line-height: 1.35;
119
+ white-space: pre-line;
120
+ color: var(--text-color);
121
+ transform: translate(-9999px, -9999px);
122
+ }
123
+
124
+ @media (max-width: 768px) {
125
+ .d3-sampling-metrics .sample-label {
126
+ font-size: 10px;
127
+ }
128
+
129
+ .d3-sampling-metrics .sample-answer {
130
+ font-size: 9px;
131
+ }
132
+
133
+ .d3-sampling-metrics .metric-label {
134
+ font-size: 11px;
135
+ }
136
+
137
+ .d3-sampling-metrics .metric-description {
138
+ font-size: 9px;
139
+ }
140
+
141
+ .d3-sampling-metrics .metric-result {
142
+ font-size: 14px;
143
+ }
144
+ }
145
+ </style>
146
+
147
+ <script>
148
+ (() => {
149
+ const ensureD3 = (cb) => {
150
+ if (window.d3 && typeof window.d3.select === 'function') return cb();
151
+ let s = document.getElementById('d3-cdn-script');
152
+ if (!s) {
153
+ s = document.createElement('script');
154
+ s.id = 'd3-cdn-script';
155
+ s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
156
+ document.head.appendChild(s);
157
+ }
158
+ const onReady = () => {
159
+ if (window.d3 && typeof window.d3.select === 'function') cb();
160
+ };
161
+ s.addEventListener('load', onReady, { once: true });
162
+ if (window.d3) onReady();
163
+ };
164
+
165
+ const bootstrap = () => {
166
+ const scriptEl = document.currentScript;
167
+ let container = scriptEl ? scriptEl.previousElementSibling : null;
168
+ if (!(container && container.classList && container.classList.contains('d3-sampling-metrics'))) {
169
+ const candidates = Array.from(document.querySelectorAll('.d3-sampling-metrics'))
170
+ .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
171
+ container = candidates[candidates.length - 1] || null;
172
+ }
173
+
174
+ if (!container) return;
175
+
176
+ if (container.dataset) {
177
+ if (container.dataset.mounted === 'true') return;
178
+ container.dataset.mounted = 'true';
179
+ }
180
+
181
+ container.style.position = container.style.position || 'relative';
182
+
183
+ // Tooltip
184
+ let tip = container.querySelector('.d3-tooltip');
185
+ let tipInner;
186
+ if (!tip) {
187
+ tip = document.createElement('div');
188
+ tip.className = 'd3-tooltip';
189
+ tipInner = document.createElement('div');
190
+ tipInner.className = 'd3-tooltip__inner';
191
+ tipInner.style.textAlign = 'left';
192
+ tip.appendChild(tipInner);
193
+ container.appendChild(tip);
194
+ } else {
195
+ tipInner = tip.querySelector('.d3-tooltip__inner') || tip;
196
+ }
197
+
198
+ // Get colors from ColorPalettes or fallback
199
+ const getColors = () => {
200
+ if (window.ColorPalettes && typeof window.ColorPalettes.getColors === 'function') {
201
+ const cat = window.ColorPalettes.getColors('categorical', 5);
202
+ return {
203
+ correct: cat[2], // Green-ish
204
+ incorrect: cat[0], // Red-ish
205
+ metric: cat[4]
206
+ };
207
+ }
208
+ // Fallback to CSS variable based colors
209
+ const primaryColor = getComputedStyle(document.documentElement).getPropertyValue('--primary-color').trim() || '#6D4AFF';
210
+ return {
211
+ correct: '#4CAF50',
212
+ incorrect: '#F44336',
213
+ metric: primaryColor
214
+ };
215
+ };
216
+
217
+ // Example: Math problem "What is 15 + 27?"
218
+ // Correct answer: 42
219
+ // Samples with different answers
220
+ const samples = [
221
+ { id: 1, answer: '42', correct: true },
222
+ { id: 2, answer: '42', correct: true },
223
+ { id: 3, answer: '43', correct: false },
224
+ { id: 4, answer: '42', correct: true },
225
+ { id: 5, answer: '41', correct: false }
226
+ ];
227
+
228
+ const metrics = [
229
+ {
230
+ id: 'pass@1',
231
+ label: 'pass@1',
232
+ description: 'At least 1 correct',
233
+ result: '✓',
234
+ explanation: 'At least 1 of 5 samples is correct (we have 3 correct samples)',
235
+ usedSamples: [1]
236
+ },
237
+ {
238
+ id: 'pass@3',
239
+ label: 'pass@3',
240
+ description: 'At least 3 correct',
241
+ result: '✓',
242
+ explanation: 'At least 3 of 5 samples are correct (exactly 3 correct)',
243
+ usedSamples: [1, 2, 4]
244
+ },
245
+ {
246
+ id: 'maj@5',
247
+ label: 'maj@5',
248
+ description: 'Most frequent answer',
249
+ result: '42',
250
+ explanation: 'Most common answer: 42 appears 3 times vs 43 (1x) and 41 (1x)',
251
+ usedSamples: [1, 2, 3, 4, 5]
252
+ },
253
+ {
254
+ id: 'avg@5',
255
+ label: 'avg@5',
256
+ description: 'Average score',
257
+ result: '0.60',
258
+ explanation: 'Average correctness: 3 correct / 5 total = 0.60',
259
+ usedSamples: [1, 2, 3, 4, 5]
260
+ }
261
+ ];
262
+
263
+ const svg = d3.select(container).append('svg');
264
+ const g = svg.append('g');
265
+
266
+ // Arrow marker
267
+ svg.append('defs').append('marker')
268
+ .attr('id', 'arrow-sampling')
269
+ .attr('viewBox', '0 -5 10 10')
270
+ .attr('refX', 8)
271
+ .attr('refY', 0)
272
+ .attr('markerWidth', 5)
273
+ .attr('markerHeight', 5)
274
+ .attr('orient', 'auto')
275
+ .append('path')
276
+ .attr('d', 'M0,-5L10,0L0,5')
277
+ .attr('class', 'marker');
278
+
279
+ let width = 800;
280
+ let height = 500;
281
+
282
+ function render() {
283
+ width = container.clientWidth || 800;
284
+ height = Math.max(350, Math.round(width * 0.42));
285
+
286
+ svg.attr('width', width).attr('height', height);
287
+
288
+ const margin = { top: 60, right: 20, bottom: 20, left: 20 };
289
+ const innerWidth = width - margin.left - margin.right;
290
+ const innerHeight = height - margin.top - margin.bottom;
291
+
292
+ g.attr('transform', `translate(${margin.left},${margin.top})`);
293
+
294
+ // Clear previous content
295
+ g.selectAll('*').remove();
296
+
297
+ const colors = getColors();
298
+
299
+ // Question at the top
300
+ g.append('text')
301
+ .attr('class', 'question-text')
302
+ .attr('x', innerWidth / 2)
303
+ .attr('y', -35)
304
+ .attr('text-anchor', 'middle')
305
+ .text('Question: What is 15 + 27?');
306
+
307
+ g.append('text')
308
+ .attr('x', innerWidth / 2)
309
+ .attr('y', -18)
310
+ .attr('text-anchor', 'middle')
311
+ .attr('font-size', '11px')
312
+ .attr('fill', 'var(--muted-color)')
313
+ .text('(Correct answer: 42)');
314
+
315
+ // Layout
316
+ const sampleBoxWidth = Math.min(80, innerWidth * 0.12);
317
+ const sampleBoxHeight = 60;
318
+ const metricBoxWidth = Math.min(140, innerWidth * 0.22);
319
+ const metricBoxHeight = 75;
320
+
321
+ // Position samples in a row
322
+ const samplesY = 20;
323
+ const sampleSpacing = (innerWidth - sampleBoxWidth * samples.length) / (samples.length + 1);
324
+
325
+ const sampleNodes = samples.map((d, i) => ({
326
+ ...d,
327
+ x: sampleSpacing + i * (sampleBoxWidth + sampleSpacing),
328
+ y: samplesY,
329
+ width: sampleBoxWidth,
330
+ height: sampleBoxHeight
331
+ }));
332
+
333
+ // Position metrics below
334
+ const metricsY = samplesY + sampleBoxHeight + 60;
335
+ const metricSpacing = (innerWidth - metricBoxWidth * metrics.length) / (metrics.length + 1);
336
+
337
+ const metricNodes = metrics.map((d, i) => ({
338
+ ...d,
339
+ x: metricSpacing + i * (metricBoxWidth + metricSpacing),
340
+ y: metricsY,
341
+ width: metricBoxWidth,
342
+ height: metricBoxHeight
343
+ }));
344
+
345
+ // Section titles
346
+ g.append('text')
347
+ .attr('class', 'section-title')
348
+ .attr('x', innerWidth / 2)
349
+ .attr('y', samplesY - 10)
350
+ .attr('text-anchor', 'middle')
351
+ .text('5 SAMPLED GENERATIONS');
352
+
353
+ g.append('text')
354
+ .attr('class', 'section-title')
355
+ .attr('x', innerWidth / 2)
356
+ .attr('y', metricsY - 10)
357
+ .attr('text-anchor', 'middle')
358
+ .text('SAMPLING METRICS');
359
+
360
+ // Draw connection lines from samples to metrics
361
+ const linkGroup = g.append('g').attr('class', 'links');
362
+
363
+ metricNodes.forEach(metric => {
364
+ metric.usedSamples.forEach(sampleId => {
365
+ const sample = sampleNodes.find(s => s.id === sampleId);
366
+ if (sample) {
367
+ const sx = sample.x + sample.width / 2;
368
+ const sy = sample.y + sample.height;
369
+ const tx = metric.x + metric.width / 2;
370
+ const ty = metric.y;
371
+
372
+ linkGroup.append('line')
373
+ .attr('class', 'link-line')
374
+ .attr('x1', sx)
375
+ .attr('y1', sy)
376
+ .attr('x2', tx)
377
+ .attr('y2', ty)
378
+ .attr('stroke', colors.metric);
379
+ }
380
+ });
381
+ });
382
+
383
+ // Draw sample boxes
384
+ const sampleGroup = g.append('g').attr('class', 'samples');
385
+
386
+ const sampleBoxes = sampleGroup.selectAll('.sample')
387
+ .data(sampleNodes)
388
+ .join('g')
389
+ .attr('class', 'sample')
390
+ .attr('transform', d => `translate(${d.x},${d.y})`);
391
+
392
+ sampleBoxes.append('rect')
393
+ .attr('class', 'sample-box')
394
+ .attr('width', d => d.width)
395
+ .attr('height', d => d.height)
396
+ .attr('rx', 6)
397
+ .attr('fill', d => d.correct ? colors.correct : colors.incorrect)
398
+ .attr('fill-opacity', 0.3)
399
+ .attr('stroke', d => d.correct ? colors.correct : colors.incorrect)
400
+ .style('cursor', 'pointer')
401
+ .on('mouseenter', function(event, d) {
402
+ const status = d.correct ? 'Correct ✓' : 'Incorrect ✗';
403
+ tipInner.textContent = `Sample ${d.id}: "${d.answer}"\n${status}`;
404
+ tip.style.opacity = '1';
405
+ const [mx, my] = d3.pointer(event, container);
406
+ tip.style.transform = `translate(${mx + 10}px, ${my + 10}px)`;
407
+ })
408
+ .on('mouseleave', function() {
409
+ tip.style.opacity = '0';
410
+ tip.style.transform = 'translate(-9999px, -9999px)';
411
+ });
412
+
413
+ sampleBoxes.append('text')
414
+ .attr('class', 'sample-label')
415
+ .attr('x', d => d.width / 2)
416
+ .attr('y', 18)
417
+ .attr('text-anchor', 'middle')
418
+ .text(d => `#${d.id}`);
419
+
420
+ sampleBoxes.append('text')
421
+ .attr('class', 'sample-answer')
422
+ .attr('x', d => d.width / 2)
423
+ .attr('y', 35)
424
+ .attr('text-anchor', 'middle')
425
+ .attr('font-size', '14px')
426
+ .attr('font-weight', '700')
427
+ .text(d => d.answer);
428
+
429
+ sampleBoxes.append('text')
430
+ .attr('class', 'sample-label')
431
+ .attr('x', d => d.width / 2)
432
+ .attr('y', 50)
433
+ .attr('text-anchor', 'middle')
434
+ .attr('font-size', '10px')
435
+ .text(d => d.correct ? '✓' : '✗');
436
+
437
+ // Draw metric boxes
438
+ const metricGroup = g.append('g').attr('class', 'metrics');
439
+
440
+ const metricBoxes = metricGroup.selectAll('.metric')
441
+ .data(metricNodes)
442
+ .join('g')
443
+ .attr('class', 'metric')
444
+ .attr('transform', d => `translate(${d.x},${d.y})`);
445
+
446
+ metricBoxes.append('rect')
447
+ .attr('class', 'metric-box')
448
+ .attr('width', d => d.width)
449
+ .attr('height', d => d.height)
450
+ .attr('rx', 8)
451
+ .attr('fill', colors.metric)
452
+ .attr('fill-opacity', 0.35)
453
+ .attr('stroke', colors.metric)
454
+ .style('cursor', 'pointer')
455
+ .on('mouseenter', function(event, d) {
456
+ tipInner.textContent = d.explanation;
457
+ tip.style.opacity = '1';
458
+ const [mx, my] = d3.pointer(event, container);
459
+ tip.style.transform = `translate(${mx + 10}px, ${my + 10}px)`;
460
+ })
461
+ .on('mouseleave', function() {
462
+ tip.style.opacity = '0';
463
+ tip.style.transform = 'translate(-9999px, -9999px)';
464
+ });
465
+
466
+ metricBoxes.append('text')
467
+ .attr('class', 'metric-label')
468
+ .attr('x', d => d.width / 2)
469
+ .attr('y', 18)
470
+ .attr('text-anchor', 'middle')
471
+ .text(d => d.label);
472
+
473
+ metricBoxes.append('text')
474
+ .attr('class', 'metric-description')
475
+ .attr('x', d => d.width / 2)
476
+ .attr('y', 32)
477
+ .attr('text-anchor', 'middle')
478
+ .text(d => d.description);
479
+
480
+ metricBoxes.append('text')
481
+ .attr('class', 'metric-result')
482
+ .attr('x', d => d.width / 2)
483
+ .attr('y', 56)
484
+ .attr('text-anchor', 'middle')
485
+ .attr('fill', colors.metric)
486
+ .text(d => d.result);
487
+ }
488
+
489
+ render();
490
+
491
+ // Responsive handling
492
+ if (window.ResizeObserver) {
493
+ const ro = new ResizeObserver(() => render());
494
+ ro.observe(container);
495
+ } else {
496
+ window.addEventListener('resize', render);
497
+ }
498
+ };
499
+
500
+ if (document.readyState === 'loading') {
501
+ document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
502
+ } else {
503
+ ensureD3(bootstrap);
504
+ }
505
+ })();
506
+ </script>