Clémentine committed
Commit 7ccc792 · 1 Parent(s): 6ef2a16
app/src/content/article.mdx CHANGED
@@ -91,8 +91,6 @@ Best (but rarest) metrics are functional or based on rule based verifiers (thoug
 
 ## Creating your own evaluation
 
-
-
 <DesigningAutomaticEvaluation />
 
 
app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx CHANGED
@@ -20,8 +20,6 @@ When aggregating datasets, pay attention to whether
 
 <Sidenote>Examples: MMLU, Big-Bench (hundreds of diverse tasks), and HELM (combines multiple existing benchmarks for holistic evaluation)</Sidenote>
 
- #### Creating a dataset manually
-
 <UsingHumanAnnotators />
 
 #### Creating a dataset synthetically
@@ -45,33 +43,6 @@ Once this is done, you can do an automatic validation by using a model from a di
 No matter how tempting it is to do everything automatically, you should always check your data at every step, to make sure your evaluations are high quality. Evaluation is the name of the game and you need to use extremely good data.
 </Note>
 
- #### Choosing a prompt
- The prompt is going to define:
- - how much information is given to your model about the task
- - how this information is presented to your model.
-
- A prompt for a general MCQA or QA is usually made of some of the following:
- - a task prompt (optional): introduces your task.
- - a context: provides additional context for your question.
-   - *Eg: For a summarization or information extraction task, you could provide a content source*
- - a question: the actual core of your prompt.
- - in case of a multi choice evaluation, you can add options
- - connector words (`Question`, `Context`, `Choice`, ...)
-
- When defining your prompt, you need to be aware that:
- - even small changes in semantically equivalent prompts can make the results vary by quite a lot (see Section `Different prompt` in [Troubleshooting reproducibility](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/troubleshooting/troubleshooting-reproducibility.md)), and prompt formats might advantage or disadvantage specific models
-   - How to mitigate this:
-     - A costly way is to re-run the evaluation several times with prompt variations
-     - A less costly way is to run your evaluation once using a range of prompt formats allocated to different samples of equivalent difficulty
- - you can provide examples to your model to help it follow the expected format (using few-shot examples), and adding connector words helps this overall
- - for a number of metrics, you want a very constrained generation or output.
-
- <Note title="Models can overfit prompt formats" emoji="⚠️" variant="warning">
-
- Recent research shows models can overfit specific prompt formats rather than learning the underlying task. [This paper](https://arxiv.org/abs/2407.07890) is great on the topic, showing notably how some models can be over-evaluated because they have overfitted the test set **format**.
- On the Open LLM Leaderboard 2, we've notably observed that Llama 3.2 and Qwen 2.5 are no longer following the format of the prompt provided in a few-shot setup for this reason.
- </Note>
-
 #### Managing contamination
 In general, you should assume that a dataset publicly available on the internet is or will be contaminated.
 
@@ -83,6 +54,21 @@ Solutions to mitigate this include:
 
 However, even if a dataset is contaminated, it can still be interesting and have signal during training, as we saw in the ablations section.
 
+ ### Choosing a prompt
+ The prompt is going to define how much information is given to your model about the task, and how this information is presented to the model.
+
+ A prompt for a general MCQA or QA is usually made of some of the following:
+ - a task prompt (optional): introduces your task.
+ - a context: provides additional context for your question.
+   - *Eg: For a summarization or information extraction task, you could provide a content source*
+ - a question: the actual core of your prompt.
+ - in case of a multi choice evaluation, you can add options
+ - connector words (`Question`, `Context`, `Choice`, ...)
+
+ When defining your prompt, you need to be aware that even small changes in semantically equivalent prompts can make the results vary by quite a lot, and prompt formats might advantage or disadvantage specific models (see [this section](https://huggingface.co/spaces/OpenEvals/evaluation-guidebook#different-prompt)).
+
+ ➡️ This can be mitigated by re-running the evaluation several times with prompt variations (but it can be costly), or simply running your evaluation once using a range of prompt formats allocated to different samples of equivalent difficulty.
+ ➡️ You can also provide examples to your model to help it follow the expected format (using few-shot examples), and adding connector words helps this overall.
 
 ### Choosing an inference method for your model
 You'll need to choose what kind of inference method you need.
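As an aside on the "Choosing a prompt" section added above, here is a minimal sketch of how such an MCQA prompt could be assembled from the listed pieces (task prompt, optional context, question, options, connector words, few-shot examples). This is illustrative Python only, not code from the guidebook; every function and field name is made up.

```python
# Illustrative only: assemble an MCQA prompt from a task prompt, optional context,
# a question, lettered options, and connector words ("Context", "Question", "Answer").

def format_sample(question, choices, context=None):
    lines = []
    if context:
        lines.append(f"Context: {context}")        # connector word + context
    lines.append(f"Question: {question}")          # connector word + question
    for letter, choice in zip("ABCD", choices):    # options for a multi-choice task
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")                        # cue for a constrained answer
    return "\n".join(lines)

def build_mcqa_prompt(question, choices, context=None, task_prompt=None, few_shot=()):
    parts = []
    if task_prompt:                                # optional task introduction
        parts.append(task_prompt)
    for ex in few_shot:                            # few-shot examples help the model follow the format
        parts.append(format_sample(ex["question"], ex["choices"], ex.get("context")) + " " + ex["answer"])
    parts.append(format_sample(question, choices, context))
    return "\n\n".join(parts)
```

Running the same samples through a few such format variants (for example `A.` vs `(A)` vs bare options) is one cheap way to surface the prompt-format sensitivity mentioned above.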
app/src/content/chapters/general-knowledge/model-inference-and-evaluation.mdx CHANGED
@@ -113,24 +113,28 @@ Different tokenizers behave differently with spacing and special tokens. See thi
 
 When looking at an MCQA evaluation, in general, you want to tokenize the context together with the choices, as it creates a succession of tokens which is likely/natural for the model.
 
- However, some tokenizers (like the [Llama one](https://github.com/EleutherAI/lm-evaluation-harness/pull/531#issuecomment-1595586257)) do not satisfy `enc(context + choice) = enc(context) + enc(choice)` (and add or remove spacing). This means that comparing the logprobabilities of the choices is not easy, as the context tokens can "bleed out" into them, messing up the comparison.
-
- <Sidenote>
-
- The [Llama tokenizer](https://github.com/EleutherAI/lm-evaluation-harness/pull/531#issuecomment-1595586257) doesn't satisfy `enc(context + choice) = enc(context) + enc(choice)`, making log probability comparisons tricky. Tokenize separately and concatenate, removing special tokens.
- </Sidenote>
-
- So if this is the case for your model, you might want to compute the tokens of context and choice separately and then concatenate them after removing the special start/end of sentence tokens which might have been added.
+ <Note title="Should you tokenize the context with the choices always?">
+ Some tokenizers (like the [Llama one](https://github.com/EleutherAI/lm-evaluation-harness/pull/531#issuecomment-1595586257)) do not satisfy `enc(context + choice) = enc(context) + enc(choice)` (and add or remove spacing). This means that comparing the logprobabilities of the choices only is not trivial, as the context tokens can "bleed out" into them, messing up the comparison.
+
+ To give a concrete example, say you have characters `C1`, `C2`, and `C3` as base tokens of your vocabulary, and `C1C2` also happens to be a single token learned during BPE.
+
+ Say your context is C1, and the choices C2 and C3.
+ If you tokenize the context with the choices, you compare `C1C2` (one token) with `C1+C3` (two tokens). Even if you normalize the logprobs by length, you are not comparing the same thing.
+ Comparing after tokenizing the context and choices separately means you compare `C1+C2` and `C1+C3`. But since `C1C2` is a token, the occurrence of `C1+C2` is likely rare in the data your encoder saw, so it is an unlikely succession for your model, which can mess up your logprobabilities.
+
+ If this is the case for your model, the solution is usually to go for the least bad option and compare what is comparable: compute the tokens of context and choice separately, then concatenate them after removing the special start/end of sentence tokens which might have been added.
+ </Note>
 
+
 **Paying attention to start and end of sentence tokens**
 
- Some models, like the `Gemma` ones, are extremely sensitive to the [inclusion of start of sentence tokens](https://github.com/EleutherAI/lm-evaluation-harness/pull/1465) at inference. You might need to do a couple of experiments to see if that happens for you, and add these tokens manually when evaluating.
+ Some pretrained models, like the `Gemma` ones, are extremely sensitive to the [inclusion of start of sentence tokens](https://github.com/EleutherAI/lm-evaluation-harness/pull/1465) at inference. You might need to do a couple of experiments to see if that happens for you, and add these tokens manually when evaluating.
 
 You can also encounter some issues where your model won't stop on an end of sentence token like you would expect (for example, on `\n`), because your model will not predict this token alone but included in a higher-level token (for example, `\n\n`, which can be a single token, especially for code models). In this case, you might need to add a specific check to "backtrack" on generated text to make sure you're cutting your generated sentence at the proper spot before computing metrics.
 
 **Multilinguality and tokenization**
 
- When looking at multilingual evaluations, you'll also need to see how to tokenize your text, depending on your evaluation task and metrics. As some languages do not always use spacing as a word separator (Korean, Thai, Japanese, Chinese, to cite a few), they will require language-specific tokenizers to be split properly, else it will affect their scores on metrics such as [BLEU](https://github.com/EleutherAI/lm-evaluation-harness/issues/212), F1 scores, etc.
+ When looking at multilingual evaluations, you'll also need to see how to tokenize your text, depending on your evaluation task and metrics. As some languages do not always use spacing as a word separator (Korean, Thai, Japanese, Chinese, to cite a few), they will require language-specific tokenizers to be split properly, else it will affect their scores on metrics such as [BLEU](https://github.com/EleutherAI/lm-evaluation-harness/issues/212), F1 scores, etc. The number of tokens that the model is allowed to generate for an evaluation should also be language-dependent, as not all languages are tokenized into a similar number of tokens (go back to the tokenization section to see why).
 
 **Code evaluations and end of sentence tokens**
 
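To make the tokenization caveat above concrete, here is a small Python sketch assuming the Hugging Face `transformers` tokenizer API; the model name and example strings are placeholders, and this is not code from the article.

```python
# Sketch: check whether enc(context + choice) == enc(context) + enc(choice),
# and fall back to tokenizing separately (without special tokens) when it does not hold.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder model name

context = "Question: What is the capital of France?\nAnswer:"
choices = [" Paris", " London", " Berlin"]

for choice in choices:
    joint = tok.encode(context + choice, add_special_tokens=False)
    split = tok.encode(context, add_special_tokens=False) + tok.encode(choice, add_special_tokens=False)
    if joint != split:
        print(f"enc(context + choice) != enc(context) + enc(choice) for {choice!r}")
    # The "least bad" comparable input: context and choice tokenized separately,
    # without special tokens, then concatenated. The model is then asked for the
    # logprobs of the last len(choice_ids) tokens for each choice.
    choice_ids = tok.encode(choice, add_special_tokens=False)
    input_ids = tok.encode(context, add_special_tokens=False) + choice_ids
```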
app/src/content/embeds/d3-evaluation-decision-tree.html ADDED
@@ -0,0 +1,333 @@
+ <div class="d3-evaluation-tree"></div>
+ <style>
+ .d3-evaluation-tree {
+   position: relative;
+   width: 100%;
+   min-height: 500px;
+   overflow: visible;
+ }
+ .d3-evaluation-tree svg {
+   display: block;
+   width: 100%;
+   height: auto;
+ }
+ .d3-evaluation-tree .node-rect {
+   stroke-width: 2;
+   rx: 8;
+   ry: 8;
+   cursor: pointer;
+   transition: all 0.2s ease;
+ }
+ .d3-evaluation-tree .decision-node {
+   stroke: var(--border-color);
+ }
+ .d3-evaluation-tree .result-node {
+   stroke: var(--border-color);
+ }
+ .d3-evaluation-tree .warning-node {
+   stroke: var(--border-color);
+ }
+ .d3-evaluation-tree .node-text {
+   fill: var(--text-color);
+   font-size: 12px;
+   font-weight: 500;
+   pointer-events: none;
+   user-select: none;
+ }
+ .d3-evaluation-tree .link {
+   fill: none;
+   stroke: var(--border-color);
+   stroke-width: 1.5;
+   opacity: 0.5;
+ }
+ .d3-evaluation-tree .link-label {
+   fill: var(--muted-color);
+   font-size: 10px;
+   font-weight: 500;
+ }
+ .d3-evaluation-tree .node-rect:hover {
+   filter: brightness(1.05);
+   stroke-width: 3;
+ }
+ .d3-evaluation-tree .d3-tooltip {
+   position: absolute;
+   top: 0;
+   left: 0;
+   transform: translate(-9999px, -9999px);
+   pointer-events: none;
+   padding: 8px 10px;
+   border-radius: 8px;
+   font-size: 12px;
+   line-height: 1.35;
+   border: 1px solid var(--border-color);
+   background: var(--surface-bg);
+   color: var(--text-color);
+   box-shadow: 0 4px 24px rgba(0,0,0,.18);
+   opacity: 0;
+   transition: opacity .12s ease;
+   max-width: 250px;
+ }
+ </style>
+ <script>
+ (() => {
+   const ensureD3 = (cb) => {
+     if (window.d3 && typeof window.d3.select === 'function') return cb();
+     let s = document.getElementById('d3-cdn-script');
+     if (!s) {
+       s = document.createElement('script');
+       s.id = 'd3-cdn-script';
+       s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
+       document.head.appendChild(s);
+     }
+     const onReady = () => {
+       if (window.d3 && typeof window.d3.select === 'function') cb();
+     };
+     s.addEventListener('load', onReady, { once: true });
+     if (window.d3) onReady();
+   };
+
+   const bootstrap = () => {
+     const scriptEl = document.currentScript;
+     let container = scriptEl ? scriptEl.previousElementSibling : null;
+     if (!(container && container.classList && container.classList.contains('d3-evaluation-tree'))) {
+       const candidates = Array.from(document.querySelectorAll('.d3-evaluation-tree'))
+         .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
+       container = candidates[candidates.length - 1] || null;
+     }
+     if (!container) return;
+     if (container.dataset) {
+       if (container.dataset.mounted === 'true') return;
+       container.dataset.mounted = 'true';
+     }
+
+     // Tooltip setup
+     container.style.position = container.style.position || 'relative';
+     let tip = container.querySelector('.d3-tooltip');
+     let tipInner;
+     if (!tip) {
+       tip = document.createElement('div');
+       tip.className = 'd3-tooltip';
+       tipInner = document.createElement('div');
+       tipInner.className = 'd3-tooltip__inner';
+       tipInner.style.textAlign = 'left';
+       tip.appendChild(tipInner);
+       container.appendChild(tip);
+     } else {
+       tipInner = tip.querySelector('.d3-tooltip__inner') || tip;
+     }
+
+     // Get colors from ColorPalettes with fallback
+     const getColors = () => {
+       if (window.ColorPalettes && window.ColorPalettes.getColors) {
+         return {
+           decision: window.ColorPalettes.getColors('sequential', 3)[0],
+           result: window.ColorPalettes.getColors('sequential', 3)[2],
+           warning: window.ColorPalettes.getColors('diverging', 3)[1]
+         };
+       }
+       // Fallback colors
+       return {
+         decision: '#60A5FA',
+         result: '#34D399',
+         warning: '#FBBF24'
+       };
+     };
+
+     // Define the decision tree structure
+     const treeData = {
+       name: "What are you\nevaluating?",
+       type: "decision",
+       tooltip: "Starting point: Identify your evaluation task",
+       children: [
+         {
+           name: "Have gold\nstandard?",
+           edgeLabel: "Start",
+           type: "decision",
+           tooltip: "Do you have a clear, correct reference answer?",
+           children: [
+             {
+               name: "Objective &\nverifiable?",
+               edgeLabel: "Yes",
+               type: "decision",
+               tooltip: "Is the answer factual and unambiguous?",
+               children: [
+                 {
+                   name: "Format\nconstrained?",
+                   edgeLabel: "Yes",
+                   type: "decision",
+                   tooltip: "Can you verify output structure programmatically?",
+                   children: [
+                     {
+                       name: "Functional\nTesting",
+                       edgeLabel: "Yes",
+                       type: "result",
+                       tooltip: "Use IFEval-style functional tests or unit tests"
+                     },
+                     {
+                       name: "Automated\nMetrics",
+                       edgeLabel: "No",
+                       type: "result",
+                       tooltip: "Use exact match, F1, BLEU, etc."
+                     }
+                   ]
+                 }
+               ]
+             },
+             {
+               name: "Human Eval\nor Judges",
+               edgeLabel: "Subjective",
+               type: "warning",
+               tooltip: "Multiple valid answers exist; need human judgment or model judges"
+             }
+           ]
+         },
+         {
+           name: "Budget &\nscale?",
+           edgeLabel: "No gold",
+           type: "decision",
+           tooltip: "No reference answer available",
+           children: [
+             {
+               name: "Expert Human\nAnnotators",
+               edgeLabel: "High",
+               type: "result",
+               tooltip: "Best for critical use cases (medical, legal)"
+             },
+             {
+               name: "Model Judges\n(validate!)",
+               edgeLabel: "Medium",
+               type: "warning",
+               tooltip: "Validate judge quality against human baseline"
+             },
+             {
+               name: "Arena or\nVibe-checks",
+               edgeLabel: "Low",
+               type: "warning",
+               tooltip: "Crowdsourced or exploratory evaluation"
+             }
+           ]
+         }
+       ]
+     };
+
+     // SVG setup
+     const svg = d3.select(container).append('svg');
+     const g = svg.append('g').attr('transform', 'translate(40, 30)');
+
+     let width = container.clientWidth || 900;
+     const nodeWidth = 140;
+     const nodeHeight = 50;
+
+     function render() {
+       const colors = getColors();
+       width = container.clientWidth || 900;
+
+       const treeLayout = d3.tree()
+         .size([width - 80, 500])
+         .separation((a, b) => (a.parent === b.parent ? 1.3 : 1.6));
+
+       const root = d3.hierarchy(treeData);
+       const treeNodes = treeLayout(root);
+
+       const maxDepth = root.height;
+       const height = (maxDepth + 1) * 120 + 60;
+
+       svg.attr('viewBox', `0 0 ${width} ${height}`)
+         .attr('preserveAspectRatio', 'xMidYMin meet');
+
+       // Clear previous
+       g.selectAll('*').remove();
+
+       // Links
+       g.selectAll('.link')
+         .data(treeNodes.links())
+         .join('path')
+         .attr('class', 'link')
+         .attr('d', d3.linkVertical()
+           .x(d => d.x)
+           .y(d => d.y)
+         );
+
+       // Link labels
+       g.selectAll('.link-label')
+         .data(treeNodes.links().filter(d => d.target.data.edgeLabel))
+         .join('text')
+         .attr('class', 'link-label')
+         .attr('x', d => d.target.x)
+         .attr('y', d => (d.source.y + d.target.y) / 2 - 5)
+         .attr('text-anchor', 'middle')
+         .text(d => d.target.data.edgeLabel);
+
+       // Node groups
+       const nodes = g.selectAll('.node')
+         .data(treeNodes.descendants())
+         .join('g')
+         .attr('class', 'node')
+         .attr('transform', d => `translate(${d.x},${d.y})`)
+         .on('mouseenter', function(event, d) {
+           if (d.data.tooltip) {
+             const [mx, my] = d3.pointer(event, container);
+             tip.style.opacity = '1';
+             tip.style.transform = `translate(${mx + 10}px, ${my - 10}px)`;
+             tipInner.textContent = d.data.tooltip;
+           }
+         })
+         .on('mouseleave', function() {
+           tip.style.opacity = '0';
+           tip.style.transform = 'translate(-9999px, -9999px)';
+         });
+
+       // Rectangles
+       nodes.append('rect')
+         .attr('class', d => {
+           if (d.data.type === 'result') return 'node-rect result-node';
+           if (d.data.type === 'warning') return 'node-rect warning-node';
+           return 'node-rect decision-node';
+         })
+         .attr('x', -nodeWidth / 2)
+         .attr('y', -nodeHeight / 2)
+         .attr('width', nodeWidth)
+         .attr('height', nodeHeight)
+         .attr('fill', d => {
+           if (d.data.type === 'result') return colors.result;
+           if (d.data.type === 'warning') return colors.warning;
+           return colors.decision;
+         });
+
+       // Text (multiline support)
+       nodes.each(function(d) {
+         const nodeG = d3.select(this);
+         const lines = d.data.name.split('\n');
+         const lineHeight = 14;
+         const startY = -(lines.length - 1) * lineHeight / 2;
+
+         lines.forEach((line, i) => {
+           nodeG.append('text')
+             .attr('class', 'node-text')
+             .attr('text-anchor', 'middle')
+             .attr('y', startY + i * lineHeight)
+             .attr('dy', '0.35em')
+             .text(line);
+         });
+       });
+     }
+
+     // Initial render
+     render();
+
+     // Responsive resize
+     if (window.ResizeObserver) {
+       const ro = new ResizeObserver(() => render());
+       ro.observe(container);
+     } else {
+       window.addEventListener('resize', render);
+     }
+   };
+
+   if (document.readyState === 'loading') {
+     document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
+   } else {
+     ensureD3(bootstrap);
+   }
+ })();
+ </script>