evaluation-guidebook

Running

App Files Files Community

Clémentine commited on 11 days ago

Commit

49f71ca

1 Parent(s): adc672a

wip

Browse files

Files changed (17) hide show

app/src/content/article.mdx +17 -51
app/src/content/assets/finetasks/code.js +245 -0
app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx +400 -58
app/src/content/chapters/automated-benchmarks/tips-and-tricks.mdx +0 -73
app/src/content/chapters/{2025-evaluations-for-useful-models.mdx → general-knowledge/2025-evaluations-for-useful-models.mdx} +2 -2
app/src/content/chapters/{picking-your-evaluation.mdx → general-knowledge/picking-your-evaluation.mdx} +9 -75
app/src/content/chapters/human-evaluation/basics.mdx +0 -88
app/src/content/chapters/human-evaluation/tips-and-tricks.mdx +0 -49
app/src/content/chapters/human-evaluation/using-human-annotators.mdx +31 -2
app/src/content/chapters/model-as-a-judge/basics.mdx +0 -50
app/src/content/chapters/model-as-a-judge/designing-your-evaluation-prompt.mdx +0 -81
app/src/content/chapters/model-as-a-judge/evaluating-your-evaluator.mdx +0 -61
app/src/content/chapters/model-as-a-judge/getting-a-judge-llm.mdx +0 -78
app/src/content/chapters/model-as-a-judge/tips-and-tricks.mdx +0 -51
app/src/content/chapters/model-as-a-judge/what-about-reward-models.mdx +0 -85
app/src/content/chapters/troubleshooting/troubleshooting-inference.mdx +11 -0
app/src/content/chapters/troubleshooting/troubleshooting-reproducibility.mdx +29 -48

app/src/content/article.mdx CHANGED Viewed

@@ -22,24 +22,14 @@ import HtmlEmbed from "../components/HtmlEmbed.astro";
 import Intro from "./chapters/intro.mdx";
 import DesigningAutomaticEvaluation from "./chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx";
-import PickingYourEval from "./chapters/picking-your-evaluation.mdx";
-import EvalsIn2025 from "./chapters/2025-evaluations-for-useful-models.mdx"
-import AutomatedBenchmarksTips from "./chapters/automated-benchmarks/tips-and-tricks.mdx";
-import HumanEvaluationBasics from "./chapters/human-evaluation/basics.mdx";
-import UsingHumanAnnotators from "./chapters/human-evaluation/using-human-annotators.mdx";
-import HumanEvaluationTips from "./chapters/human-evaluation/tips-and-tricks.mdx";
-import ModelAsJudgeBasics from "./chapters/model-as-a-judge/basics.mdx";
-import GettingJudgeLLM from "./chapters/model-as-a-judge/getting-a-judge-llm.mdx";
-import DesigningEvaluationPrompt from "./chapters/model-as-a-judge/designing-your-evaluation-prompt.mdx";
-import EvaluatingYourEvaluator from "./chapters/model-as-a-judge/evaluating-your-evaluator.mdx";
-import WhatAboutRewardModels from "./chapters/model-as-a-judge/what-about-reward-models.mdx";
-import ModelAsJudgeTips from "./chapters/model-as-a-judge/tips-and-tricks.mdx";
 import TroubleshootingInference from "./chapters/troubleshooting/troubleshooting-inference.mdx";
 import TroubleshootingReproducibility from "./chapters/troubleshooting/troubleshooting-reproducibility.mdx";
 import ModelInferenceAndEvaluation from "./chapters/general-knowledge/model-inference-and-evaluation.mdx";
- https://x.com/sasuke___420/status/1984168256568226286
 <Intro />
@@ -49,19 +39,23 @@ Now that you have an idea of why evaluation is important, and how it's done, let
 <ModelInferenceAndEvaluation />
-## Doing evaluations with existing benchmarks
-### State of evaluations in 2025
 <EvalsIn2025 />
-### Understanding what's in an eval
-Ok, we made a list of benchmarks, now what? Well, now you need to check if these benchmarks are relevant for you and your specific use cases (unless you just want to compare your model to other models, in which case you can skim and go to the next section).
-The first and most important step is, and always will be, to look at the data. You want to study the following.
-#### Creation process
 - **Who created the actual samples?**
 Ideally, you want dataset created by experts, then next tier is paid annotators, then crowdsourced, then synthetic, then MTurked. You also want to look for a data card, where you'll find annotator demographics - this can be important to understand the dataset language diversity, or potential cultural bias.
@@ -72,7 +66,7 @@ This is especially important for datasets with the help of underpaid annotators
 - **Were the annotators provided with clear data creation guidelines?**
 In other words, is your dataset consistent?
-#### Samples
 Take 50 random samples and manually inspect them; and I mean do it yourself, not "prompt an LLM to find unusual stuff in the data for you".
 First, you want to check the content quality. Are the prompts clear and unambiguous? Are the answers correct? (*Eg: TriviaQA contains several gold answers (aliases field) per question, sometimes conflicting.*) Is information missing? (*Eg: MMLU misses reference schematics in a number of questions.*) It's important to keep in mind that it's not because a dataset is a standard that it's a good one - and this happens because most people skip this step.
@@ -91,52 +85,24 @@ You want to check what metrics are used: are they automatic, functional, or usin
 Best (but rarest) metrics are functional or based on rule based verifiers (though beware of pass/fail for coding models and code evaluations, as recent LLMs have become very good at overwriting globals to 'cheat' on such tests, especially in languages like Python where you can mess up variable scope).
-### Troubleshooting reproducibility
 <TroubleshootingReproducibility />
-### Selecting good evaluations for ablations
-<PickingYourEval />
-For these ablations, it's good to focus on tasks that give good early signal and avoid noisy benchmarks. In [FineTasks](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fine-tasks) and [FineWeb2](https://arxiv.org/pdf/2506.20920), reliable evaluation tasks are defined by four key principles:
--  **Monotonicity:**  The benchmark scores should consistently improve as models train longer.
--  **Low noise:**  When we train models with the same setup but different random seeds, the benchmark scores shouldn't vary wildly.
--  **Above-random performance:**  Many capabilities only emerge later in training, so tasks that show random-level performance for extended periods aren't useful for ablations. This is the case, for example, for MMLU in multiple choice format as we will explain later.
--  **Ranking consistency:**  If one approach outperforms another at early stages, this ordering should remain stable as training continues.
-## So you want to create your own evaluation
-### Automated Benchmarks
 <DesigningAutomaticEvaluation />
-<AutomatedBenchmarksTips />
-### Human Evaluations
-<HumanEvaluationBasics />
-<UsingHumanAnnotators />
-<HumanEvaluationTips />
-### Model judges
 https://x.com/Kangwook_Lee/status/1993438649963164121
-<ModelAsJudgeBasics />
-<GettingJudgeLLM />
-<DesigningEvaluationPrompt />
-<EvaluatingYourEvaluator />
-<WhatAboutRewardModels />
-<ModelAsJudgeTips />
 <TroubleshootingInference />

 import Intro from "./chapters/intro.mdx";
 import DesigningAutomaticEvaluation from "./chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx";
+import PickingYourEval from "./chapters/general-knowledge/picking-your-evaluation.mdx";
+import EvalsIn2025 from "./chapters/general-knowledge/2025-evaluations-for-useful-models.mdx"
 import TroubleshootingInference from "./chapters/troubleshooting/troubleshooting-inference.mdx";
 import TroubleshootingReproducibility from "./chapters/troubleshooting/troubleshooting-reproducibility.mdx";
 import ModelInferenceAndEvaluation from "./chapters/general-knowledge/model-inference-and-evaluation.mdx";
+- https://arxiv.org/abs/2109.02550
+- https://arxiv.org/abs/2511.21140
 <Intro />
 <ModelInferenceAndEvaluation />
+## Evaluating with existing benchmarks
+### Benchmarks to know in 2025
 <EvalsIn2025 />
+### Selecting good benchmarks automatically for model training
+<PickingYourEval />
+### Understanding what's in there
+No matter how you selected your initial datasets, the most important step is, and always will be, to look at the data, both what you have, what the model generates, and its scores. In the end, that's the only way you'll see if your evaluations are actually relevant for your specific use case.
+You want to study the following.
+#### Data creation process
 - **Who created the actual samples?**
 Ideally, you want dataset created by experts, then next tier is paid annotators, then crowdsourced, then synthetic, then MTurked. You also want to look for a data card, where you'll find annotator demographics - this can be important to understand the dataset language diversity, or potential cultural bias.
 - **Were the annotators provided with clear data creation guidelines?**
 In other words, is your dataset consistent?
+#### Samples inspection
 Take 50 random samples and manually inspect them; and I mean do it yourself, not "prompt an LLM to find unusual stuff in the data for you".
 First, you want to check the content quality. Are the prompts clear and unambiguous? Are the answers correct? (*Eg: TriviaQA contains several gold answers (aliases field) per question, sometimes conflicting.*) Is information missing? (*Eg: MMLU misses reference schematics in a number of questions.*) It's important to keep in mind that it's not because a dataset is a standard that it's a good one - and this happens because most people skip this step.
 Best (but rarest) metrics are functional or based on rule based verifiers (though beware of pass/fail for coding models and code evaluations, as recent LLMs have become very good at overwriting globals to 'cheat' on such tests, especially in languages like Python where you can mess up variable scope).
+### So, you can't reproduce reported model scores?
 <TroubleshootingReproducibility />
+## Creating your own evaluation
 <DesigningAutomaticEvaluation />
 https://x.com/Kangwook_Lee/status/1993438649963164121
 <TroubleshootingInference />

app/src/content/assets/finetasks/code.js ADDED Viewed

	@@ -0,0 +1,245 @@

+import Papa from 'papaparse';
+import { DataTable } from 'simple-datatables';
+const languageMap = {
+  'Arabic': 'ar',
+  'Turkish': 'tr',
+  'Swahili': 'sw',
+  'Russian': 'ru',
+  'Telugu': 'te',
+  'Thai': 'th',
+  'Chinese': 'zh',
+  'French': 'fr',
+  'Hindi': 'hi',
+};
+const metricTypes = [
+  { value: 'max_score', label: 'Max Score' },
+  { value: 'avg_snr', label: 'Low Noise' },
+  { value: 'avg_spearman', label: 'Monotonicity' },
+  { value: 'max_n_std', label: 'Non-Randomness' },
+  { value: 'avg_kendall_tau_a', label: 'Ordering Consistency' }
+];
+const tableTypes = [
+  { value: 'gen', label: 'Generative' },
+  { value: 'mc', label: 'Multichoice' }
+];
+const taskFolders = [
+  { value: 'selected', label: 'FineTasks' },
+  { value: 'non_selected', label: 'Non-Selected' }
+];
+function createDropdown(options, onChange) {
+  const select = document.createElement('select');
+  options.forEach(option => {
+    const optionElement = document.createElement('option');
+    if (typeof option === 'object' && option.value && option.label) {
+      optionElement.value = option.value;
+      optionElement.textContent = option.label;
+    } else {
+      optionElement.value = option;
+      optionElement.textContent = option;
+    }
+    select.appendChild(optionElement);
+  });
+  select.addEventListener('change', onChange);
+  return select;
+}
+function createPerTaskResultsTable(data, tableType, metric) {
+  const tableWrapper = document.createElement('div');
+  tableWrapper.className = 'table-wrapper fine-tasks-table-wrapper';
+  const table = document.createElement('table');
+  table.className = 'results-table fine-tasks-results-table';
+  const columns = ['Task', 'Type', ...(tableType === 'gen' ? ['f1', 'prefix_match'] : ['acc', 'acc_norm', 'acc_norm_token', 'acc_norm_pmi'])];
+  const columnNameMap = {
+    // 'Task': 'Task',
+    // 'Type': 'Type',
+    // 'f1': 'f1',
+    // 'prefix_match': 'prefix_match',
+    // 'acc': 'acc',
+    'acc_norm': 'acc_char',
+    'acc_norm_token': 'acc_token',
+    'acc_norm_pmi': 'acc_pmi',
+    'prefix_match': 'prefix'
+  };
+  const taskMetricMap = {
+    'max_score': 'score',
+    'avg_snr': 'snr',
+    'avg_spearman': 'monotonicity',
+    'max_n_std': 'non-randomness',
+    'avg_kendall_tau_a': 'ordering'
+    // 'avg_spearman': 'monotonicity',
+  }
+  const header = table.createTHead();
+  const headerRow = header.insertRow();
+  columns.forEach(column => {
+    const th = document.createElement('th');
+    th.textContent = columnNameMap[column] || column;
+    if (th.textContent !== "Task" && th.textContent !== "Type") {
+        th.textContent += " " + (taskMetricMap[metric] || metric);
+    }
+    th.title = th.textContent;
+    if (column === 'Type')
+      th.style.width = '40px';
+    headerRow.appendChild(th);
+  });
+  const body = table.createTBody();
+  data.forEach(row => {
+    if (Object.values(row).every(value => value === '' || value === undefined || value === null)) {
+      return;
+    }
+    const tr = body.insertRow();
+    columns.forEach(column => {
+      const td = tr.insertCell();
+      let value = row[column];
+      if (column === 'Task') {
+        const fullTaskName = value; // Store the full task name
+        const parts = value.split('|');
+        value = parts.length > 1 ? parts[1] : value;
+        value = value.split('_mcf')[0].split('_cf')[0];
+        td.title = fullTaskName; // Set the title attribute to show the full name on hover
+      } else if (column === 'Type') {
+        // Keep the task type as is
+      } else if (typeof value === 'number') {
+        value = value.toFixed(2);
+      } else if (value && !isNaN(parseFloat(value))) {
+        value = parseFloat(value).toFixed(2);
+      } else {
+        value = '';
+      }
+      td.textContent = value;
+    });
+  });
+  tableWrapper.appendChild(table);
+  return tableWrapper;
+}
+export function initFineTasks(containerId) {
+  const container = document.getElementById(containerId);
+  if (!container) return;
+  const perTaskTitleElement = document.createElement('h3');
+  perTaskTitleElement.textContent = 'Task Results';
+  perTaskTitleElement.className = 'fine-tasks-title';
+  const perTaskTableContainer = document.createElement('div');
+  perTaskTableContainer.className = 'table-container';
+  let perTaskDataTable;
+  function updatePerTaskResults() {
+    const language = languageDropdownPerTask.value;
+    const metric = metricDropdownPerTask.value;
+    const tableType = tableTypeDropdownPerTask.value;
+    const taskFolder = taskFolderDropdownPerTask.value;
+    const languageCode = languageMap[language];
+    if (!languageCode) {
+      console.error(`Language code not found for ${language}`);
+      perTaskTableContainer.innerHTML = `<p>Error: Language code not found for ${language}</p>`;
+      return;
+    }
+    let url = `data/tasks/${taskFolder}/${languageCode}/${metric}/${tableType}_stats.csv`;
+    fetch(url)
+      .then(response => {
+        if (!response.ok) {
+          throw new Error(`HTTP error! status: ${response.status}`);
+        }
+        return response.text();
+      })
+      .then(csvText => {
+        const results = Papa.parse(csvText, { header: true }).data;
+        perTaskTableContainer.innerHTML = '';
+        const tableWrapper = createPerTaskResultsTable(results, tableType, metric);
+        perTaskTableContainer.appendChild(tableWrapper);
+        if (perTaskDataTable) {
+          perTaskDataTable.destroy();
+        }
+        perTaskDataTable = new DataTable('.fine-tasks-results-table', {
+          perPage: 10,
+          perPageSelect: false,
+          searchable: false,
+          sortable: true,
+          fixedHeight: true,
+          labels: {
+            info: ''  // This removes the "Showing 1 to X of Y entries" text
+          }
+        });
+      })
+      .catch(error => {
+        console.error('Error fetching CSV:', error);
+        perTaskTableContainer.innerHTML = `<p>Error loading data: ${error.message}</p>`;
+      });
+  }
+  const perTaskControls = document.createElement('div');
+  perTaskControls.className = 'controls fine-tasks-controls';
+  // Task folder control group
+  const taskFolderControlGroup = document.createElement('div');
+  taskFolderControlGroup.className = 'control-group';
+  const taskFolderLabelPerTask = document.createElement('label');
+  taskFolderLabelPerTask.textContent = 'Task Set: ';
+  const taskFolderDropdownPerTask = createDropdown(taskFolders, updatePerTaskResults);
+  taskFolderDropdownPerTask.value = 'selected'; // Set default to FineTasks
+  taskFolderControlGroup.appendChild(taskFolderLabelPerTask);
+  taskFolderControlGroup.appendChild(taskFolderDropdownPerTask);
+  // Language control group
+  const languageControlGroup = document.createElement('div');
+  languageControlGroup.className = 'control-group';
+  const languageLabelPerTask = document.createElement('label');
+  languageLabelPerTask.textContent = 'Language: ';
+  const languageDropdownPerTask = createDropdown(Object.keys(languageMap), updatePerTaskResults);
+  languageControlGroup.appendChild(languageLabelPerTask);
+  languageControlGroup.appendChild(languageDropdownPerTask);
+  // Table type control group
+  const tableTypeControlGroup = document.createElement('div');
+  tableTypeControlGroup.className = 'control-group';
+  const tableTypeLabelPerTask = document.createElement('label');
+  tableTypeLabelPerTask.textContent = 'Type: ';
+  const tableTypeDropdownPerTask = createDropdown(tableTypes, updatePerTaskResults);
+  tableTypeControlGroup.appendChild(tableTypeLabelPerTask);
+  tableTypeControlGroup.appendChild(tableTypeDropdownPerTask);
+  // Metric control group
+  const metricControlGroup = document.createElement('div');
+  metricControlGroup.className = 'control-group';
+  const metricLabelPerTask = document.createElement('label');
+  metricLabelPerTask.textContent = 'Criteria: ';
+  const metricDropdownPerTask = createDropdown(metricTypes, updatePerTaskResults);
+  metricDropdownPerTask.value = 'max_score'; // Set default to Max Score
+  metricControlGroup.appendChild(metricLabelPerTask);
+  metricControlGroup.appendChild(metricDropdownPerTask);
+  perTaskControls.appendChild(taskFolderControlGroup);
+  perTaskControls.appendChild(languageControlGroup);
+  perTaskControls.appendChild(tableTypeControlGroup);
+  perTaskControls.appendChild(metricControlGroup);
+  container.appendChild(perTaskControls);
+  // container.appendChild(perTaskTitleElement);
+  container.appendChild(perTaskTableContainer);
+  // Initialize with default values
+  updatePerTaskResults();
+}

app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx CHANGED Viewed

@@ -4,45 +4,23 @@ title: "Designing your automatic evaluation"
 import Note from "../../../components/Note.astro";
 import Sidenote from "../../../components/Sidenote.astro";
-### Designing your automatic evaluation
-#### Selecting or creating a dataset
-For your evaluation, you can either select an existing dataset or design your own. Through this process, it's very important to keep in mind that **your evaluation result will only be as good as your evaluation dataset**.
-##### Designing your own
-You can go 3 ways when designing your own dataset.
-- **Aggregating existing data**: You can aggregate existing data from different sources, evaluating a relevant capability for your task. A number of evaluation datasets are for example constructed from aggregating human evaluation datasets (such as MATH, LSAT, etc). In this case, follow the steps above.
-- **Using human annotators**: There's a whole section on using human annotators in `Human evaluation`, see [Using human annotators](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/human-evaluation/using-human-annotators.md).
-<Note title="See also" emoji="👥" variant="info">
-For detailed guidance on using human annotators to create evaluation datasets, see the [Using human annotators](/human-evaluation/using-human-annotators) section.
-</Note>
 - **Using synthetic data from models**: On this, you can check the very cool [Cosmopedia](https://huggingface.co/blog/cosmopedia) blog by cool HF colleagues! It's mostly studying how to create a synthetic training dataset, but similar techniques can be used for evaluation. Make sure to manually check/filter/inspect your dataset afterwards (following the above steps).
 - **Using rule-based techniques**: If your task allows, this is a very good way to get a virtually infinite supply of samples and avoid contamination! For some examples, you can look at [NPHardEval](https://arxiv.org/abs/2312.14890), [DyVal](https://arxiv.org/abs/2309.17167), [MuSR](https://arxiv.org/abs/2310.16049), [BabiQA](https://arxiv.org/abs/1502.05698), etc.
-#### Choosing an inference method for your model
-You'll need to choose what kind of inference method you need.
-Using log-probabilities (MCQA, multi-choice question answer) is very good for multiple choice question answers (usually to test model knowledge, or ability to disambiguate).
-- Pros:
-	- Makes sure that all models have access to the correct answer
-	- Provides a proxy for model "confidence" (and calibration)
-	- Fast to evaluate, especially when we ask the model to predict only one token (A/B/C/D the indices of the choices, or Yes/No, etc).
-	- Allow to get signal on small models' task performance
-- Cons:
-	- Slightly over-scores small models which would have generated something outside of the range of available choices if given free rein.
-	- Some models [favor specific choices based on the order in which they have been presented](https://arxiv.org/abs/2309.03882), which could lead to unrepresentative evaluations
-Using generations (QA, question answering) is very good for any task where you want to test fluency, reasoning, or the ability of your model to actually answer questions.
-- Pros:
-	- Should actually correlates with LLM ability to generate fluent text, will most of the time be what people are actually interested in
-- Cons:
-	- Can be harder to score (see the `metrics` section below)
-	- Usually slightly more expensive than log likelihood evaluations, especially if they include sampling
 #### Choosing a prompt
 The prompt is going to define:
 - how much information is given to your model about the task
@@ -62,37 +40,105 @@ When defining your prompt, you need to be aware that:
 		- A costly way is to re-run the evaluation several times with prompt variations
 		- A less costly way is to run your evaluation once using a range of prompt formats allocated to different samples of equivalent difficulty
 - you can provide examples to your model to help it follow the expected format (using few-shot examples), and adding connector words helps this overall
-- but models now tend to overfit specific prompt formats.
-	- [This paper](https://arxiv.org/abs/2407.07890) is great on the topic, showing notably how some models can be over-evaluated because they have overfitted the test set **format**
-	- On the Open LLM Leaderboard 2, we've notably observed that Llama 3.2 and Qwen 2.5 are no longer following the format of the prompt provided in a few-shot setup for this reason.
 <Note title="Models can overfit prompt formats" emoji="⚠️" variant="warning">
-Recent research shows models can overfit specific prompt formats rather than learning the underlying task. [This paper](https://arxiv.org/abs/2407.07890) demonstrates how some models are over-evaluated because they've memorized test set formats. We've observed Llama 3.2 and Qwen 2.5 no longer following few-shot prompt formats for this reason.
 </Note>
-- for a number of metrics, you want a very constrained generation or output.
-  *You can learn more about this in the `Constraining model outputs` section of the [Model inference and evaluation](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/general-knowledge/model-inference-and-evaluation.md) page.*
-#### Choosing a metric
-If you are looking at **log-probabilities**, your metrics are going to be easy: you'll want to look at accuracy (how often the most likely choice is the best choice). It's important to normalize it by length (either character, token, or pmi). You could also look at perplexity, recall, or f1 score.
-For **generative** evaluations, your range of metrics is going to be wider.
-You'll need to
-1. decide if you compare generations as they are, or first normalize them with something.
-	- Normalizations can easily [be unfair if not designed well](https://huggingface.co/blog/open-llm-leaderboard-drop), but overall they still provide signal at the task level.
-<Sidenote>
-Normalizations can [be unfair if not designed well](https://huggingface.co/blog/open-llm-leaderboard-drop), though they generally provide useful signal. Design normalization rules carefully and test them across diverse model outputs.
 </Sidenote>
-	- They are very important for specific tasks, such as math evaluations, where you might want to extract your result from formatted outputs.
-	- They will also be important if you want to evaluate with added mechanisms for accuracy, such as Chain of Thought, as you'll need to remove the reasoning trace from the actual result
-2. decide how you compare the generation with the reference.
-   You could use anything ranging from match-based metrics (exact match, prefix match, etc) to summarization and translation metrics (ROUGE, BLEU, character n gram comparisons). For a list of existing metrics, you can look [here](https://github.com/huggingface/lighteval/wiki/Metric-List), I'll add a section later on which metric to use when.
-More generally, when picking your metric, you need to keep in mind what your task is really about. For some domains (ex: medical, chatbots with public interaction), you don't want to measure the average performance, but need a way to evaluate the **worst performance** you'll get (on medical quality of output, on toxicity, etc). (*To go further, take a look at this [blog](https://ehudreiter.com/2024/07/10/challenges-in-evaluating-llms/)*)
-#### Smart new tasks: what about functional testing?
 In the field of code, you want to evaluate generated programs not only on their semantics, but on their actual function. A good way to do so is therefore to check if code generated to follow a prompt passes correctly a suite of unit-tests designed to fit the task.
 This functionality approach is extremely promising, as it
@@ -100,15 +146,311 @@ This functionality approach is extremely promising, as it
 - therefore reducing overfitting
 - tests models on specific active capabilities
-<Note title="The promise of functional testing" emoji="✨" variant="success">
-Functional testing (like unit tests for code) offers major advantages:
-- Easier test case generation (often rule-based)
-- Reduces overfitting risk
-- Tests specific active capabilities
-- Extends beyond code to other domains (e.g., IFEval for instruction following)
 </Note>
-It's however an approach which requires creativity to be translated to text!
-A good example of this is IFEval, an evaluation benchmark which tests if models can follow instructions. It works by creating a number of formatting instructions (*Add this number of bullet points. Capitalize only one sentence.* etc), and strictly testing if the format is followed. More work is clearly needed to extend this idea to other features of text to analyze!

 import Note from "../../../components/Note.astro";
 import Sidenote from "../../../components/Sidenote.astro";
+import UsingHumanAnnotators from "../human-evaluation/using-human-annotators.mdx";
+### Dataset
+#### Using existing data
+- Use existing datasets, and assemble them differently
+You can aggregate existing data from different sources, evaluating a relevant capability for your task. A number of evaluation datasets are for example constructed from aggregating human evaluation datasets (such as MATH, LSAT, etc). In this case, follow the steps above.
+#### Creating a dataset manually
+<UsingHumanAnnotators />
+#### Creating a dataset synthetically
 - **Using synthetic data from models**: On this, you can check the very cool [Cosmopedia](https://huggingface.co/blog/cosmopedia) blog by cool HF colleagues! It's mostly studying how to create a synthetic training dataset, but similar techniques can be used for evaluation. Make sure to manually check/filter/inspect your dataset afterwards (following the above steps).
 - **Using rule-based techniques**: If your task allows, this is a very good way to get a virtually infinite supply of samples and avoid contamination! For some examples, you can look at [NPHardEval](https://arxiv.org/abs/2312.14890), [DyVal](https://arxiv.org/abs/2309.17167), [MuSR](https://arxiv.org/abs/2310.16049), [BabiQA](https://arxiv.org/abs/1502.05698), etc.
 #### Choosing a prompt
 The prompt is going to define:
 - how much information is given to your model about the task
 		- A costly way is to re-run the evaluation several times with prompt variations
 		- A less costly way is to run your evaluation once using a range of prompt formats allocated to different samples of equivalent difficulty
 - you can provide examples to your model to help it follow the expected format (using few-shot examples), and adding connector words helps this overall
+- for a number of metrics, you want a very constrained generation or output.
 <Note title="Models can overfit prompt formats" emoji="⚠️" variant="warning">
+Recent research shows models can overfit specific prompt formats rather than learning the underlying task. [This paper](https://arxiv.org/abs/2407.07890) is great on the topic, showing notably how some models can be over-evaluated because they have overfitted the test set **format**.
+On the Open LLM Leaderboard 2, we've notably observed that Llama 3.2 and Qwen 2.5 are no longer following the format of the prompt provided in a few-shot setup for this reason.
 </Note>
+#### Managing contamination
+In general, you should assume that a dataset publicly available on the internet is or will be contaminated.
+Solutions to mitigate this include:
+- providing a **canary string** in the evaluation set (like in [BigBench](https://github.com/google/BIG-bench)): it is a specific character combination that model creators can look for in their training sets, which would indicate that it contains an evaluation
+- providing evaluation sets in **[encrypted](https://arxiv.org/abs/2309.16575) or [gated](https://huggingface.co/datasets/Idavidrein/gpqa)** forms so that they can't be parsed easily by web crawlers - therefore not ending up accidentally in training sets
+- running [dynamic benchmarks](https://arxiv.org/abs/2104.14337): benchmarks regularly updated through time so that models can't "learn the answers by heart" (but it makes datasets more costly)
+- if you are running a benchmark, trying to [detect contamination](https://arxiv.org/abs/2311.06233) post-hoc (for example, by looking at the generation perplexity or designing adversarial versions of the prompts - however, no method is a foolproof contamination detection method)
+However, it's not because a dataset is contaminated that it won't still be interesting and have signal during training, as we saw in the ablations section.
+### Choosing an inference method for your model
+You'll need to choose what kind of inference method you need.
+#### Loglikelihood evaluations
+<Note title="Reminder">
+Using log-probabilities (MCQA, multi-choice question answer) is good for multiple choice question answers (usually to test model knowledge, or ability to disambiguate).
+- Pros:
+	- Makes sure that all models have access to the correct answer
+	- Provides a proxy for model "confidence" (and calibration)
+	- Fast to evaluate, especially when we ask the model to predict only one token (A/B/C/D the indices of the choices, or Yes/No, etc).
+	- Allow to get signal on small models' task performance
+- Cons:
+	- Slightly over-scores small models which would have generated something outside of the range of available choices if given free rein.
+	- Some models [favor specific choices based on the order in which they have been presented](https://arxiv.org/abs/2309.03882), which could lead to unrepresentative evaluations (unless you're re-running the evaluation n times by shuffling samples orders, which you should do for significance if you have the budget for!)
+</Note>
+<Note title="Tip: an easy speed up for MCQA evaluations">
+You can speed up your MCQA predictions by a lot if you make sure your model needs to predict only one token for the task.
+This way, instead of running your `number_of_choices` predictions (`context + choice 1`, `context + choice 2`, etc), you can simply run inference on `context` and compute the probability distribution on the full vocabulary (which will include all your one token choices) to get your logprobabilities of interest, and do this step in one pass.
+</Note>
+#### Generative evaluations
+However, nowadays most evaluations are generative: using generations (QA, question answering) is very good for any task where you want to test fluency, reasoning, or the ability of your model to actually answer questions. It's also the most relevant way to evaluate reasoning models.
+<Note title="Reminder">
+- Pros:
+	- Should actually correlates with LLM ability to generate fluent text, will most of the time be what people are actually interested in
+	- The only way to evaluate both closed and open source models
+- Cons:
+	- Can be harder to score (see below)
+	- More expensive than log likelihood evaluations, especially if they include sampling or reasoning models
+</Note>
+### Scoring
+If you are looking at **log-probabilities**, your metrics are going to be easy: you'll likely want to look at a variant of accuracy (how often the most likely choice is the best choice). It's important to normalize it by length (either character, token, or pmi). You could also look at perplexity, recall, or f1 score.
+If you're looking at generative evaluations, you'll need to decide if you compare generations as they are, or first normalize them with something.
+<Note title="Normalization">
+Normalizations can easily [be unfair if not designed well](https://huggingface.co/blog/open-llm-leaderboard-drop), but overall they still provide signal at the task level.
+They are very important for specific tasks, such as math evaluations, where you might want to extract your result from formatted outputs.
+They will also be important if you want to evaluate with added mechanisms for accuracy, such as Chain of Thought, as you'll need to remove the reasoning trace from the actual result
+</Note>
+Then, you'll need to select what to use to score your prediction, and this is where it gets trickyyy, so let's jump to the next chapter specifically on this!
+## The hardest part of evaluation: Scoring free form text
+### Automatically
+#### Metrics
+You could use anything ranging from match-based metrics (exact match, prefix match, etc) to summarization and translation metrics (ROUGE, BLEU, character n gram comparisons).
+TODO: Add shortcimings of different metrics
+More generally, when picking your metric, you need to keep in mind what your task is really about. For some domains (ex: medical, chatbots with public interaction), you don't want to measure the average performance, but need a way to evaluate the **worst performance** you'll get (on medical quality of output, on toxicity, etc).
+<Sidenote>
+To go further, take a look at this [blog](https://ehudreiter.com/2024/07/10/challenges-in-evaluating-llms/).
 </Sidenote>
+<Note title="Pros and cons of using automated metrics">
+Automated benchmarks have the following advantages:
+- **Consistency and reproducibility**: You can run the same automated benchmark 10 times on the same model and you'll get the same results (baring variations in hardware or inherent model randomness). This means that you can easily create fair rankings of models for a given task.
+- **Scale at limited cost**: They are one of the cheapest way to evaluate models at the moment.
+- **Understandability**: Most automated metrics are very understandable.
+  *Eg: an exact match will tell you if the generated text matches perfectly with the reference, and an accuracy score will tell you in how many cases the selected choice was the correct one (this will be a bit less the case for metrics such as `BLEU` or `ROUGE` for example).*
+However, they also present the following limitations:
+- **Reduced use on more complex tasks**: Automated benchmarks are working well for tasks where performance is easy to define and assess (for example, classification). More complex capabilities, on the other hand, are harder to decompose into well-defined and precise tasks.
+  *Eg: what does "good at math" mean? Is it being good at arithmetic? - at logic? - able to reason on new mathematical concepts?*
+  This led to the use of more **generalist** evaluations, which no longer decompose capabilities in sub-tasks, but assuming that general performance will be a **good proxy** for what we aim to measure.
+- **Contamination**: Once a dataset is published publicly in plain text, it will end up in model training datasets. This means that you have no guarantee when scoring a model that it has not parsed the evaluation data before.
+</Note>
+#### Using functional testing
 In the field of code, you want to evaluate generated programs not only on their semantics, but on their actual function. A good way to do so is therefore to check if code generated to follow a prompt passes correctly a suite of unit-tests designed to fit the task.
 This functionality approach is extremely promising, as it
 - therefore reducing overfitting
 - tests models on specific active capabilities
+It's however an approach which requires creativity to be translated to text!
+A good example of this is IFEval, an evaluation benchmark which tests if models can follow instructions. It works by creating a number of formatting instructions (*Add this number of bullet points. Capitalize only one sentence.* etc), and strictly testing if the format is followed. More work is clearly needed to extend this idea to other features of text to analyze!
+### With humans
+Human evaluation is simply asking humans to score predictions.
+Human evaluation is very interesting, because it's **flexibility** (if you define clearly enough what you are evaluating, you can get scores for about anything!), **uncontaminated** (If you ask humans to write new questions to test your system, they should not be present in your training data (hopefully)), and correlates well with human preference for obvious reasons.
+<Sidenote>
+However, when doing evaluation with humans, you need to make sure your annotators are diverse enough that your results generalizes.*
+</Sidenote>
+However, it also present a number of biases:
+- **First impressions bias**: Human evaluators tend to estimate the quality of answers [based on first impressions](https://arxiv.org/abs/2309.16349), instead of actual factuality or faithfulness.
+- **Tone bias**: Crowdsourced annotators are notably very sensitive to tone, and underestimate the number of factual or logical errors in an assertive answer. In other terms, if a model says wrong things in a confident tone, human evaluators are much less likely to notice it, which could skew ratings towards the more assertive models. (Expert annotators are less likely to fall prey to these biases.)
+- **Self-preference bias**: Humans are [most likely to prefer answers which appeal to their views or align with their opinions or errors](https://arxiv.org/abs/2310.13548), rather than answers which are factually correct.
+- **Identity bias**: People with different identities tend to have different values, and rate model answers very differently (for example on [toxicity](https://arxiv.org/abs/2205.00501))
+There are 3 main ways to do evaluation with paid annotators:
+- If **you don't have a dataset**, but want to explore a set of capabilities, you provide humans with a task and scoring guidelines (e.g *Try to make both these model output toxic language; a model gets 0 if it was toxic, 1 if it was not.*), and access to one (or several) model(s) that they can interact with, then ask them to provide their scores and reasoning.
+- If **you already have a dataset** (eg: a set of *prompts that you want your model to never answer*, for example for safety purposes), you preprompt your model with them, and provide the prompt, output and scoring guidelines to humans.
+- If **you already have a dataset and scores**, you can ask humans to review your evaluation method by doing [error annotation](https://ehudreiter.com/2022/06/01/error-annotations-to-evaluate/) (*it can also be used as a scoring system in the above category*). It's a very important step of testing new evaluation system, but it technically falls under evaluating an evaluation, so it's slightly out of scope here.
+Pros of systematic human evaluations, especially with paid annotators, are that you're **getting high quality and private data** adapted to your use case (especially if you rely on in house annotators), which are mostly **explainable** (scores obtained by the models will be explainable by the humans who gave them).
+However, it's more costly (especially as you'll most likely need rounds of annotations to adapt your guidelines) and does not scale well.
+Two other approaches exist to do human-based evaluation in a more casual way:
+- **Vibes-checks**: manual evaluations done by individuals to get an overall feeling of how well models perform on many use cases (from coding to quality of smut written). Often shared on Twitter and Reddit, results mostly constitute anecdotal evidence, and tend to be highly sensitive to confirmation bias (in other words, people tend to find what they look for). However, they can be [good starting point for your own use cases](https://olshansky.substack.com/p/vibe-checks-are-all-you-need).
+- **Arenas**: crowdsourced human evaluation to rank models. A well known example of this is the [LMSYS chatbot arena](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard), where community users are asked to chat with models until they find one is better than the other. Votes are then aggregated in an Elo ranking (a ranking of matches) to select which model is "the best".
+Pros of casual human evaluations are that they are cheap, scale better and allow to discover fun edge cases since you leverage user's creativity in a mostly unbounded manner, you can discover interesting edge cases. However, the obvious issues are that it's easy to game externally, you can't mitigate the **high subjectivity** easily, it's usually not representative of the broader population as since young western men are over re-represented on tech-sides of the internet (both in terms of topics explored and overall rankings).
+<Sidenote>
+it's hard to enforce a consistent grading from many community members using broad guidelines, especially since annotators preferences tend to be [culturally bound](https://arxiv.org/abs/2404.16019v1). One can hope that these effect is smoothed over by the sheer scale of the votes, through a "wisdom of the crowd" effect (see Galton's wikipedia page).
+</Sidenote>
+### With judge models
+Judge models are simply **neural network used to evaluate the output of other neural networks**. In most cases, they evaluate text generations.
+Judge models range from small specialized classifiers (think "spam filter", but for toxicity for example) to LLMs, either large and generalist or small and specialized. In the latter case, when using an LLM as a judge, you give it a prompt to explain how to score models (ex: `Score the fluency from 0 to 5, 0 being completely un-understandable, ...`).
+Model as judges allow to score text on complex and nuanced properties.
+For example, an exact match between a prediction and reference can allow you to test if a model predicted the correct fact or number, but assessing more open-ended empirical capabilities (like fluency, poetry quality, or faithfulness to an input) requires more complex evaluators.
+That's where models as judges come into play.
+They are used on 3 main tasks:
+- *Scoring a model generation*, on a provided scale, to assess a property of the text (fluency, toxicity, coherence, persuasiveness, etc).
+- *Pairwise scoring*: comparing a pair model outputs to pick the best text with respect to a given property
+- *Computing the similarity* between a model output and a reference
+*Note: In this document, I'll focus on the LLMs + prompt approach for now, but you should definitely check out how classifier judges work, as I think it can be fairly robust and well adapted to a number of use cases, and the recently introduced and promising reward model as judge approach (introduced in [this tech report](https://research.nvidia.com/publication/2024-06_nemotron-4-340b), and on which we have a small page [here](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/model-as-a-judge/what-about-reward-models.md))*
+#### Pros and cons of using judge-LLMs
+People in favor of judge LLMs have been claiming they provide better:
+- **Objectivity** when compared to humans: They automate empirical judgments in an objective and reproducible manner (theoretically - in my opinion, they add more subtle bias than they are worth)
+- **Scale and reproducibility**: They are more scalable than human annotators, which allows to reproduce scoring on large amounts of data (if you control for temperature).
+- **Cost**: They are cheap to instantiate, as they don't require to train a new model, and can just rely on good prompting and an existing high quality LLM. They are also cheaper than paying actual human annotators (capitalism...).
+- **Alignment with human judgments**: They are somehow correlated with human judgments.
+In my opinion, using LLM judges correctly is extremely tricky, and it's easy to be deceived for critical use cases:
+- LLM as judges seem objective, but they have many **hidden biases** that can be harder to detect than the ones in humans, since we're not as actively looking for them (see [model-as-a-judge/Tips and tricks]). Besides, there are ways to reduce human bias by designing survey questions in specific and statistically robust ways (which has been studied in sociology for about a century), where LLM-prompting is not as robust yet. Using LLMs to evaluate LLMs has been compared to creating an echo-chamber effect, by reinforcing biases subtly.
+- They are indeed scalable, but contribute to creating massive amounts of data which themselves need to be examined to ensure their quality (for example, you can improve the quality of LLM-judges by asking them to generate a thinking trace, or reasoning around their data, which makes even more new artificial data to analyse)
+- They are indeed cheap to instantiate, but paying actual expert human annotators is likely to give you qualitatively better results for your specific use cases.
+<Note title="Critical limitations of LLM judges" emoji="⚠️" variant="warning">
+Using LLM judges is extremely tricky:
+- **Hidden biases**: Harder to detect than human biases; creates echo-chamber effects
+- **Data overload**: Generates massive synthetic data needing quality examination
+- **False objectivity**: Seems objective but reinforces subtle biases
+- **Expert humans better**: For critical use cases, expert annotators provide higher quality
+See [Tips and tricks](./tips-and-tricks) for bias mitigation strategies.
 </Note>
+This section is a bit long, because you need to be well aware of their limitations: a lot of people are blindly jumping into using model judges because they seem easier, but then end up with uninsterpretable data with tricky bias to extract.
+If you want to give it a go, I suggest first reading this [very good guide](https://huggingface.co/learn/cookbook/en/llm_judge) (⭐) by Aymeric Roucher on how to setup your first LLM as judge!
+You can also try the [distilabel](https://distilabel.argilla.io/latest/) library, which allows you to generate synthetic data and update it using LLMs. They have a nice [tutorial](https://distilabel.argilla.io/latest/sections/pipeline_samples/papers/ultrafeedback/) applying the methodology of the [Ultrafeedback paper](https://arxiv.org/abs/2310.01377) as well as a [tutorial on benchmarking](https://distilabel.argilla.io/latest/sections/pipeline_samples/examples/benchmarking_with_distilabel/) implementing the Arena Hard benchmark.
+#### Getting a Judge-Model
+When using an existing LLM, you can go for [generalist, high capability models](https://arxiv.org/abs/2306.05685v4),  using [small specialist models](https://arxiv.org/abs/2405.01535) trained specifically to discriminate from preference data, or training your own.
+**Using a generalist LLM**
+With the introduction of more capable LLMs (such as ChatGPT), some researchers started exploring using big models as judges.
+<Note title="Closed vs open source judge models" emoji="⚖️" variant="warning">
+**Closed source models (Claude, GPT-o) tradeoffs:**
+Disadvantages:
+- **Non-reproducible**: Models can change without notice via API updates
+- **Black box**: Un-interpretable decision-making
+- **Privacy risks**: Data sent to third parties, potential leakage
+Advantages:
+- Easy access without local setup or hardware requirements
+**Open source models are closing the gap** while solving reproducibility and interpretability issues. Models like DeepSeek R1, gpt-oss, and the recent Qwen models are now competitive alternatives.
+</Note>
+You'll find a good cost analysis of model providers [here](https://huggingface.co/spaces/ArtificialAnalysis/LLM-Performance-Leaderboard) if you need help picking one.
+**Using a tiny specialized LLM judge model**
+You can also make the choice to use tiny specialized LLM judges. With often a couple billion parameters, they can run locally on most recent consumer hardware, while being trained from scratch or fine-tuned using instruction data. You often need to follow their specific prompt formats.
+Some existing models:
+- Flow-Judge-v0.1 ([weights](https://huggingface.co/collections/flowaicom/flow-judge-v01-66e6af5fc3b3a128bde07dec)), 3.8B parameters, a Phi-3.5-mini-instruct fine-tuned on a synthetic preference dataset
+- Prometheus ([weights](https://huggingface.co/prometheus-eval/prometheus-13b-v1.0), [paper](https://arxiv.org/abs/2310.08491)), 13B parameters, a model trained from scratch on synthetic preference dataset. A 7B parameter [v2](https://huggingface.co/prometheus-eval/prometheus-7b-v2.0) also exists, a Mistral-7B-Instruct-v0.2 fine-tune on a bigger synthetic preference dataset, with added weight merging
+- JudgeLM ([paper](https://arxiv.org/abs/2310.17631)), 7B to 33B parameters, models trained from scratch on synthetic preference datasets generated with a variety of models.
+**Training your own**
+You can also make the choice to train or fine-tune your own LLM-as-judge. (I would avoid doing this, unless you are on a very niche topic).
+You first need to gather preference data for your task of interest, which can come
+- From existing [human preference datasets](https://www.kaggle.com/competitions/lmsys-chatbot-arena)
+- From model generated preference data (which you can generate following the above tiny-model judges papers data sections, or get directly, for example from the Prometheus [preference](https://huggingface.co/datasets/prometheus-eval/Preference-Collection) and [feedback](https://huggingface.co/datasets/prometheus-eval/Feedback-Collection) collections).
+Then you need to decide whether to start from a small model to train from scratch, or from an existing model, that you can
+- distill into a new smaller model
+- quantize.
+- then fine-tune (using peft or adapter weights if the model is big and your training compute low) using the above data
+	- apparently [starting from a reward model works better than from an instruct model](https://x.com/dk21/status/1826292289930674590)
+#### Designing your evaluation prompt
+Once you've selected your model, you need to define what is the best possible prompt for your task.
+Some general guidelines I've come across online when designing the prompt itself are:
+- Provide a clear description of the task at hand:
+	- `Your task is to do X`.
+	- `You will be provided with Y`.
+- Provide clear instructions on the evaluation criteria, including a detailed scoring system if needed:
+	- `You should evaluate property Z on a scale of 1 - 5, where 1 means ...`
+	- `You should evaluate if property Z is present in the sample Y. Property Z is present if ...`
+- Provide some additional "reasoning" evaluation steps:
+	- `To judge this task, you must first make sure to read sample Y carefully to identify ..., then ...`
+- Specify the desired output format (adding fields will help consistency)
+	- `Your answer should be provided in JSON, with the following format {"Score": Your score, "Reasoning": The reasoning which led you to this score}`
+<Note title="Core prompt design principles" emoji="📝" variant="info">
+**Essential elements for effective judge prompts:**
+- **Clear task description**: Specify exactly what the judge needs to do
+- **Detailed criteria**: Provide explicit scoring scales with clear definitions
+- **Reasoning steps**: Guide the judge through the evaluation process
+- **Structured output**: Use JSON format for consistency and parsability
+</Note>
+You can and should take inspiration from [MixEval](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended/mix_eval/judge_prompts.pyy) or [MTBench](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended/mt_bench/judge_prompt_templates.py) prompt templates.
+Other tidbits:
+- Pairwise comparison [correlates better with human preference](https://arxiv.org/abs/2403.16950) than scoring, and is more robust generally.
+- If you really want a score, use an integer scale make sure you provide a detailed explanation for what [each score represents](https://x.com/seungonekim/status/1749289437165769177), or an additive prompt (`provide 1 point for this characteristic of the answer, 1 additional point if ...` etc)
+- Using one prompt per capability to score tends to give better and more robust results
+You can also improve accuracy using the following, possibly more costly, techniques:
+- **Few shot examples**: like in many other tasks, if you provide examples it can help its reasoning. However, this adds to your context length.
+- **Reference**: you can also enhance your prompt with a reference if present, which increases accuracy
+- **CoT**: [improves accuracy for older gen models](https://arxiv.org/abs/2212.08073), if you ask the model to output its chain of thought **before** the score (also observed [here](https://x.com/seungonekim/status/1749289437165769177))
+- **Multiturn analysis**: can improve [factual error detection](https://arxiv.org/abs/2305.13281)
+- Using **a jury** (many judges, where you pick an aggregate of the answers): [gives better results](https://arxiv.org/abs/2404.18796) than using a single model.
+	- It can be made considerably less costly by leveraging many smaller models instead of one big expensive model.
+	- You can also experiment with using one model with variations on temperature
+- Surprisingly, the community has found that adding stakes to the prompts (`answer correctly and you'll get a kitten`) can increase correctness. Your mileage may vary on this one, adapt to your needs.
+<Note title="High-stakes evaluation requires rigor" emoji="⚠️" variant="warning">
+For production or critical use cases, use methodologies transferred from the humanities:
+- Compute inter-annotator agreement metrics
+- Use proper survey design methodology to mitigate bias
+</Note>
+However, most people don't really want a reproducible and high quality unbiased eval, and will be happy with quick and dirty evaluation through OK-ish prompts. (Which is an OK situation to be in! Just depends on the consequences attached).
+#### Evaluating your evaluator
+Before using a judge-LLM in production or at scale, you want to evaluate its quality for your task, to make sure its scores are actually relevant and useful for you.
+<Note>
+This will be easier to do if it predicts binary outputs, because you'll be able to interpretable classification metrics (accuracy/recall/precision). If it predicts scores on a scale, it will be much harder to estimate the quality of the correlation with a reference.*
+</Note>
+So, once you have selected your model judge and its prompt, you'll need to do the following.
+1. **Pick your baseline**
+You'll need to compare your evaluator judgments to a baseline: it can be human annotations, the output of another judge model that you know is qualitative on your task, a gold truth, itself with another prompt, etc.
+<Note title="Quality over quantity for baseline" emoji="🎯" variant="info">
+You don't need many baseline examples (50 can suffice), but they must be:
+- **Representative**: Cover the full range of your task
+- **Discriminative**: Include edge cases and challenging examples
+- **High quality**: Use the best reference data you can obtain
+</Note>
+2. **Pick your metric**
+Your metric will be used to compare your judge's evaluations with your reference.
+In general, this comparison is considerably easier to do if your model is predicting binary classes or doing pairwise comparison, as you'll be able to compute accuracy (for pairwise comparison), or precision and recall (for binary classes), which are all very easy to interpret metrics.
+Comparing the correlation of scores with human or model scoring will be harder to do. To understand why in more detail, I advise you to read this cool [blog section on the topic](https://eugeneyan.com/writing/llm-evaluators/#key-considerations-before-adopting-an-llm-evaluator).
+In general, if you're a bit lost about what metrics to pick when (in terms of models, metrics, ...), you can also look at [this interesting graph](https://eugeneyan.com/assets/llm-eval-tree.jpg) from [the same above blog](https://eugeneyan.com/writing/llm-evaluators/) ⭐.
+3. **Evaluate your evaluator**
+For this step, you simply need to use your model and its prompt to evaluate your test samples! Then, once you get the evaluations, use your above metric and reference to compute a score for your evaluations.
+You need to decide what your threshold for acceptance is. Depending on how hard your task is, you can aim for 80% to 95% accuracy, if you're doing pairwise comparison. Regarding correlations (if you're using scores), people in the literature tend to seem happy with 0.8 Pearson correlation with a reference. However, I've seen some papers declare that 0.3 indicates a good correlation with human annotators (^^") so ymmv.
+#### Tips and tricks
+**Mitigating well known biases of LLM as judges**
+<Note title="Known LLM judge biases and mitigations" emoji="⚠️" variant="warning">
+- **Lack of internal consistency**: a judge might give you different judgments if you prompt it several times (if the temperature is not 0)
+	- You can mitigate this by doing self-consistency prompting of your judge, prompting it multiple times and keeping the majority output
+- **Self-preference**: they tend to [favor their own outputs](https://arxiv.org/abs/2404.13076) when scoring answers
+	- You can mitigate this by using a jury
+- **Blindness to input perturbation**: models are bad at identifying [perturbated input](https://arxiv.org/abs/2406.13439) and tangentially [bad at providing consistent score ranges](https://twitter.com/aparnadhinak/status/1748368364395721128) (extended experiments on this [here](https://github.com/LeonEricsson/llmjudge/blob/main/README.md)). For example, if asked to rank text quality on text where noise has been added on a consistent scale, the grades predicted do not reflect this scale.
+	- You can mitigate this by
+		- asking the model to explain its reasoning [before providing a score](https://twitter.com/seungonekim/status/1749289437165769177)
+		- providing a coherent grading scale in the prompt.
+- **Position-bias**: they tend to [favor specific answer positions](https://arxiv.org/abs/2306.05685). For example, when presented with pairwise comparisons, Claude and GPT3.5 tend to quite systematically prefer the first choice, or the second choice
+	- You can mitigate this by
+		- switching answer positions randomly
+		- computing the log-probabilities of all possible choices to get a normalized answer
+- **Verbosity-bias** (or length-bias): they tend to like more verbose answers
+	- You can mitigate this by [accounting for the answer difference in length](https://arxiv.org/abs/2404.04475)
+- **Debatable consistency [with human answers](https://arxiv.org/abs/2308.15812):**
+	- However, it's also [debatable if non-expert humans are a good baseline for absolutely all evaluations](https://arxiv.org/abs/2202.06935). For some specific domains (medical, legal, mathematics, etc), relying on non-expert human annotators is as bad a baseline as using an LLM directly.
+- **Format bias**: they tend to fail to evaluate accurately if the prompt format [is too far away](https://arxiv.org/abs/2310.17631) from what it's been trained with. For example, a model trained to do pairwise comparison with an added reference answer will fail if said answer is not provided, and failures will also occur the other way around.
+	- You can mitigate this by paying attention to the training prompt format (if the model was instruction tuned) and ensuring you follow it.
+</Note>
+**Picking correct tasks for an LLM judge**
+LLM evaluators:
+- are **bad at identifying hallucinations** in general, particularly what are called partial hallucinations (which look close to the ground truth but are actually slightly different) (see [this](https://arxiv.org/abs/2305.11747) and [this](https://arxiv.org/abs/2303.08896))
+- have a low to OK-ish correlation with human annotators on [summarization](https://arxiv.org/abs/2304.02554) ([here too](https://arxiv.org/abs/2303.16634)), [faithfulness](https://arxiv.org/abs/2307.16877), and are not consistently correlated with human judgement more broadly against [a scope of tasks](https://arxiv.org/abs/2406.18403)
+#### What about Reward Models?
+Reward models learn to predict a score from human annotations for given prompt/completion pairs. The end goal is for them to do predictions aligned with human preference.
+Once trained, these models can then be used to improve other models, by acting as a a reward function which is a proxy for human judgment.
+The most common type of reward model is the Bradley-Terry model, which outputs a single **pairwise score**, following:
+$$p(\text{completion b is better than completion a}) = \text{sigmoid}(\text{score}_b - \text{score}_a)$$
+This model is trained using only pairwise comparisons of completions, which are easier to collect than scores, but can only compare several completions for one prompt, and not completions across prompts.
+Other models have expanded on this approach to predict a more nuanced probability that a completion is better than the other one ([example](https://huggingface.co/RLHFlow/pair-preference-model-LLaMA3-8B)).
+This allows them to (theoretically) judge subtle differences between completions, at the cost of not being able to easily save and compare many different scores across prompts for the same test set. In addition, context length and memory limits can become an issue when comparing too long completions.
+Some reward models such as [SteerLM](https://arxiv.org/abs/2311.09528) output **absolute scores**, which can be used to evaluate completions directly without the need for pairwise comparisions. These models can be easier to use for evaluation, but are also harder to collect data for, as absolute scores tend to be less stable than pairwise scores in human preferences.
+More recently, models have been proposed that output both absolute and relative scores, such as [HelpSteer2-Preference](https://arxiv.org/abs/2410.01257) and [ArmoRM](https://arxiv.org/abs/2406.12845).
+<Note title="How do I use a Reward Model for Evaluation?">
+Given a dataset of prompts, we can generate completions from a language model and ask a reward model to score them.
+For models that give absolute scores, the resulting scores can be averaged to get a reasonable summary score.
+However, in the more common case of relative scores, the average reward can be biased by outliers (a few very good or very bad completions) as different prompts may have inherently different reward scales (some prompts are way harder or easier than others).
+<Sidenote>
+For relative scores, don't just average raw rewards—outliers and varying prompt difficulty scales will bias results. Use win rates or win probabilities against a reference instead.
+</Sidenote>
+Instead, we can use
+- win rates: take a reference set of completions and calculate the percentage of completions from the model that are ranked higher than the reference completions. It is slightly more granular.
+- win probabilities: the mean probability of the completions being better than the reference completions, which can give a more fine-grained and smoothly changing signal.
+</Note>
+<Note title="Pros and Cons of Reward Models">
+Reward models are typically:
+- **Very fast**: Getting a score is as simple as running a forward pass of a relatively small model once (since we only get a score, and not long text, contrary to judge-LLMs)
+- **Deterministic**: The same scores will be reproduced through the same forward pass
+- **Unlikely to suffer from positional bias**: As most models take only one completion, they can not be influenced by the order. For pairwise models, positional bias is often also minimal, as long as the training data was balanced with respect to containing both first and second answers as being the best.
+- **Require no prompt engineering**: since the model will simply output a score from one or two completions depending on preference data it's been trained on.
+On the other hand they:
+- **Require specific fine-tuning**: This can be a relatively costly step, and elthough they inherit many capabilities from a base model, they may still perform poorly on tasks that are out of the training distribution.
+- **Loose efficiency when used both in reinforcement learning and evaluation** (or when using direct alignment algorithms on datasets that are similar to the training data of the reward model), as the language model may overfit to the reward model's preferences.
+</Note>
+<Note title="Going further">
+- A good place to find high performing models is the [RewardBench Leaderboard](https://huggingface.co/spaces/allenai/reward-bench).
+- You can look at how reward models have been used in the [Nemotron](https://arxiv.org/abs/2406.11704) paper.
+- For reward models that rate single prompts and completions, you can cache the scores of many reference models and easily see how a new model performs.
+- Tracking of win rates or probabilities over training, e.g. as in [this](https://arxiv.org/abs/2410.11677v1) paper, can allow you to detect model degradation and select optimal checkpoints.
+</Note>

app/src/content/chapters/automated-benchmarks/tips-and-tricks.mdx DELETED Viewed

@@ -1,73 +0,0 @@
----
-title: "Automated Benchmarks: Tips and tricks"
----
-import Note from "../../../components/Note.astro";
-import Sidenote from "../../../components/Sidenote.astro";
-### Pros and cons of using automated benchmarks
-Automated benchmarks have the following advantages:
-- **Consistency and reproducibility**: You can run the same automated benchmark 10 times on the same model and you'll get the same results (baring variations in hardware or inherent model randomness). This means that you can easily create fair rankings of models for a given task.
-- **Scale at limited cost**: They are one of the cheapest way to evaluate models at the moment.
-- **Understandability**: Most automated metrics are very understandable.
-  *Eg: an exact match will tell you if the generated text matches perfectly with the reference, and an accuracy score will tell you in how many cases the selected choice was the correct one (this will be a bit less the case for metrics such as `BLEU` or `ROUGE` for example).*
-- **Dataset quality**: A number of automated benchmarks are using expert generated datasets or pre-existing high quality data (like MMLU or MATH). However, this does not mean these datasets are perfect: for MMLU, several errors have been identified in samples afterwards, from parsing issues to actually non-sensical questions, leading to the creation of several follow-up datasets, like MMLU-Pro and MMLU-Redux.
-<Sidenote>
-Several errors in MMLU (parsing issues, nonsensical questions) led to improved versions like MMLU-Pro and MMLU-Redux. Always inspect benchmark samples manually before relying on them for evaluation.
-</Sidenote>
-However, they also present the following limitations:
-- **Reduced use on more complex tasks**: Automated benchmarks are working well for tasks where performance is easy to define and assess (for example, classification). More complex capabilities, on the other hand, are harder to decompose into well-defined and precise tasks.
-  *Eg: what does "good at math" mean? Is it being good at arithmetic? - at logic? - able to reason on new mathematical concepts?*
-  This led to the use of more **generalist** evaluations, which no longer decompose capabilities in sub-tasks, but assuming that general performance will be a **good proxy** for what we aim to measure.
-- **Contamination**: Once a dataset is published publicly in plain text, it will end up in model training datasets. This means that you have no guarantee when scoring a model that it has not parsed the evaluation data before.
-### Tips and tricks
-#### Managing contamination
-In general, you should assume that a dataset publicly available on the internet is or will be contaminated.
-<Note title="Assume contamination" emoji="🔍" variant="warning">
-You should assume that any dataset publicly available on the internet is or will be contaminated in model training data. Design your evaluation strategy with this assumption in mind.
-</Note>
-Solutions to mitigate this include:
-- providing a **canary string** in the evaluation set (like in [BigBench](https://github.com/google/BIG-bench)): it is a specific character combination that model creators can look for in their training sets, which would indicate that it contains an evaluation
-- providing evaluation sets in **[encrypted](https://arxiv.org/abs/2309.16575) or [gated](https://huggingface.co/datasets/Idavidrein/gpqa)** forms so that they can't be parsed easily by web crawlers - therefore not ending up accidentally in training sets
-- running [dynamic benchmarks](https://arxiv.org/abs/2104.14337): benchmarks regularly updated through time so that models can't "learn the answers by heart" (but it makes datasets more costly)
-- if you are running a benchmark, trying to [detect contamination](https://arxiv.org/abs/2311.06233) post-hoc (for example, by looking at the generation perplexity or designing adversarial versions of the prompts - however, no method is a foolproof contamination detection method)
-<Sidenote>
-Even contaminated datasets can provide useful signal during training. Performance improvements on contaminated benchmarks often correlate with genuine capability improvements, though the absolute scores may be inflated.
-</Sidenote>
-However, it's not because a dataset is contaminated that it won't still be interesting and have signal during training.
-####  Tip: an easy speed up for MCQA evaluations
-You can speed up your MCQA predictions by a lot if you make sure your model needs to predict only one token for the task.
-This way, instead of running your `number_of_choices` predictions (`context + choice 1`, `context + choice 2`, etc), you can simply run inference on `context` and compute the probability distribution on the full vocabulary (which will include all your one token choices) to get your logprobabilities of interest, and do this step in one pass.
-<Note title="Speed optimization for MCQA" emoji="⚡" variant="success">
-Speed up MCQA evaluations by using single-token choices. Instead of running N predictions for N choices, run inference once on the context and examine the probability distribution over all vocabulary tokens (which includes your choices). This is how `lighteval` achieves fast MCQA evaluation.
-</Note>
-(That's how we do it in `lighteval`).
-#### What to do if you get unexpectedly bad results on generative evaluations
-The first thing to do is always to inspect your model generations in detail. Some frequent things to look for when troubleshooting are:
-- too strict model output parsing (before computing the metric) which leads to the answer being lost
-    - Fixing: adapt your parsing
-- unability of the models to follow your output format in few shot (frequent in recent models trained with instructions data, like llama 3.2 or Qwen 2.5)
-    - Fixing: either adapt your prompt format, or just assume that models should be able to follow it in few shot
-- exceedingly verbose model which never gets to the correct answer (more frequent in long context models and something we observed with Qwen and CommandR models)
-    - Fixing: either increase the allowed context length, add instructions to be concise in the task prompt, or just assume that models should be able to answer succinctly

app/src/content/chapters/{2025-evaluations-for-useful-models.mdx → general-knowledge/2025-evaluations-for-useful-models.mdx} RENAMED Viewed

@@ -2,8 +2,8 @@
 title: "2025 evaluations"
 ---
-import Note from "../../components/Note.astro";
-import Sidenote from "../../components/Sidenote.astro";
 You can evaluate **specific capabilities** on their own - it's usually quite interesting to get signal when training, or when comparing base/pretrained models. (However, if you select and validate your training methods with the following evaluations, reporting on them on the final model is slightly biased as you have already oriented your training method towards good results on them).

 title: "2025 evaluations"
 ---
+import Note from "../../../components/Note.astro";
+import Sidenote from "../../../components/Sidenote.astro";
 You can evaluate **specific capabilities** on their own - it's usually quite interesting to get signal when training, or when comparing base/pretrained models. (However, if you select and validate your training methods with the following evaluations, reporting on them on the final model is slightly biased as you have already oriented your training method towards good results on them).

app/src/content/chapters/{picking-your-evaluation.mdx → general-knowledge/picking-your-evaluation.mdx} RENAMED Viewed

@@ -4,11 +4,8 @@ title: "Picking good automatic evaluations for pretraining"
 import Note from "../../../components/Note.astro";
 import Sidenote from "../../../components/Sidenote.astro";
-import HtmlEmbed from "../../../components/HtmlEmbed.astro";
-## What Makes a Task "Fine"?
-Covering all 7000+ languages spoken over the world would be monumental endeavor, so we settled on using **9 languages** that offered diversity in script, language family and resource availability: **Chinese, French, Arabic, Russian, Thai, Hindi, Turkish, Swahili, and Telugu**.
 For these languages, we collected all available tasks that we could find, implementing a total of **185 tasks across languages** in [LightEval](https://github.com/huggingface/lighteval), HuggingFace's model evaluation library.
@@ -24,8 +21,6 @@ For evaluation diversity, we aimed to assess a broad range of model capabilities
 We consider that tasks provide a reliable signal if they provide a dependable score. This means the score should be above the random baseline, increase as training progresses, show low variability across different seeds, and provide consistent model ranking at each training step<d-footnote>For similar sized models trained with the same hyperparameters on the same amount of data.</d-footnote>.
-### Finding how much signal our tasks give during pre-training
 To thoroughly examine the signal our tasks provide, we trained many 1.5B parameter models for each language, using 30B tokens from subsets of the supported languages of the five largest openly available multilingual web datasets. These models were trained with the same hyperparameters and tokenizer. We then evaluated them at regular checkpoint intervals on the collected tasks (with no instruction and no system prompt in a 0-shot setting).
 This process required multiple evaluation runs for each task due to iterations on its implementation, resulting in a total of **73 000 GPU hours consumed** 🔥!
@@ -101,26 +96,7 @@ We had no strict minimum value requirement for this property, instead using it t
 </div>
-## Important properties of evaluation impacting stability
-Now that we covered what we were looking for in our tasks, let's examine two important aspects that can affect the above properties: task formulations and metric choice.
-<Note>Both of these aspects are thoroughly described and studied in the brilliant OLMES paper [Gu et al., 2024](https://arxiv.org/abs/2406.08446), which greatly inspired our work.</Note>
-### Task Formulations
-The way tasks are presented to the model is crucial, particularly for multiple-choice (MC) tasks. In these scenarios, we must carefully determine how the choices are displayed and what the model is expected to predict.
-There are two common approaches: **Cloze Formulation** (CF) and **Multi-Choice Formulation** (MCF). In CF, choices are not provided in context, allowing the model to predict each option directly. In contrast, MCF presents the choices in the prompt, using A/B/C/D prefixes, with the targets being those letter prefixes.
-It's important to know that:
-- The choice of formulation significantly impacts task scores (see [the release blog of the Open LLM Leaderboard 2](https://huggingface.co/spaces/open-llm-leaderboard/blog)).
-- Both formulations **behave very differently during training**. As noted by both OLMES [Gu et al., 2024](https://arxiv.org/abs/2406.08446) and DataComp-LM [Li et al., 2024](https://arxiv.org/abs/2406.11794), when employing MCF, task scores initially show random performance over extended training periods before experiencing a sudden increase. Conversely, with CF, task scores improve right from the beginning but tend to plateau relatively early.
-Therefore, we decided to utilize CF for task selection and MCF for later evaluation of major open source models, as they have generally undergone enough training for these evaluations to have a signal.
-### Metrics
 As the targets in CF of multiple choice tasks are choices themselves, each target can have a different number of tokens, characters, and unconditional probability (probability of generating the choice without a context prefix).
@@ -129,15 +105,15 @@ As the targets in CF of multiple choice tasks are choices themselves, each targe
 To account for this, we consider the following accuracy variations:
 - **Accuracy** :
-  `acc` = <d-math>\underset{i}{\arg\max}(ln(P (a_i|q)))</d-math>
 - **Accuracy normalized over character length** :
-  `acc_char` = <d-math> \underset{i}{\arg\max}\frac{ln(P (a_i|q))}{num\_characters(a_i)}</d-math>
 - **Accuracy normalized over token length** :
-  `acc_token` = <d-math> \underset{i}{\arg\max}\frac{ln(P (a_i|q))}{num\_tokens(a_i)}</d-math>
 - **PMI Accuracy** :
-  `acc_pmi` = <d-math> \underset{i}{\arg\max}ln\frac{P (a_i|q)}{P (a_i|u)}</d-math>, where <d-math>u =</d-math>''Answer:''
-Where <d-math>a_i</d-math> is the answer choice <d-math>i</d-math>, <d-math>q</d-math> is a question prompt and <d-math>P (a_i|q)</d-math> is the probability of having <d-math>a_i</d-math> follow <d-math>q</d-math>. For more details see [Gu et al., 2024](https://arxiv.org/abs/2406.08446) and [Biderman et al., 2024](https://arxiv.org/abs/2405.14782).
 <Note>`acc_pmi` metric measures how much more likely a model is to predict A_i if provided with question context compared to if there was no context at all. This can be useful if the correct choice contains generally unlikely tokens, making the model less likely to choose such an answer.</Note>
@@ -148,56 +124,14 @@ For our generative tasks on the other hand, we used the following metrics:
 For both generative metrics, minor preprocessing is applied to remove articles and punctuation, and lowercase the text.
-## The Fine selection
-With our goals and evaluation setup properly defined, we proceeded with **task selection**!
-We reviewed tasks one by one, choosing based on the quantified properties. For each language, we aimed to have at least one task for each of the four categories outlined above. Additionally we wanted to have at least 1 generative task for each language.
-In cases where multiple versions of a task existed (e.g., MMLU with different translation methods or native versions), we **prioritized native versions** as long as their metrics were reasonable, followed by human translations of English tasks. If no such version was available, we made our selection entirely based on metrics.
-Thus, **after removing about half of the tasks**, we arrived at **96 final ones**, forming "FineTasks."
-### Explore tasks
-Use the dropdowns below to navigate the list of tasks and how different metrics affect them.
-<div id="fine-tasks-results"></div>
-All tasks from the selection **comply with the criteria** outlined in previous sections, with the only exception being indicqa_tel, which we chose to include to ensure we had at least one generative task for Telugu. Overall we managed to cover all task categories for each language (the only exception being Thai Reasoning, where all tasks were unfortunately too noisy with low monotonicity to consider them).
-One of the **biggest surprises** was that some tasks, even when translated using the same method, were **reliable in one language but not in others**. This was evident with xWinograd, which worked quite well for Russian but did not meet our conditions for French. An even more extreme example was XNLI, which performed well for 6 out of 7 languages, failing to satisfy the reliability properties for Chinese. We had to test four different implementations before finding a reliable version, which, interestingly, was the only one that was created by native speakers and not machine translated.
-Feel free to use the dropdowns below to explore the evolution of scores over training for all tested tasks and metrics.
-<div class="task-signal-plot" data-language="French" data-task="frenchbench_hellaswag_fra_cf" data-show-controls="true" data-metric="acc_norm_token" data-group-seeds="true" data-title=""></div>
-### Metrics recommendation
 Selecting the best evaluation metrics proved to be a **challenging task**. Not only is there no single metric that consistently outperforms the rest, but we often encountered situations where one metric had better monotonicity while another had a higher signal-to-noise ratio. In such cases, we typically made our decision based on the selected metric for tasks' implementation in a different language. We are aware that such hand-picking is often not possible and thus offer the following recommendations:
-#### Multichoice Tasks
 - We found **base accuracy** to perform well for tasks with answer options varying subtly (e.g. Yes/No/Also), particularly NLI tasks. In such cases, where the answer options are often each a single token, the base accuracy is advisable to use.
 - While OLMES [Gu et al., 2024](https://arxiv.org/abs/2406.08446) recommends using PMI for tasks with unusual words, we found **PMI** to be highly effective for "difficult" reasoning and knowledge tasks like AGIEVAL or MMLU. In these cases, PMI provided the best results and was often the only metric delivering performance above random. That said, PMI was, on average, the weakest metric across all other tasks, while also being two times more expensive to compute. We therefore only recommend its use for complex reasoning and knowledge tasks.
 - The metrics we found to be **most reliable overall** were length normalization metrics (token or character-based). However, the best choice was dependent on language, rather than being consistent for a given task. Due to that, we recommend using the maximum of acc_char and acc_token for the most reliable results.<d-footnote>Note that acc_token is heavily tokenizer dependent. On our ablations all models were trained using the same tokenizer.</d-footnote>
-#### Generative Tasks
 For **generative metrics**, the choice is clearer: we suggest using the F1 score unless exact matching is required, as in math-related tasks. F1 is generally less noisy and more resilient to small changes in the generations.
-## Open/Closed Source models tackle FineTasks
-Since we spent a lot of time and compute on task selection, we were interested in how well major **open-source** models would do on FineTasks. Given that our evaluation suite primarily targets pretrained models, we focused on these, with a few exceptions for models that don't offer a base (pretrained) version. These exceptions were included mainly out of curiosity, and their results should be interpreted with **caution**. Such models may significantly outperform other models due to the inclusion of supervised fine-tuning (SFT) data.
-To assess the multilingual performance disparity between open-source and closed-source models, we expanded our selection by adding a closed source model: **gpt-4o-mini**.
-As outlined in the task formulations, we are using MCF for this evaluation and employing a 5-shot approach, as recommended by OLMES [Gu et al., 2024](https://arxiv.org/abs/2406.08446) (and made possible by the large context size of the models).
-### Computing a global "multilingual" score
-In the previous sections, we treated each task independently. However, to determine an overall "multilingual" score of a model, we need to **aggregate** the results from these tasks. We begin by **rescaling** the individual task scores in line with the OpenLLM leaderboard [Fourrier et al., 2024](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard). Then, we **average the scores** across task types (GK, RES, etc) for each language separately. To compute the score for each language, we take the average of the task type scores.<d-footnote>We first average by task type to properly measure all model capabilities without letting a single category dominate.</d-footnote>
-For the final global "multilingual" score we followed a different approach. Instead of averaging the language scores directly, we **ranked the model's performance across languages** in comparison to other models and then averaged those rank scores. This method ensures that the result reflects the overall model's performance across all languages, preventing an exceptionally high score in one language from skewing the final outcome.

 import Note from "../../../components/Note.astro";
 import Sidenote from "../../../components/Sidenote.astro";
+This section mostly comes from the FineTasks blog, which describes how our FineWeb team designed a method to select the best evaluations for pre-training ablations, across 9 languages!
 For these languages, we collected all available tasks that we could find, implementing a total of **185 tasks across languages** in [LightEval](https://github.com/huggingface/lighteval), HuggingFace's model evaluation library.
 We consider that tasks provide a reliable signal if they provide a dependable score. This means the score should be above the random baseline, increase as training progresses, show low variability across different seeds, and provide consistent model ranking at each training step<d-footnote>For similar sized models trained with the same hyperparameters on the same amount of data.</d-footnote>.
 To thoroughly examine the signal our tasks provide, we trained many 1.5B parameter models for each language, using 30B tokens from subsets of the supported languages of the five largest openly available multilingual web datasets. These models were trained with the same hyperparameters and tokenizer. We then evaluated them at regular checkpoint intervals on the collected tasks (with no instruction and no system prompt in a 0-shot setting).
 This process required multiple evaluation runs for each task due to iterations on its implementation, resulting in a total of **73 000 GPU hours consumed** 🔥!
 </div>
+#### Metrics
 As the targets in CF of multiple choice tasks are choices themselves, each target can have a different number of tokens, characters, and unconditional probability (probability of generating the choice without a context prefix).
 To account for this, we consider the following accuracy variations:
 - **Accuracy** :
+  `acc` = $\underset{i}{\arg\max}(ln(P (a_i|q)))$
 - **Accuracy normalized over character length** :
+  `acc_char` = $\underset{i}{\arg\max}\frac{ln(P (a_i|q))}{num\_characters(a_i)}$
 - **Accuracy normalized over token length** :
+  `acc_token` = $\underset{i}{\arg\max}\frac{ln(P (a_i|q))}{num\_tokens(a_i)}$
 - **PMI Accuracy** :
+  `acc_pmi` = $\underset{i}{\arg\max}ln\frac{P (a_i|q)}{P (a_i|u)}$, where $u =$ ''Answer:''
+Where $a_i$ is the answer choice $i$, $q$ is a question prompt and $P (a_i|q)$ is the probability of having $a_i$ follow $q$. For more details see [Gu et al., 2024](https://arxiv.org/abs/2406.08446) and [Biderman et al., 2024](https://arxiv.org/abs/2405.14782).
 <Note>`acc_pmi` metric measures how much more likely a model is to predict A_i if provided with question context compared to if there was no context at all. This can be useful if the correct choice contains generally unlikely tokens, making the model less likely to choose such an answer.</Note>
 For both generative metrics, minor preprocessing is applied to remove articles and punctuation, and lowercase the text.
 Selecting the best evaluation metrics proved to be a **challenging task**. Not only is there no single metric that consistently outperforms the rest, but we often encountered situations where one metric had better monotonicity while another had a higher signal-to-noise ratio. In such cases, we typically made our decision based on the selected metric for tasks' implementation in a different language. We are aware that such hand-picking is often not possible and thus offer the following recommendations:
+➡️ Multichoice Tasks
 - We found **base accuracy** to perform well for tasks with answer options varying subtly (e.g. Yes/No/Also), particularly NLI tasks. In such cases, where the answer options are often each a single token, the base accuracy is advisable to use.
 - While OLMES [Gu et al., 2024](https://arxiv.org/abs/2406.08446) recommends using PMI for tasks with unusual words, we found **PMI** to be highly effective for "difficult" reasoning and knowledge tasks like AGIEVAL or MMLU. In these cases, PMI provided the best results and was often the only metric delivering performance above random. That said, PMI was, on average, the weakest metric across all other tasks, while also being two times more expensive to compute. We therefore only recommend its use for complex reasoning and knowledge tasks.
 - The metrics we found to be **most reliable overall** were length normalization metrics (token or character-based). However, the best choice was dependent on language, rather than being consistent for a given task. Due to that, we recommend using the maximum of acc_char and acc_token for the most reliable results.<d-footnote>Note that acc_token is heavily tokenizer dependent. On our ablations all models were trained using the same tokenizer.</d-footnote>
+➡️ Generative Tasks
 For **generative metrics**, the choice is clearer: we suggest using the F1 score unless exact matching is required, as in math-related tasks. F1 is generally less noisy and more resilient to small changes in the generations.

app/src/content/chapters/human-evaluation/basics.mdx DELETED Viewed

@@ -1,88 +0,0 @@
----
-title: "Human Evaluation: Basics"
----
-import Note from "../../../components/Note.astro";
-import Sidenote from "../../../components/Sidenote.astro";
-Human evaluation is simply asking humans to evaluate models. In this document, we'll look at post-hoc evaluation: your model has been trained, you have a given task in mind, and humans are providing scores.
-### Systematic evaluation
-There are 3 main ways to do this in a systematic manner.
-If **you don't have a dataset**, but want to explore a set of capabilities, you provide humans with
-- a task and scoring guidelines (e.g *Try to make both these model output toxic language; a model gets 0 if it was toxic, 1 if it was not.*)
-- access to one (or several) model(s) that they can interact with,
-then ask them to provide their scores and reasoning.
-If **you already have a dataset** (eg: a set of *prompts that you want your model to never answer*, for example for safety purposes), you preprompt your model with them, and provide the prompt, output and scoring guidelines to humans.
-Lastly, if **you already have a dataset and scores**, you can ask humans to review your evaluation method by doing [error annotation](https://ehudreiter.com/2022/06/01/error-annotations-to-evaluate/) (*it can also be used as a scoring system in the above category*). It's a very important step of testing new evaluation system, but it technically falls under evaluating an evaluation, so it's slightly out of scope here.
-Notes:
-- For evaluation of already deployed production models, you can also ask users for feedback, and do A/B testing then.
-- [AI audits](https://arxiv.org/abs/2401.14462) (external systematic evaluation of models) are usually human based, but out of scope for this document.
-### Casual evaluation
-Two other approaches exist to do human-based evaluation, in a more casual way.
-**Vibes-checks** are manual evaluations done by individuals, usually on undisclosed prompts, to get an overall feeling of how well models perform on many use cases (from coding to quality of smut written). Often shared on Twitter and Reddit, results mostly constitute anecdotal evidence, and tend to be highly sensitive to confirmation bias (in other words, people tend to find what they look for). However, they can be [good starting point for your own use cases](https://olshansky.substack.com/p/vibe-checks-are-all-you-need).
-<Sidenote>
-While vibe-checks are anecdotal and subject to confirmation bias, systematic approaches like [Wolfram Ravenwolf's comparisons](https://olshansky.substack.com/p/vibe-checks-are-all-you-need) can provide useful starting points for identifying use cases to evaluate formally.
-</Sidenote>
-**Arenas** are crowdsourced human evaluation to rank models.
-A well known example of this is the [LMSYS chatbot arena](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard), where community users are asked to chat with models until they find one is better than the other. Votes are then aggregated in an Elo ranking (a ranking of matches) to select which model is "the best".
-### Pros and cons of human evaluation
-Human evaluation is very interesting for the following reasons:
-- **Flexibility**: If you define clearly enough what you are evaluating, you can get scores for about anything!
-- **Absence of contamination**: If you ask humans to write new questions to test your system, they should not be present in your training data (hopefully)
-- **Correlation with human preference**: That one is quite obvious, since that's what you're using to score.
-  *Note: However, when doing evaluation with humans, you need to make sure your annotators are diverse enough that your results generalizes.*
-However, it also present a number of limitations:
-- **First impressions bias**: Human evaluators tend to estimate the quality of answers [based on first impressions](https://arxiv.org/abs/2309.16349), instead of actual factuality or faithfulness.
-- **Tone bias**: Crowdsourced annotators are notably very sensitive to tone, and underestimate the number of factual or logical errors in an assertive answer. In other terms, if a model says wrong things in a confident tone, human evaluators are much less likely to notice it, which could skew ratings towards the more assertive models. (Expert annotators are less likely to fall prey to these biases.)
-- **Self-preference bias**: Humans are [most likely to prefer answers which appeal to their views or align with their opinions or errors](https://arxiv.org/abs/2310.13548), rather than answers which are factually correct.
-- **Identity bias**: People with different identities tend to have different values, and rate model answers very differently (for example on [toxicity](https://arxiv.org/abs/2205.00501))
-<Note title="Critical human evaluation biases" emoji="⚠️" variant="warning">
-Human evaluators have significant biases:
-- **First impressions**: Judge on presentation over factuality
-- **Tone**: Confident incorrect answers score higher than hesitant correct ones
-- **Self-preference**: Prefer answers aligning with their views over factually correct ones
-- **Identity**: Different demographics rate identical content very differently
-Expert annotators are less susceptible, but these biases affect crowdsourced evaluation significantly.
-</Note>
-### Systematic human evaluation
-Pros of systematic human evaluations, especially with paid annotators, are
-- **Getting high quality data** adapted to your use case, that you will be able to build on later (if you need to develop preference models for example)
-- **Data privacy**: If you rely on paid human annotators, especially if in-house, your datasets should be relatively safe, whereas using LLM-evalution with closed source API models presents less guarantee on what happens to your data, since you send it to an external service.
-- **Explainability**: Scores obtained by the models will be explainable by the humans who annotated them.
-Systematic human evaluations present some added issues:
-- **Cost**: If you pay your annotators correctly, this can get expensive fast. It's also likely you'll need rounds of iterative evaluation so that you can refine your guidelines, which adds to the cost.
-- **Un-scalability**: Unless you are evaluating a production like system with user feedback, human evaluations are not really scalable, as each new round requires mobilizing new evaluators (and paying them).
-- **Lack of reproducibility**: Unless you keep the exact same annotators continuously and your guidelines are perfectly unambiguous, it's likely some evaluations are going to be hard to reproduce precisely.
-### Casual human evaluation
-Pros of casual human evaluations are:
-- **Lesser cost**: since you rely on your crowd's good will
-- **Edge case discovery**: since you leverage user's creativity in a mostly unbounded manner, you can discover interesting edge cases
-- **Better scalability**: as long as you have many interested and willing participants, casual human evaluation scales better and has a lower entry cost
-The obvious problems of casual approaches (without annotator selection) are:
-- **High subjectivity**: it's hard to enforce a consistent grading from many community members using broad guidelines, especially since annotators preferences tend to be [culturally bound](https://arxiv.org/abs/2404.16019v1). One can hope that these effect is smoothed over by the sheer scale of the votes, through a "wisdom of the crowd" effect (see Galton's wikipedia page).
-<Sidenote>
-The "wisdom of the crowd" effect (discovered by statistician Galton) suggests individual biases cancel out at scale. However, this requires truly diverse crowds—tech forums skew heavily toward young western men, potentially undermining this effect.
-</Sidenote>
-- **Unrepresentative preference ranking**: since young western men are over re-represented on tech-sides of the internet, it can lead to very skewed preferences, mismatched to those of the general population, both in terms of topics explored and overall rankings.
-- **Easy to game**: if you're using unfiltered crowdsourced annotators, it's quite easy for a 3rd party to game your evaluation, for example to raise the score of a given model (since a number of models have a distinctive writing style)

app/src/content/chapters/human-evaluation/tips-and-tricks.mdx DELETED Viewed

@@ -1,49 +0,0 @@
----
-title: "Human Evaluation: Tips and tricks"
----
-import Note from "../../../components/Note.astro";
-import Sidenote from "../../../components/Sidenote.astro";
-### Tips and tricks
-Here are a few practical tips you might want consider when using human annotators to build an evaluation dataset. If you haven't done so yet, we recommend reading first the page on "Using human annotators" and then come back to this page.
-### Designing the task
-- **Simple is better**: Annotation tasks can get unnecessarily complex, so keep it as simple as possible. Keeping the cognitive load of the annotators to a minimum will help you ensure that they stay focused and make annotations of a higher quality.
-- **Check what you show**: Only show the necessary information for annotators to complete the task and make sure you don't include anything that could introduce extra bias.
-- **Consider your annotators time**: Where and how things are displayed can introduce extra work or cognitive load and therefore negatively impact in the quality of results. For example, make sure that the texts and the task are visible together and avoid unnecessary scrolling. If you combine tasks and the result of one informs the other, you can display them sequentially. Think about how everything is displayed in your annotation tool and see if there's any way you can simplify even more.
-- **Test the setup**: Once you have your task designed and some guidelines in place, make sure you test it yourself on a few samples before involving the whole team, and iterate as needed.
-### During the annotation
-- **Annotators should work independently**: It's better if annotators don't help each other or see each other's work during the task, as they can propagate their own biases and cause annotation drift. Alignment should always happen through comprehensive guidelines. You may want to train any new team members first on a separate dataset and/or use inter-annotator agreement metrics to make sure the team is aligned.
-<Note title="Prevent annotation drift" emoji="🎯" variant="info">
-Annotators must work independently. Collaboration can propagate individual biases and cause "annotation drift" where the team gradually diverges from guidelines. Alignment should happen only through comprehensive written guidelines.
-</Note>
-- **Consistency is key**: If you make important changes to your guidelines (e.g., changed a definition or instruction, or have added/removed labels), consider if you need to iterate over the annotated data. At least, you should track the changes in your dataset through a metadata value like `guidelines-v1`.
-### Hybrid human-machine annotation
-Sometimes teams face contraints on time and resources but don't want to sacrifice on the pros of human evaluation. In these cases, you may use the help of models to make the task more efficient.
-- **Model-aided annotation**: You may use the predictions or generations of a model as pre-annotations, so that the annotation team doesn't need to start from scratch. Just note that this could introduce the model's biases into human annotations, and that if the model's accuracy is poor it may increase work for annotators.
-<Sidenote>
-Model-aided annotation (using predictions as pre-annotations) can speed up work but introduces model biases into human annotations. If model accuracy is poor, fixing errors may take longer than annotating from scratch.
-</Sidenote>
-- **Supervise model as a judge**: You can combine the power of the model as a judge methodology (see the section on "Model as a judge") and human supervisors who validate or discard the results. Note that the biases discussed in the "Pros and cons of human evaluation" will apply here.
-- **Idenfity edge cases**: For an even faster task, use a jury of models and then have your human supervisor(s) step in where models disagree or there's a tie to break. Again, be aware of the biases discussed in the "Pros and cons of human evaluation".
-### End to end tutorial
-To build you own custom evaluation setup following these tips, you can follow this [practical tutorial](https://github.com/argilla-io/argilla-cookbook/tree/main/domain-eval) from Argilla. It guides you through building a custom evaluation task for your domain, using synthetic data and manual evaluation with [Argilla](https://github.com/argilla-io/argilla/) and [distilabel](https://github.com/argilla-io/distilabel). The guide starts from domain documents and results in a custom evaluation task that you can use to evaluate your model with [lighteval](https://github.com/huggingface/lighteval).

app/src/content/chapters/human-evaluation/using-human-annotators.mdx CHANGED Viewed

@@ -6,8 +6,9 @@ import bestAnnotationPractices from '../../assets/image/best_annotation_practice
 import Image from '../../../components/Image.astro';
 import Note from "../../../components/Note.astro";
 import Sidenote from "../../../components/Sidenote.astro";
-### Using human annotators
 I suggest reading Section 3 of this [review](https://aclanthology.org/2024.cl-3.1/) of good practices in data annotation quality. If you want production level quality and have the means to implement all of these methods, go ahead!
@@ -49,8 +50,36 @@ Be ready to try several rounds of annotations, as your annotators will misunders
 You want to control answers (notably via inter-annotator agreement if you can get it) and do a final selection to keep only the highest quality/most relevant answers.
 Specialized tools to build annotated high quality datasets like [Argilla](https://argilla.io/) can also help you.
-### Going further
 - ⭐ [How to set up your own annotator platform in a couple minutes](https://huggingface.co/learn/cookbook/enterprise_cookbook_argilla), by Moritz Laurer. A good read to get some hands on experience using open source tools (like Argilla and Hugging Face), and understanding better the dos and don'ts of human annotation at scale.
 - ⭐ [A guide on annotation good practices](https://aclanthology.org/2024.cl-3.1/). It's a review of all papers about human annotation dating from 2023, and it is very complete. Slightly dense, but very understandable.
 - [Another guide on annotation good practices](https://scale.com/guides/data-labeling-annotation-guide), by ScaleAI, specialised in human evaluations. Its a more lightweigth complement to the above document.
 - [Assumptions and Challenges of Capturing Human Labels](https://aclanthology.org/2024.naacl-long.126/) is a paper on how to look at source of annotator disagreement and mitigate them in practice

 import Image from '../../../components/Image.astro';
 import Note from "../../../components/Note.astro";
 import Sidenote from "../../../components/Sidenote.astro";
+import Accordion from "../../../components/Accordion.astro";
+#### Using human annotators
 I suggest reading Section 3 of this [review](https://aclanthology.org/2024.cl-3.1/) of good practices in data annotation quality. If you want production level quality and have the means to implement all of these methods, go ahead!
 You want to control answers (notably via inter-annotator agreement if you can get it) and do a final selection to keep only the highest quality/most relevant answers.
 Specialized tools to build annotated high quality datasets like [Argilla](https://argilla.io/) can also help you.
+<Note title="Going further">
 - ⭐ [How to set up your own annotator platform in a couple minutes](https://huggingface.co/learn/cookbook/enterprise_cookbook_argilla), by Moritz Laurer. A good read to get some hands on experience using open source tools (like Argilla and Hugging Face), and understanding better the dos and don'ts of human annotation at scale.
 - ⭐ [A guide on annotation good practices](https://aclanthology.org/2024.cl-3.1/). It's a review of all papers about human annotation dating from 2023, and it is very complete. Slightly dense, but very understandable.
 - [Another guide on annotation good practices](https://scale.com/guides/data-labeling-annotation-guide), by ScaleAI, specialised in human evaluations. Its a more lightweigth complement to the above document.
 - [Assumptions and Challenges of Capturing Human Labels](https://aclanthology.org/2024.naacl-long.126/) is a paper on how to look at source of annotator disagreement and mitigate them in practice
+</Note>
+<Accordion title="Practical tips and tricks">
+Here are a few practical tips you might want consider when using human annotators to build an evaluation dataset.
+**Designing the task**
+- **Simple is better**: Annotation tasks can get unnecessarily complex, so keep it as simple as possible. Keeping the cognitive load of the annotators to a minimum will help you ensure that they stay focused and make annotations of a higher quality.
+- **Check what you show**: Only show the necessary information for annotators to complete the task and make sure you don't include anything that could introduce extra bias.
+- **Consider your annotators time**: Where and how things are displayed can introduce extra work or cognitive load and therefore negatively impact in the quality of results. For example, make sure that the texts and the task are visible together and avoid unnecessary scrolling. If you combine tasks and the result of one informs the other, you can display them sequentially. Think about how everything is displayed in your annotation tool and see if there's any way you can simplify even more.
+- **Test the setup**: Once you have your task designed and some guidelines in place, make sure you test it yourself on a few samples before involving the whole team, and iterate as needed.
+**During the annotation**
+- **Annotators should work independently**: It's better if annotators don't help each other or see each other's work during the task, as they can propagate their own biases and cause annotation drift. Alignment should always happen through comprehensive guidelines. You may want to train any new team members first on a separate dataset and/or use inter-annotator agreement metrics to make sure the team is aligned.
+- **Consistency is key**: If you make important changes to your guidelines (e.g., changed a definition or instruction, or have added/removed labels), consider if you need to iterate over the annotated data. At least, you should track the changes in your dataset through a metadata value like `guidelines-v1`.
+**Hybrid human-machine annotation**
+Sometimes teams face contraints on time and resources but don't want to sacrifice on the pros of human evaluation. In these cases, you may use the help of models to make the task more efficient.
+- **Model-aided annotation**: You may use the predictions or generations of a model as pre-annotations, so that the annotation team doesn't need to start from scratch. Just note that this could introduce the model's biases into human annotations, and that if the model's accuracy is poor it may increase work for annotators.
+- **Supervise model as a judge**: You can combine the power of the model as a judge methodology (see the section on "Model as a judge") and human supervisors who validate or discard the results. Note that the biases discussed in the "Pros and cons of human evaluation" will apply here.
+- **Idenfity edge cases**: For an even faster task, use a jury of models and then have your human supervisor(s) step in where models disagree or there's a tie to break. Again, be aware of the biases discussed in the "Pros and cons of human evaluation".
+</Accordion>

app/src/content/chapters/model-as-a-judge/basics.mdx DELETED Viewed

@@ -1,50 +0,0 @@
----
-title: "Model as a Judge: Basics"
----
-import Note from "../../../components/Note.astro";
-import Sidenote from "../../../components/Sidenote.astro";
-Judge models are simply **neural network used to evaluate the output of other neural networks**. In most cases, they evaluate text generations.
-Judge models range from small specialized classifiers (think "spam filter", but for toxicity for example) to LLMs, either large and generalist or small and specialized. In the latter case, when using an LLM as a judge, you give it a prompt to explain how to score models (ex: `Score the fluency from 0 to 5, 0 being completely un-understandable, ...`).
-Model as judges allow to score text on complex and nuanced properties.
-For example, an exact match between a prediction and reference can allow you to test if a model predicted the correct fact or number, but assessing more open-ended empirical capabilities (like fluency, poetry quality, or faithfulness to an input) requires more complex evaluators.
-That's where models as judges come into play.
-They are used on 3 main tasks:
-- *Scoring a model generation*, on a provided scale, to assess a property of the text (fluency, toxicity, coherence, persuasiveness, etc).
-- *Pairwise scoring*: comparing a pair model outputs to pick the best text with respect to a given property
-- *Computing the similarity* between a model output and a reference
-*Note: In this document, I'll focus on the LLMs + prompt approach for now, but you should definitely check out how classifier judges work, as I think it can be fairly robust and well adapted to a number of use cases, and the recently introduced and promising reward model as judge approach (introduced in [this tech report](https://research.nvidia.com/publication/2024-06_nemotron-4-340b), and on which we have a small page [here](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/model-as-a-judge/what-about-reward-models.md))*
-### Pros and cons of using judge-LLMs
-People in favor of judge LLMs have been claiming they provide better:
-- **Objectivity** when compared to humans: They automate empirical judgments in an objective and reproducible manner (theoretically - in my opinion, they add more subtle bias than they are worth)
-- **Scale and reproducibility**: They are more scalable than human annotators, which allows to reproduce scoring on large amounts of data (if you control for temperature).
-- **Cost**: They are cheap to instantiate, as they don't require to train a new model, and can just rely on good prompting and an existing high quality LLM. They are also cheaper than paying actual human annotators (capitalism...).
-- **Alignment with human judgments**: They are somehow correlated with human judgments.
-In my opinion, using LLM judges correctly is extremely tricky, and it's easy to be deceived for critical use cases:
-- LLM as judges seem objective, but they have many **hidden biases** that can be harder to detect than the ones in humans, since we're not as actively looking for them (see [model-as-a-judge/Tips and tricks]). Besides, there are ways to reduce human bias by designing survey questions in specific and statistically robust ways (which has been studied in sociology for about a century), where LLM-prompting is not as robust yet. Using LLMs to evaluate LLMs has been compared to creating an echo-chamber effect, by reinforcing biases subtly.
-- They are indeed scalable, but contribute to creating massive amounts of data which themselves need to be examined to ensure their quality (for example, you can improve the quality of LLM-judges by asking them to generate a thinking trace, or reasoning around their data, which makes even more new artificial data to analyse)
-- They are indeed cheap to instantiate, but paying actual expert human annotators is likely to give you qualitatively better results for your specific use cases.
-<Note title="Critical limitations of LLM judges" emoji="⚠️" variant="warning">
-Using LLM judges is extremely tricky:
-- **Hidden biases**: Harder to detect than human biases; creates echo-chamber effects
-- **Data overload**: Generates massive synthetic data needing quality examination
-- **False objectivity**: Seems objective but reinforces subtle biases
-- **Expert humans better**: For critical use cases, expert annotators provide higher quality
-See [Tips and tricks](./tips-and-tricks) for bias mitigation strategies.
-</Note>
-This section is a bit long, because you need to be well aware of their limitations: a lot of people are blindly jumping into using model judges because they seem easier, but then end up with uninsterpretable data with tricky bias to extract.
-If you want to give it a go, I suggest first reading this [very good guide](https://huggingface.co/learn/cookbook/en/llm_judge) (⭐) by Aymeric Roucher on how to setup your first LLM as judge!
-You can also try the [distilabel](https://distilabel.argilla.io/latest/) library, which allows you to generate synthetic data and update it using LLMs. They have a nice [tutorial](https://distilabel.argilla.io/latest/sections/pipeline_samples/papers/ultrafeedback/) applying the methodology of the [Ultrafeedback paper](https://arxiv.org/abs/2310.01377) as well as a [tutorial on benchmarking](https://distilabel.argilla.io/latest/sections/pipeline_samples/examples/benchmarking_with_distilabel/) implementing the Arena Hard benchmark.

app/src/content/chapters/model-as-a-judge/designing-your-evaluation-prompt.mdx DELETED Viewed

@@ -1,81 +0,0 @@
----
-title: "Designing your evaluation prompt"
----
-import Note from "../../../components/Note.astro";
-import Sidenote from "../../../components/Sidenote.astro";
-### Designing your evaluation prompt
-Once you've selected your model, you need to define what is the best possible prompt for your task.
-Some general guidelines I've come across online when designing the prompt itself are:
-- Provide a clear description of the task at hand:
-	- `Your task is to do X`.
-	- `You will be provided with Y`.
-- Provide clear instructions on the evaluation criteria, including a detailed scoring system if needed:
-	- `You should evaluate property Z on a scale of 1 - 5, where 1 means ...`
-	- `You should evaluate if property Z is present in the sample Y. Property Z is present if ...`
-- Provide some additional "reasoning" evaluation steps:
-	- `To judge this task, you must first make sure to read sample Y carefully to identify ..., then ...`
-- Specify the desired output format (adding fields will help consistency)
-	- `Your answer should be provided in JSON, with the following format {"Score": Your score, "Reasoning": The reasoning which led you to this score}`
-<Note title="Core prompt design principles" emoji="📝" variant="info">
-**Essential elements for effective judge prompts:**
-- **Clear task description**: Specify exactly what the judge needs to do
-- **Detailed criteria**: Provide explicit scoring scales with clear definitions
-- **Reasoning steps**: Guide the judge through the evaluation process
-- **Structured output**: Use JSON format for consistency and parsability
-</Note>
-You can and should take inspiration from [MixEval](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended/mix_eval/judge_prompts.pyy) or [MTBench](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended/mt_bench/judge_prompt_templates.py) prompt templates.
-Other tidbits:
-- Pairwise comparison [correlates better with human preference](https://arxiv.org/abs/2403.16950) than scoring, and is more robust generally.
-- If you really want a score, use an integer scale make sure you provide a detailed explanation for what [each score represents](https://x.com/seungonekim/status/1749289437165769177), or an additive prompt (`provide 1 point for this characteristic of the answer, 1 additional point if ...` etc)
-- Using one prompt per capability to score tends to give better and more robust results
-<Sidenote>
-Pairwise comparison consistently outperforms absolute scoring for judging model outputs. It correlates better with human preferences and is less sensitive to judge biases and scale interpretation issues.
-</Sidenote>
-You can also improve accuracy using the following, possibly more costly, techniques:
-- **Few shot examples**: like in many other tasks, if you provide examples it can help its reasoning. However, this adds to your context length.
-- **Reference**: you can also enhance your prompt with a reference if present, which increases accuracy
-- **CoT**: [improves accuracy](https://arxiv.org/abs/2212.08073), if you ask the model to output its chain of thought **before** the score (also observed [here](https://x.com/seungonekim/status/1749289437165769177))
-- **Multiturn analysis**: can improve [factual error detection](https://arxiv.org/abs/2305.13281)
-- Using **a jury** (many judges, where you pick an aggregate of the answers): [gives better results](https://arxiv.org/abs/2404.18796) than using a single model.
-	- It can be made considerably less costly by leveraging many smaller models instead of one big expensive model.
-	- You can also experiment with using one model with variations on temperature
-- Surprisingly, the community has found that adding stakes to the prompts (`answer correctly and you'll get a kitten`) can increase correctness. Your mileage may vary on this one, adapt to your needs.
-<Note title="Advanced techniques to improve accuracy" emoji="⚡" variant="success">
-**More sophisticated but effective approaches:**
-- **Chain-of-Thought (CoT)**: Ask for reasoning BEFORE the score
-- **Judge jury**: Multiple judges with aggregated results (can use smaller models to reduce cost)
-- **Few-shot examples**: Provide examples, though this increases context length
-- **Reference answers**: Include reference material to improve accuracy
-- **Multi-turn analysis**: Better for detecting factual errors
-</Note>
-Note on prompting: Depending on the stakes of your use case, to remove as much bias as possible, you would want to look at work done in sociology on how to design good surveys. If you treat your evaluator as a replacement for a human annotator, then you need to look at similar metrics: computing inter-annotator agreement, using correct survey design methodology to mitigate bias, etc.
-<Note title="High-stakes evaluation requires rigor" emoji="⚠️" variant="warning">
-For production or critical use cases, apply rigorous methodologies from sociology:
-- Compute inter-annotator agreement metrics
-- Use proper survey design methodology to mitigate bias
-- Treat the evaluator like a human annotator with similar quality standards
-Quick evaluations with "OK-ish prompts" may suffice for low-stakes exploration, but don't mistake convenience for quality when decisions matter.
-</Note>
-However, most people don't really want a reproducible and high quality unbiased eval, and will be happy with quick and dirty evaluation through OK-ish prompts. (Which is an OK situation to be in! Just depends on the consequences attached).

app/src/content/chapters/model-as-a-judge/evaluating-your-evaluator.mdx DELETED Viewed

@@ -1,61 +0,0 @@
----
-title: "Evaluating your evaluator"
----
-import Note from "../../../components/Note.astro";
-import Sidenote from "../../../components/Sidenote.astro";
-### Evaluating your evaluator
-Before using a judge-LLM in production or at scale, you want to first evaluate its quality for your task, to make sure its scores are actually relevant and useful for you.
-Note: *This will be easier to do if it predicts binary outputs, because you'll be able to interpretable classification metrics (accuracy/recall/precision). If it predicts scores on a scale, it will be much harder to estimate the quality of the correlation with a reference.*
-<Sidenote>
-Binary outputs (yes/no, pass/fail) are much easier to evaluate than continuous scores. You can use clear metrics like accuracy, precision, and recall. Continuous scores require correlation analysis which is harder to interpret.
-</Sidenote>
-So, once you have selected your model judge and its prompt, you'll need to do the following.
-1. **Pick your baseline**
-You'll need to compare your evaluator judgments to a baseline: it can be human annotations, the output of another judge model that you know is qualitative on your task, a gold truth, itself with another prompt, etc.
-You don't necessarily need a lot of examples (50 can be enough), but you need them to be extremely representative of your task, discriminative (representative of edge cases notably), and of as high quality as you can manage.
-<Note title="Quality over quantity for baseline" emoji="🎯" variant="info">
-You don't need many baseline examples (50 can suffice), but they must be:
-- **Representative**: Cover the full range of your task
-- **Discriminative**: Include edge cases and challenging examples
-- **High quality**: Use the best reference data you can obtain
-</Note>
-2. **Pick your metric**
-Your metric will be used to compare your judge's evaluations with your reference.
-In general, this comparison is considerably easier to do if your model is predicting binary classes or doing pairwise comparison, as you'll be able to compute accuracy (for pairwise comparison), or precision and recall (for binary classes), which are all very easy to interpret metrics.
-Comparing the correlation of scores with human or model scoring will be harder to do. To understand why in more detail, I advise you to read this cool [blog section on the topic](https://eugeneyan.com/writing/llm-evaluators/#key-considerations-before-adopting-an-llm-evaluator).
-In general, if you're a bit lost about what metrics to pick when (in terms of models, metrics, ...), you can also look at [this interesting graph](https://eugeneyan.com/assets/llm-eval-tree.jpg) from [the same above blog](https://eugeneyan.com/writing/llm-evaluators/) ⭐.
-3. **Evaluate your evaluator**
-For this step, you simply need to use your model and its prompt to evaluate your test samples! Then, once you get the evaluations, use your above metric and reference to compute a score for your evaluations.
-You need to decide what your threshold for acceptance is. Depending on how hard your task is, you can aim for 80% to 95% accuracy, if you're doing pairwise comparison. Regarding correlations (if you're using scores), people in the literature tend to seem happy with 0.8 Pearson correlation with a reference. However, I've seen some papers declare that 0.3 indicates a good correlation with human annotators (^^") so ymmv.
-<Note title="Acceptance thresholds vary widely" emoji="📊" variant="warning">
-**Realistic thresholds for judge quality:**
-- **Pairwise comparison**: Aim for 80-95% accuracy depending on task difficulty
-- **Score correlation**: 0.8 Pearson correlation is considered good, but some papers claim 0.3 is acceptable
-The wide range in reported "acceptable" correlations (0.3 to 0.8) suggests you should carefully set your own thresholds based on your specific use case requirements.
-</Note>

app/src/content/chapters/model-as-a-judge/getting-a-judge-llm.mdx DELETED Viewed

@@ -1,78 +0,0 @@
----
-title: "Getting a Judge-LLM"
----
-import Note from "../../../components/Note.astro";
-import Sidenote from "../../../components/Sidenote.astro";
-### Getting a Judge-Model
-When using an existing LLM, you can go for [generalist, high capability models](https://arxiv.org/abs/2306.05685v4),  using [small specialist models](https://arxiv.org/abs/2405.01535) trained specifically to discriminate from preference data, or training your own.
-#### Using a generalist LLM
-With the introduction of more capable LLMs (such as ChatGPT), some researchers started exploring using big models as judges. The best current big model judges tend to be closed source models (like Claude or gpt-o models) though the gap with open source is closing very fast thanks to high quality models such as [Qwen 2.5](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e), [Command R+](https://huggingface.co/CohereForAI/c4ai-command-r-plus-08-2024) or [Llama 3.1-405-Instruct](meta-llama/Llama-3.1-405B-Instruct).
-Closed source models, despite their performance, present the multiple disadvantages of being:
-- under APIs, which mean that models (therefore results) can change with no notice, hurting the reproducibility of evals
-- black boxes, which makes them un-interpretable
-- possible sources of data leakage/lack of data privacy, as you send your data to a third party through the internet (which tends to be less safe than locally managed data), and you don't know for certain what is done with it (you often need to opt out of it being used in training sets).
-<Note title="Closed vs open source judge models" emoji="⚖️" variant="warning">
-**Closed source models (Claude, GPT-o) tradeoffs:**
-Disadvantages:
-- **Non-reproducible**: Models can change without notice via API updates
-- **Black box**: Un-interpretable decision-making
-- **Privacy risks**: Data sent to third parties, potential leakage
-Advantages:
-- Easy access without local setup or hardware requirements
-**Open source models are closing the gap** while solving reproducibility and interpretability issues. Models like Qwen 2.5, Command R+, and Llama 3.1-405-Instruct are now competitive alternatives.
-</Note>
-However, they also allow anyone to have access to a high quality model without needing to setup things locally or requiring access to hardware. This pros are now also present for most high quality open models, which are accessible through model providers, and solve the first 2 problems above.
-You'll find a good cost analysis of model providers [here](https://huggingface.co/spaces/ArtificialAnalysis/LLM-Performance-Leaderboard) if you need help picking one.
-#### Using a tiny specialized LLM judge model
-You can also make the choice to use tiny specialized LLM judges. With often a couple billion parameters, they can run locally on most recent consumer hardware, while being trained from scratch or fine-tuned using instruction data. You often need to follow their specific prompt formats.
-<Sidenote>
-Tiny specialized judge models (3-13B parameters) can run on consumer hardware while being trained specifically for evaluation tasks. They require following specific prompt formats but offer local deployment and fast inference.
-</Sidenote>
-Some existing models:
-- Flow-Judge-v0.1 ([weights](https://huggingface.co/collections/flowaicom/flow-judge-v01-66e6af5fc3b3a128bde07dec)), 3.8B parameters, a Phi-3.5-mini-instruct fine-tuned on a synthetic preference dataset
-- Prometheus ([weights](https://huggingface.co/prometheus-eval/prometheus-13b-v1.0), [paper](https://arxiv.org/abs/2310.08491)), 13B parameters, a model trained from scratch on synthetic preference dataset. A 7B parameter [v2](https://huggingface.co/prometheus-eval/prometheus-7b-v2.0) also exists, a Mistral-7B-Instruct-v0.2 fine-tune on a bigger synthetic preference dataset, with added weight merging
-- JudgeLM ([paper](https://arxiv.org/abs/2310.17631)), 7B to 33B parameters, models trained from scratch on synthetic preference datasets generated with a variety of models.
-#### Training your own
-You can also make the choice to train or fine-tune your own LLM-as-judge.
-You first need to gather preference data for your task of interest, which can come
-- From existing [human preference datasets](https://www.kaggle.com/competitions/lmsys-chatbot-arena)
-- From model generated preference data (which you can generate following the above tiny-model judges papers data sections, or get directly, for example from the Prometheus [preference](https://huggingface.co/datasets/prometheus-eval/Preference-Collection) and [feedback](https://huggingface.co/datasets/prometheus-eval/Feedback-Collection) collections).
-Then you need to decide whether to start from a small model to train from scratch, or from an existing model, that you can
-- distill into a new smaller model
-- quantize.
-- then fine-tune (using peft or adapter weights if the model is big and your training compute low) using the above data
-	- apparently [starting from a reward model works better than from an instruct model](https://x.com/dk21/status/1826292289930674590)
-<Note title="Training your own judge model" emoji="🔧" variant="info">
-**Key steps for custom judge training:**
-1. **Gather preference data**: Use human preference datasets or synthetic data from other models
-2. **Choose starting point**: Train from scratch, distill from larger model, or fine-tune existing model
-3. **Optimize for compute**: Use PEFT/adapter weights for efficient training on limited hardware
-4. **Pro tip**: Starting from a reward model reportedly works better than starting from an instruct model
-</Note>

app/src/content/chapters/model-as-a-judge/tips-and-tricks.mdx DELETED Viewed

@@ -1,51 +0,0 @@
----
-title: "Model as a Judge: Tips and tricks"
----
-import Note from "../../../components/Note.astro";
-import Sidenote from "../../../components/Sidenote.astro";
-### Tips and tricks
-**Mitigating well known biases of LLM as judges**
-<Note title="Known LLM judge biases and mitigations" emoji="⚠️" variant="warning">
-- **Lack of internal consistency**: Different judgments at temperature > 0
-  - Mitigation: Self-consistency prompting (multiple runs, majority vote)
-- **Self-preference**: [Favor own outputs](https://arxiv.org/abs/2404.13076)
-  - Mitigation: Use judge jury
-- **Blindness to perturbation**: Can't identify [perturbed input](https://arxiv.org/abs/2406.13439)
-  - Mitigation: Chain-of-thought before scoring, coherent grading scale
-- **Position bias**: [Favor specific positions](https://arxiv.org/abs/2306.05685)
-  - Mitigation: Random position switching, log-probability normalization
-- **Verbosity bias**: Prefer verbose answers
-  - Mitigation: [Account for length differences](https://arxiv.org/abs/2404.04475)
-- **Format bias**: Fail when format differs from training
-  - Mitigation: Match training prompt format
-</Note>
-- **Lack of internal consistency**: a judge might give you different judgments if you prompt it several times (if the temperature is not 0)
-	- You can mitigate this by doing self-consistency prompting of your judge, prompting it multiple times and keeping the majority output
-- **Self-preference**: they tend to [favor their own outputs](https://arxiv.org/abs/2404.13076) when scoring answers
-	- You can mitigate this by using a jury
-- **Blindness to input perturbation**: models are bad at identifying [perturbated input](https://arxiv.org/abs/2406.13439) and tangentially [bad at providing consistent score ranges](https://twitter.com/aparnadhinak/status/1748368364395721128) (extended experiments on this [here](https://github.com/LeonEricsson/llmjudge/blob/main/README.md)). For example, if asked to rank text quality on text where noise has been added on a consistent scale, the grades predicted do not reflect this scale.
-	- You can mitigate this by
-		- asking the model to explain its reasoning [before providing a score](https://twitter.com/seungonekim/status/1749289437165769177)
-		- providing a coherent grading scale in the prompt.
-- **Position-bias**: they tend to [favor specific answer positions](https://arxiv.org/abs/2306.05685). For example, when presented with pairwise comparisons, Claude and GPT3.5 tend to quite systematically prefer the first choice, or the second choice
-	- You can mitigate this by
-		- switching answer positions randomly
-		- computing the log-probabilities of all possible choices to get a normalized answer
-- **Verbosity-bias** (or length-bias): they tend to like more verbose answers
-	- You can mitigate this by [accounting for the answer difference in length](https://arxiv.org/abs/2404.04475)
-- **Debatable consistency [with human answers](https://arxiv.org/abs/2308.15812):**
-	- However, it's also [debatable if non-expert humans are a good baseline for absolutely all evaluations](https://arxiv.org/abs/2202.06935). For some specific domains (medical, legal, mathematics, etc), relying on non-expert human annotators is as bad a baseline as using an LLM directly.
-- **Format bias**: they tend to fail to evaluate accurately if the prompt format [is too far away](https://arxiv.org/abs/2310.17631) from what it's been trained with. For example, a model trained to do pairwise comparison with an added reference answer will fail if said answer is not provided, and failures will also occur the other way around.
-	- You can mitigate this by paying attention to the training prompt format (if the model was instruction tuned) and ensuring you follow it.
-**Picking correct tasks for an LLM judge**
-LLM evaluators:
-- are **bad at identifying hallucinations** in general, particularly what are called partial hallucinations (which look close to the ground truth but are actually slightly different) (see [this](https://arxiv.org/abs/2305.11747) and [this](https://arxiv.org/abs/2303.08896))
-- have a low to OK-ish correlation with human annotators on [summarization](https://arxiv.org/abs/2304.02554) ([here too](https://arxiv.org/abs/2303.16634)), [faithfulness](https://arxiv.org/abs/2307.16877), and are not consistently correlated with human judgement more broadly against [a scope of tasks](https://arxiv.org/abs/2406.18403)

app/src/content/chapters/model-as-a-judge/what-about-reward-models.mdx DELETED Viewed

@@ -1,85 +0,0 @@
----
-title: "What about Reward Models?"
----
-import Note from "../../../components/Note.astro";
-import Sidenote from "../../../components/Sidenote.astro";
-### What about Reward Models?
-Reward models learn to predict a score from human annotations for given prompt/completion pairs. The end goal is for them to do predictions aligned with human preference.
-Once trained, these models can then be used to improve other models, by acting as a a reward function which is a proxy for human judgment.
-The most common type of reward model is the Bradley-Terry model, which outputs a single **pairwise score**, following:
-$$p(\text{completion b is better than completion a}) = \text{sigmoid}(\text{score}_b - \text{score}_a)$$
-This model is trained using only pairwise comparisons of completions, which are easier to collect than scores, but can only compare several completions for one prompt, and not completions across prompts.
-Other models have expanded on this approach to predict a more nuanced probability that a completion is better than the other one ([example](https://huggingface.co/RLHFlow/pair-preference-model-LLaMA3-8B)).
-This allows them to (theoretically) judge subtle differences between completions, at the cost of not being able to easily save and compare many different scores across prompts for the same test set. In addition, context length and memory limits can become an issue when comparing too long completions.
-<Note title="Types of reward models" emoji="📊" variant="info">
-**Three main approaches:**
-- **Pairwise (Bradley-Terry)**: Most common. Compares two completions for same prompt. Easier to train (pairwise comparisons) but can't compare across different prompts.
-- **Absolute scores** (e.g., SteerLM): Direct evaluation without comparison. Easier to use but harder to collect training data (absolute scores less stable in human preferences).
-- **Hybrid models** (HelpSteer2, ArmoRM): Output both absolute and relative scores for maximum flexibility.
-</Note>
-Some reward models such as [SteerLM](https://arxiv.org/abs/2311.09528) output **absolute scores**, which can be used to evaluate completions directly without the need for pairwise comparisions. These models can be easier to use for evaluation, but are also harder to collect data for, as absolute scores tend to be less stable than pairwise scores in human preferences.
-More recently, models have been proposed that output both absolute and relative scores, such as [HelpSteer2-Preference](https://arxiv.org/abs/2410.01257) and [ArmoRM](https://arxiv.org/abs/2406.12845).
-#### How do I use a Reward Model for Evaluation?
-Given a dataset of prompts, we can generate completions from a language model and ask a reward model to score them.
-For models that give absolute scores, the resulting scores can be averaged to get a reasonable summary score.
-However, in the more common case of relative scores, the average reward can be biased by outliers (a few very good or very bad completions) as different prompts may have inherently different reward scales (some prompts are way harder or easier than others).
-<Sidenote>
-For relative scores, don't just average raw rewards—outliers and varying prompt difficulty scales will bias results. Use win rates or win probabilities against a reference instead.
-</Sidenote>
-Instead, we can use
-- win rates: take a reference set of completions and calculate the percentage of completions from the model that are ranked higher than the reference completions. It is slightly more granular.
-- win probabilities: the mean probability of the completions being better than the reference completions, which can give a more fine-grained and smoothly changing signal.
-#### Pros and Cons of Reward Models
-Reward models are typically:
-- **Very fast**: Getting a score is as simple as running a forward pass of a relatively small model once (since we only get a score, and not long text, contrary to judge-LLMs)
-- **Deterministic**: The same scores will be reproduced through the same forward pass
-- **Unlikely to suffer from positional bias**: As most models take only one completion, they can not be influenced by the order. For pairwise models, positional bias is often also minimal, as long as the training data was balanced with respect to containing both first and second answers as being the best.
-- **Require no prompt engineering**: since the model will simply output a score from one or two completions depending on preference data it's been trained on.
-On the other hand they:
-- **Require specific fine-tuning**: This can be a relatively costly step, and elthough they inherit many capabilities from a base model, they may still perform poorly on tasks that are out of the training distribution.
-- **Loose efficiency when used both in reinforcement learning and evaluation** (or when using direct alignment algorithms on datasets that are similar to the training data of the reward model), as the language model may overfit to the reward model's preferences.
-<Note title="Reward models vs LLM judges" emoji="⚡" variant="success">
-**Reward models excel at:**
-- **Speed**: Single forward pass for a score (no text generation)
-- **Determinism**: Reproducible scores, no temperature variation
-- **No positional bias**: Models trained on balanced data avoid order effects
-- **Zero prompt engineering**: Just pass completions, get scores
-**But beware:**
-- **Require fine-tuning**: Costly setup, may fail out-of-distribution
-- **Overfitting risk**: Language models can learn to game the reward model during RL training
-</Note>
-Some notes:
-- A good place to find high performing models is the [RewardBench Leaderboard](https://huggingface.co/spaces/allenai/reward-bench).
-- You can look at how reward models have been used in the [Nemotron](https://arxiv.org/abs/2406.11704) paper.
-- For reward models that rate single prompts and completions, you can cache the scores of many reference models and easily see how a new model performs.
-- Tracking of win rates or probabilities over training, e.g. as in [this](https://arxiv.org/abs/2410.11677v1) recent paper, can allow you to detect model degradation and select optimal checkpoints.

app/src/content/chapters/troubleshooting/troubleshooting-inference.mdx CHANGED Viewed

@@ -7,6 +7,17 @@ import Sidenote from "../../../components/Sidenote.astro";
 ## Troubleshooting inference
 ### My model is very slow!
 ➡️ Changing the batch size

 ## Troubleshooting inference
+### My results are very bad
+The first thing to do is always to inspect your model generations in detail. Some frequent things to look for when troubleshooting are:
+- too strict model output parsing (before computing the metric) which leads to the answer being lost
+    - Fixing: adapt your parsing
+- unability of the models to follow your output format in few shot (frequent in recent models trained with instructions data, like llama 3.2 or Qwen 2.5)
+    - Fixing: either adapt your prompt format, or just assume that models should be able to follow it in few shot
+- exceedingly verbose model which never gets to the correct answer (more frequent in long context models and something we observed with Qwen and CommandR models)
+    - Fixing: either increase the allowed context length, add instructions to be concise in the task prompt, or just assume that models should be able to answer succinctly
 ### My model is very slow!
 ➡️ Changing the batch size

app/src/content/chapters/troubleshooting/troubleshooting-reproducibility.mdx CHANGED Viewed

@@ -5,12 +5,10 @@ title: "Troubleshooting reproducibility"
 import Note from "../../../components/Note.astro";
 import Sidenote from "../../../components/Sidenote.astro";
-## Troubleshooting reproducibility
 Let's say you have read a recent tech report about a cool new model, and you want to reproduce their results on your machine... but you're not managing to?
 Let's explore why.
-### Different code base
 To reproduce evaluation scores to the decimal point, you first need to make sure you're using exactly the same code base as the paper you want to reproduce.
 Usually, this means either using the evaluation default code as provided by the authors, or a standard implementation in a reference library like Eleuther's AI `lm_eval` or HuggingFace's `lighteval`. However, if the code source for evaluation is not provided, then, I'm sorry for you but it's unlikely that you'll be able to reproduce the results precisely.
@@ -19,7 +17,7 @@ If you want to easily understand what kind of discrepancies happen when using di
 *Note: This is precisely for this reason that a Hugging Face team decided to launch the [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), to get unified and homogeneous comparisons of models scores in order to compare them to internal experiments.*
-### Other subtle ways in which the implementation can be different
 We've observed that the following were easy things to mess up, even when using the same code base:
 - **Different random seeds.**
 	- Normally, inference is less affected by random seeds than training. However, they can still affect some CUDA operations (see the PyTorch page on [reproducibility](https://pytorch.org/docs/stable/notes/randomness.html)) and change predictions if you're using a non greedy generation strategy. They can also affect the prompt if you're using few-shots, and some pre or post-processing functions.
@@ -35,21 +33,22 @@ We've observed that the following were easy things to mess up, even when using t
 	  (The `lm_eval` v2 now includes the normalization name in most metric names.)
 	 -> This is one of the easiest things to mess up, especially for tasks which require a lot of normalization/answer post processing, like math evaluations (where you want to extract the answer from a generated explanation).
-<Note title="Subtle reproducibility pitfalls" emoji="⚠️" variant="warning">
-**Common sources of score differences (even with same codebase):**
-- **Random seeds**: Can affect CUDA ops, sampling strategies, and few-shot prompt selection (multi-point differences possible)
-- **Metric ambiguity**: "Exact match" can mean log-likelihood matching, generative matching, prefix/suffix/quasi-matching—always check the code, not just the name
-- **Hidden normalization**: Predictions may be normalized (punctuation removal, number formatting) before comparison—easy to miss especially in math evals
-**Key lesson**: Never trust metric names alone. Read the actual implementation.
 </Note>
-### Different prompt
 3 main things can come into play for prompt variation.
-### Prompt itself
 The format you are using for the prompt can and will change scores wildly.
 For example, for multichoice question answers, some common formats include very simple variations when presenting the choices, such as:
@@ -68,63 +67,45 @@ Answer:
 ```
 and predicting either `A`/`B`/`C`/`D` or `<Choice A/B/C/D>`.
-These prompts are **semantically equivalent**, as they contain the exact same content - but they can still result in a difference of *several points for the same model*. We did some experiments on this [here](https://x.com/clefourrier/status/1777319187913875893/photo/1) (you'll see up to a 7 points difference for the same model) and a [paper observed similar results](https://arxiv.org/abs/2310.11324).
-Some tasks are also prefixed with a task prompt (eg: `The following questions are about <topic>`) - its presence or absence will also affect the scores.
 <Note title="Prompt format sensitivity" emoji="📝" variant="danger">
-**Semantically identical prompts can cause 7+ point score differences!**
 Even tiny formatting variations (like `A.` vs `(A)` vs just listing choices) significantly impact scores. Models increasingly overfit to specific benchmark prompt formats during training, losing adaptation ability.
 **Real example**: Llama 3.1 models predicted correct MATH-Hard answers but scored poorly because they overfit to GSM8K's prompt format and couldn't adapt to different few-shot templates.
 </Note>
 This [great paper](https://arxiv.org/abs/2407.07890)⭐ also highlights a side effect of this: a number of models are now trained to overfit benchmark prompts and answer formats, to the cost of adaptation to other prompts at evaluation time.
-This is something we observed on the Open LLM Leaderboard 2 for the Llama3.1 models. They were predicting the correct answers to our MATH-Hard evaluations, but were getting low scores, being unable to fit to the template provided in few-shot because they overfit the GSM8K prompt and answer format (another math eval).
-### System prompt and chat template
 Chat models usually have been through instruction/preference training or fine-tuning. During this stage, they have learned to follow specific templates when inferring. For example, templates can require starting rounds of dialogue with a general prompt (called the `system prompt`) prefixed by specific tokens (usually `System: `). Said prompt is here to provide high-level instructions for the model, such as the contents of a persona, or general answering style instructions. Rounds of dialogue can also require adding prefix key words to text, such as `User` for queries and `Assistant` for answers.
 When using few shot, you also need to select if you want examples to be provided multi-turn (mimicking user/assistant turns) or all at once (in a single user prompt).
 Not following the chat template expected by the model at inference will kill its performance, as it will drive its output outside of the probability space it's been converging on.
-### Few-shots samples
-Two things are easy to mess up with few-shot samples (see `general-knowledge/Model inference` if you're unsure what it is).
-Obviously, you need to use the **same number of few-shot samples** as your task of reference.
-However, you also need to use the **exact same samples** as the model you are comparing to, as using different samples will change results (which is not too surprising, if we assume some samples are better at expressing the task than others). More surprising maybe: you not only need to use the exact same samples, but also present them in the **exact same order**. Varying the order on the same samples led us to observe up to 3 points of difference on some subsets of MMLU (you can see [some results here](https://huggingface.co/blog/evaluation-structured-outputs) , it's the third colorgrid).
-This is also a place where paying attention to the random seeds is important.
-### Different generation parameters
-For generative evaluations, parameters to pay attention to are:
-- making sure you are using the **same end of sentence token**
-- making sure you are allowing your model to **generate the same number of tokens** for the evaluation
-- making sure, if using sampling, that you are using the **same seed/temperature parameters**
-### Different model loading
-Some sources of differences that we have observed are:
-- using **different hardware**.
-  Pytorch does not ensure reproducibility of non deterministic operations across hardware
-- using **different libraries**.
-  For example, if you use `transformers` vs `vllm` as your backend for inference, matrix computations are not managed exactly in the same way)
-- using **different batch sizes**.
-  It's been documented in several evaluation libraries and model backends that using different batch sizes will change inference results - if you want fully reproducible evaluations, you should fix the batch size, though it might not always be possible for memory issues
-- using **different loading precision** for your model weights.
-  Using a lower precision can reduce memory and inference costs, but it will also change the numerical results, since you are using different versions of the weights.
-<Note title="Model loading affects reproducibility" emoji="🔧" variant="warning">
-**Four factors that change results even with identical code:**
-- **Hardware**: PyTorch doesn't guarantee reproducibility across different GPUs/hardware
-- **Inference library**: transformers vs vllm handle matrix ops differently
-- **Batch size**: Different batch sizes = different results (fix batch size for reproducibility, though memory may limit this)
-- **Loading precision**: Lower precision (float16 vs float32) changes numerical results
-</Note>

 import Note from "../../../components/Note.astro";
 import Sidenote from "../../../components/Sidenote.astro";
 Let's say you have read a recent tech report about a cool new model, and you want to reproduce their results on your machine... but you're not managing to?
 Let's explore why.
+#### Different code base
 To reproduce evaluation scores to the decimal point, you first need to make sure you're using exactly the same code base as the paper you want to reproduce.
 Usually, this means either using the evaluation default code as provided by the authors, or a standard implementation in a reference library like Eleuther's AI `lm_eval` or HuggingFace's `lighteval`. However, if the code source for evaluation is not provided, then, I'm sorry for you but it's unlikely that you'll be able to reproduce the results precisely.
 *Note: This is precisely for this reason that a Hugging Face team decided to launch the [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), to get unified and homogeneous comparisons of models scores in order to compare them to internal experiments.*
+#### Subtle implementation or loading difference
 We've observed that the following were easy things to mess up, even when using the same code base:
 - **Different random seeds.**
 	- Normally, inference is less affected by random seeds than training. However, they can still affect some CUDA operations (see the PyTorch page on [reproducibility](https://pytorch.org/docs/stable/notes/randomness.html)) and change predictions if you're using a non greedy generation strategy. They can also affect the prompt if you're using few-shots, and some pre or post-processing functions.
 	  (The `lm_eval` v2 now includes the normalization name in most metric names.)
 	 -> This is one of the easiest things to mess up, especially for tasks which require a lot of normalization/answer post processing, like math evaluations (where you want to extract the answer from a generated explanation).
+<Note title="Model loading affects reproducibility" emoji="🔧" variant="warning">
+**Four factors that change results even with identical code:**
+- **Hardware**: PyTorch doesn't guarantee reproducibility across different GPUs/hardware
+- **Inference library**: transformers, vllm and sglang handle batching and matrix operations slightly differently as of 2025
+- **Batch size**: Different batch sizes = different results (you should fix the batch size for reproducibility, though careful about OOM errors)
+- **Loading precision**: Lower precision (especially quantized models vs floating point models) will change numerical results
 </Note>
+#### Different prompt
 3 main things can come into play for prompt variation.
+**Prompt itself**
 The format you are using for the prompt can and will change scores wildly.
 For example, for multichoice question answers, some common formats include very simple variations when presenting the choices, such as:
 ```
 and predicting either `A`/`B`/`C`/`D` or `<Choice A/B/C/D>`.
+These prompts are **semantically equivalent**, as they contain the exact same content - but they can still result in a difference of *several points for the same model*.
 <Note title="Prompt format sensitivity" emoji="📝" variant="danger">
+We did some experiments on this [here](https://x.com/clefourrier/status/1777319187913875893/photo/1) (you'll see up to a 7 points difference for the same model) and a [paper observed similar results](https://arxiv.org/abs/2310.11324).
+Semantically identical prompts can cause 7+ point score differences!
 Even tiny formatting variations (like `A.` vs `(A)` vs just listing choices) significantly impact scores. Models increasingly overfit to specific benchmark prompt formats during training, losing adaptation ability.
 **Real example**: Llama 3.1 models predicted correct MATH-Hard answers but scored poorly because they overfit to GSM8K's prompt format and couldn't adapt to different few-shot templates.
+This is something we observed on the Open LLM Leaderboard 2 for the Llama3.1 models. They were predicting the correct answers to our MATH-Hard evaluations, but were getting low scores, being unable to fit to the template provided in few-shot because they overfit the GSM8K prompt and answer format (another math eval).
 </Note>
 This [great paper](https://arxiv.org/abs/2407.07890)⭐ also highlights a side effect of this: a number of models are now trained to overfit benchmark prompts and answer formats, to the cost of adaptation to other prompts at evaluation time.
+Some tasks are also prefixed with a task prompt (eg: `The following questions are about <topic>`) - its presence or absence will also affect the scores.
+**System prompt and chat template**
 Chat models usually have been through instruction/preference training or fine-tuning. During this stage, they have learned to follow specific templates when inferring. For example, templates can require starting rounds of dialogue with a general prompt (called the `system prompt`) prefixed by specific tokens (usually `System: `). Said prompt is here to provide high-level instructions for the model, such as the contents of a persona, or general answering style instructions. Rounds of dialogue can also require adding prefix key words to text, such as `User` for queries and `Assistant` for answers.
 When using few shot, you also need to select if you want examples to be provided multi-turn (mimicking user/assistant turns) or all at once (in a single user prompt).
 Not following the chat template expected by the model at inference will kill its performance, as it will drive its output outside of the probability space it's been converging on.
+Similarly, if you are using a reasoning model, you need to make sure whether you are comparing with or without thinking enabled.
+**Few-shots samples**
+Two things are easy to mess up with few-shot samples: the number of few-shot examples, which ones you are using, and their specific ordering
+<Sidenote>
+The importance of using the same examples is not too surprising, if we assume some samples are better at expressing the task than others. More surprising maybe: you not only need to use the exact same samples, but also present them in the **exact same order**. Varying the order on the same samples led us to observe up to 3 points of difference on some subsets of MMLU (you can see [some results here](https://huggingface.co/blog/evaluation-structured-outputs) , it's the third colorgrid)
+</Sidenote>
+This is also a place where paying attention to the random seeds is important.
+**Parameters**
+For generative evaluations, parameters to pay attention to are making sure you are 1) using the **same end of sentence token** (you probably should not be using a default one for chat and reasoning models); 2) allowing your model to **generate the same number of tokens** for the evaluation (this is particularly crucial for reasoning models, which require a huge numbers of tokens in thinking mode); 3) if using sampling, that you are using the **same seed/temperature parameters**.