Clémentine committed on
Commit
c768f1d
·
1 Parent(s): 8537f75
app/src/content/article.mdx CHANGED
@@ -99,10 +99,6 @@ Best (but rarest) metrics are functional or based on rule based verifiers (thoug
99
  <DesigningAutomaticEvaluation />
100
 
101
 
102
- https://x.com/Kangwook_Lee/status/1993438649963164121
103
-
104
-
105
-
106
 
107
  <TroubleshootingInference />
108
 
 
99
  <DesigningAutomaticEvaluation />
100
 
101
 
 
 
 
 
102
 
103
  <TroubleshootingInference />
104
 
app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx CHANGED
@@ -8,18 +8,42 @@ import UsingHumanAnnotators from "../human-evaluation/using-human-annotators.mdx
8
 
9
  ### Dataset
10
 
11
- #### Using existing data
12
- - Use existing datasets, and assemble them differently
13
- You can aggregate existing data from different sources, evaluating a relevant capability for your task. A number of evaluation datasets are for example constructed from aggregating human evaluation datasets (such as MATH, LSAT, etc). In this case, follow the steps above.
 
 
 
 
 
 
 
 
14
 
15
  #### Creating a dataset manually
16
 
17
  <UsingHumanAnnotators />
18
 
19
  #### Creating a dataset synthetically
 
 
 
 
 
 
 
20
 
21
- - **Using synthetic data from models**: On this, you can check the very cool [Cosmopedia](https://huggingface.co/blog/cosmopedia) blog by cool HF colleagues! It's mostly studying how to create a synthetic training dataset, but similar techniques can be used for evaluation. Make sure to manually check/filter/inspect your dataset afterwards (following the above steps).
22
- - **Using rule-based techniques**: If your task allows, this is a very good way to get a virtually infinite supply of samples and avoid contamination! For some examples, you can look at [NPHardEval](https://arxiv.org/abs/2312.14890), [DyVal](https://arxiv.org/abs/2309.17167), [MuSR](https://arxiv.org/abs/2310.16049), [BabiQA](https://arxiv.org/abs/1502.05698), etc.
 
 
 
 
 
 
 
 
 
23
 
24
  #### Choosing a prompt
25
  The prompt is going to define:
@@ -98,14 +122,46 @@ However, nowadays most evaluations are generative: using generations (QA, questi
98
 
99
  If you are looking at **log-probabilities**, your metrics are going to be easy: you'll likely want to look at a variant of accuracy (how often the most likely choice is the best choice). It's important to normalize it by length (either character, token, or pmi). You could also look at perplexity, recall, or f1 score.
100
 
101
- If you're looking at generative evaluations, you'll need to decide if you compare generations as they are, or first normalize them with something.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
102
 
103
- <Note title="Normalization">
 
104
 
105
- Normalizations can easily [be unfair if not designed well](https://huggingface.co/blog/open-llm-leaderboard-drop), but overall they still provide signal at the task level.
 
 
 
 
 
 
 
 
 
 
 
 
 
106
 
107
- They are vital for specific tasks, such as math evaluations, where you want to extract your specific result from formatted outputs.
108
 
 
 
 
109
  In the below table, we make a list of some issues we saw happening when extracting predictions from model outputs using SymPy naively for the MATH dataset, and how Math-Verify, a specific math parser, solved these.
110
 
111
  | 📄 Example | ❗️Issue | ✅ Math-Verify | 🛑 Naive Approach |
@@ -116,7 +172,6 @@ In the below table, we make a list of some issues we saw happening when extracti
116
  | \(23\) | Failed extraction due to latex borders | `23` | None |
117
  | \((- \infty, -14) \cup (-3, \infty)\). | Failed extraction due to interval | Union(Interval.open(-oo, -14), Interval.open(-3, oo)) | None |
118
  | 100\% | Failed extraction due to invalid symbol | `1` | None |
119
- | \begin{pmatrix}\frac{1}{50}&\frac{7}{50}\\frac{7}{50}&\frac{49}{50}\end{pmatrix} | Failed extraction due to Matrix | Matrix([[1/50, 7/50], [7/50, 49/50]]) | None |
120
  | 1/3 == 0.333333 | No rounding support | True | False |
121
  | sqrt(1/2)*7 == sqrt(0.5)*7 | No numerical evaluation support | True | False |
122
 
@@ -124,78 +179,61 @@ In the below table, we make a list of some issues we saw happening when extracti
124
  Look at [this blog](https://huggingface.co/blog/math_verify_leaderboard) for more details!
125
  </Sidenote>
126
 
127
- They will also be important if you want to evaluate with added mechanisms for accuracy, such as Chain of Thought, as you'll need to remove the reasoning trace from the actual result
128
- </Note>
129
 
130
- Then, you'll need to select what to use to score your prediction, and this is where it gets trickyyy, so let's jump to the next chapter specifically on this!
131
 
 
132
 
133
- ## The hardest part of evaluation: Scoring free form text
 
134
 
135
- ### Automatically
 
 
 
 
136
 
137
- #### Metrics
138
- Most ways to automatically compare a string of text to a reference are match based.
139
 
140
- The easiest but least flexible match based metrics are exact matches of token sequences (with or without normalization, of full sentences or prefix only, etc).
141
- The translation and summarisation fields have also introduced automatic metrics which compare n-grams in sequences, like BLEU (& it's variants, like GLEU, SacreBLEU, etc), METEOR, ROUGE, chrF.
142
- They also introduced static model based metrics, usually based on embedding distances of sequences for similarity, like BLEURT, MAUVE, COMET.
143
-
144
- Once you have an accuracy score per sample, you can aggregate it across your whole set in several ways. In general, people average their results, but you can do more complex things depending on your needs.
145
- Look at the F1 score, precision, recall, or MCC, if your score is binary.
146
- If your score is continuous, you can want to use a mean squared error, mean absolute error, look at the R2 or at correlation coefficients (Pearson or Spearman).
147
 
 
 
148
 
149
- More generally, when picking your metric, you need to keep in mind what your task is really about. For some domains (ex: medical, chatbots with public interaction), you don't want to measure the average performance, but need a way to evaluate the **worst performance** you'll get (on medical quality of output, on toxicity, etc).
150
- <Sidenote>
151
- To go further, take a look at this [blog](https://ehudreiter.com/2024/07/10/challenges-in-evaluating-llms/).
152
- </Sidenote>
153
 
154
- <Note title="Pros and cons of using automated metrics">
155
- Automated benchmarks have the following advantages:
156
- - **Consistency and reproducibility**: You can run the same automated benchmark 10 times on the same model and you'll get the same results (baring variations in hardware or inherent model randomness). This means that you can easily create fair rankings of models for a given task.
157
- - **Scale at limited cost**: They are one of the cheapest way to evaluate models at the moment.
158
- - **Understandability**: Most automated metrics are very understandable.
159
- *Eg: an exact match will tell you if the generated text matches perfectly with the reference, and an accuracy score will tell you in how many cases the selected choice was the correct one (this will be a bit less the case for metrics such as `BLEU` or `ROUGE` for example).*
160
 
161
- However, they also present the following limitations:
162
- - **Reduced use on more complex tasks**: Automated benchmarks are working well for tasks where performance is easy to define and assess (for example, classification). More complex capabilities, on the other hand, are harder to decompose into well-defined and precise tasks.
163
- *Eg: what does "good at math" mean? Is it being good at arithmetic? - at logic? - able to reason on new mathematical concepts?*
164
- This led to the use of more **generalist** evaluations, which no longer decompose capabilities in sub-tasks, but assuming that general performance will be a **good proxy** for what we aim to measure.
165
- - **Contamination**: Once a dataset is published publicly in plain text, it will end up in model training datasets. This means that you have no guarantee when scoring a model that it has not parsed the evaluation data before.
166
- </Note>
167
 
168
- #### Using functional testing
169
- In the field of code, you want to evaluate generated programs not only on their semantics, but on their actual function. A good way to do so is therefore to check if code generated to follow a prompt passes correctly a suite of unit-tests designed to fit the task.
170
 
171
- This functionality approach is extremely promising, as it
172
- - allows to generate test cases more easily (in many cases, you can generate rule-based test cases)
173
- - therefore reducing overfitting
174
- - tests models on specific active capabilities
175
 
176
- It's however an approach which requires creativity to be translated to text!
 
 
177
 
178
- A good example of this are IFEval and IFBench, an evaluation benchmark which tests if models can follow instructions. It works by creating a number of formatting instructions (*Add this number of bullet points. Capitalize only one sentence.* etc), and strictly testing if the format is followed. More work is clearly needed to extend this idea to other features of text to analyze!
179
 
180
  ### With humans
181
  Human evaluation is simply asking humans to score predictions.
182
 
183
- Human evaluation is very interesting, because it's **flexibility** (if you define clearly enough what you are evaluating, you can get scores for about anything!), **uncontaminated** (If you ask humans to write new questions to test your system, they should not be present in your training data (hopefully)), and correlates well with human preference for obvious reasons.
184
 
185
  <Sidenote>
186
- However, when doing evaluation with humans, you need to make sure your annotators are diverse enough that your results generalizes.*
187
  </Sidenote>
188
 
189
- However, it also present a number of biases:
190
- - **First impressions bias**: Human evaluators tend to estimate the quality of answers [based on first impressions](https://arxiv.org/abs/2309.16349), instead of actual factuality or faithfulness.
191
- - **Tone bias**: Crowdsourced annotators are notably very sensitive to tone, and underestimate the number of factual or logical errors in an assertive answer. In other terms, if a model says wrong things in a confident tone, human evaluators are much less likely to notice it, which could skew ratings towards the more assertive models. (Expert annotators are less likely to fall prey to these biases.)
192
- - **Self-preference bias**: Humans are [most likely to prefer answers which appeal to their views or align with their opinions or errors](https://arxiv.org/abs/2310.13548), rather than answers which are factually correct.
193
- - **Identity bias**: People with different identities tend to have different values, and rate model answers very differently (for example on [toxicity](https://arxiv.org/abs/2205.00501))
194
-
195
- There are 3 main ways to do evaluation with paid annotators:
196
- - If **you don't have a dataset**, but want to explore a set of capabilities, you provide humans with a task and scoring guidelines (e.g *Try to make both these model output toxic language; a model gets 0 if it was toxic, 1 if it was not.*), and access to one (or several) model(s) that they can interact with, then ask them to provide their scores and reasoning.
197
- - If **you already have a dataset** (eg: a set of *prompts that you want your model to never answer*, for example for safety purposes), you preprompt your model with them, and provide the prompt, output and scoring guidelines to humans.
198
- - If **you already have a dataset and scores**, you can ask humans to review your evaluation method by doing [error annotation](https://ehudreiter.com/2022/06/01/error-annotations-to-evaluate/) (*it can also be used as a scoring system in the above category*). It's a very important step of testing new evaluation system, but it technically falls under evaluating an evaluation, so it's slightly out of scope here.
199
 
200
  Pros of systematic human evaluations, especially with paid annotators, are that you're **getting high quality and private data** adapted to your use case (especially if you rely on in house annotators), which are mostly **explainable** (scores obtained by the models will be explainable by the humans who gave them).
201
  However, it's more costly (especially as you'll most likely need rounds of annotations to adapt your guidelines) and does not scale well.
@@ -209,7 +247,11 @@ Pros of casual human evaluations are that they are cheap, scale better and allow
209
  it's hard to enforce a consistent grading from many community members using broad guidelines, especially since annotators preferences tend to be [culturally bound](https://arxiv.org/abs/2404.16019v1). One can hope that these effect is smoothed over by the sheer scale of the votes, through a "wisdom of the crowd" effect (see Galton's wikipedia page).
210
  </Sidenote>
211
 
212
-
 
 
 
 
213
 
214
  ### With judge models
215
  Judge models are simply **neural network used to evaluate the output of other neural networks**. In most cases, they evaluate text generations.
@@ -219,46 +261,35 @@ Judge models range from small specialized classifiers (think "spam filter", but
219
  Model as judges allow to score text on complex and nuanced properties.
220
  For example, an exact match between a prediction and reference can allow you to test if a model predicted the correct fact or number, but assessing more open-ended empirical capabilities (like fluency, poetry quality, or faithfulness to an input) requires more complex evaluators.
221
 
222
- That's where models as judges come into play.
223
-
224
  They are used on 3 main tasks:
225
  - *Scoring a model generation*, on a provided scale, to assess a property of the text (fluency, toxicity, coherence, persuasiveness, etc).
226
  - *Pairwise scoring*: comparing a pair model outputs to pick the best text with respect to a given property
227
  - *Computing the similarity* between a model output and a reference
228
 
229
- *Note: In this document, I'll focus on the LLMs + prompt approach for now, but you should definitely check out how classifier judges work, as I think it can be fairly robust and well adapted to a number of use cases, and the recently introduced and promising reward model as judge approach (introduced in [this tech report](https://research.nvidia.com/publication/2024-06_nemotron-4-340b), and on which we have a small page [here](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/model-as-a-judge/what-about-reward-models.md))*
230
 
231
  #### Pros and cons of using judge-LLMs
232
  People in favor of judge LLMs have been claiming they provide better:
233
  - **Objectivity** when compared to humans: They automate empirical judgments in an objective and reproducible manner (theoretically - in my opinion, they add more subtle bias than they are worth)
234
  - **Scale and reproducibility**: They are more scalable than human annotators, which allows to reproduce scoring on large amounts of data (if you control for temperature).
235
  - **Cost**: They are cheap to instantiate, as they don't require to train a new model, and can just rely on good prompting and an existing high quality LLM. They are also cheaper than paying actual human annotators (capitalism...).
236
- - **Alignment with human judgments**: They are somehow correlated with human judgments.
237
 
238
- In my opinion, using LLM judges correctly is extremely tricky, and it's easy to be deceived for critical use cases:
239
- - LLM as judges seem objective, but they have many **hidden biases** that can be harder to detect than the ones in humans, since we're not as actively looking for them (see [model-as-a-judge/Tips and tricks]). Besides, there are ways to reduce human bias by designing survey questions in specific and statistically robust ways (which has been studied in sociology for about a century), where LLM-prompting is not as robust yet. Using LLMs to evaluate LLMs has been compared to creating an echo-chamber effect, by reinforcing biases subtly.
240
  - They are indeed scalable, but contribute to creating massive amounts of data which themselves need to be examined to ensure their quality (for example, you can improve the quality of LLM-judges by asking them to generate a thinking trace, or reasoning around their data, which makes even more new artificial data to analyse)
241
  - They are indeed cheap to instantiate, but paying actual expert human annotators is likely to give you qualitatively better results for your specific use cases.
242
 
243
- <Note title="Critical limitations of LLM judges" emoji="⚠️" variant="warning">
244
-
245
- Using LLM judges is extremely tricky:
246
- - **Hidden biases**: Harder to detect than human biases; creates echo-chamber effects
247
- - **Data overload**: Generates massive synthetic data needing quality examination
248
- - **False objectivity**: Seems objective but reinforces subtle biases
249
- - **Expert humans better**: For critical use cases, expert annotators provide higher quality
250
-
251
- See [Tips and tricks](./tips-and-tricks) for bias mitigation strategies.
252
- </Note>
253
 
254
- This section is a bit long, because you need to be well aware of their limitations: a lot of people are blindly jumping into using model judges because they seem easier, but then end up with uninsterpretable data with tricky bias to extract.
 
255
 
256
- If you want to give it a go, I suggest first reading this [very good guide](https://huggingface.co/learn/cookbook/en/llm_judge) (⭐) by Aymeric Roucher on how to setup your first LLM as judge!
257
  You can also try the [distilabel](https://distilabel.argilla.io/latest/) library, which allows you to generate synthetic data and update it using LLMs. They have a nice [tutorial](https://distilabel.argilla.io/latest/sections/pipeline_samples/papers/ultrafeedback/) applying the methodology of the [Ultrafeedback paper](https://arxiv.org/abs/2310.01377) as well as a [tutorial on benchmarking](https://distilabel.argilla.io/latest/sections/pipeline_samples/examples/benchmarking_with_distilabel/) implementing the Arena Hard benchmark.
 
258
 
259
  #### Getting a Judge-Model
260
 
261
- When using an existing LLM, you can go for [generalist, high capability models](https://arxiv.org/abs/2306.05685v4), using [small specialist models](https://arxiv.org/abs/2405.01535) trained specifically to discriminate from preference data, or training your own.
262
 
263
  **Using a generalist LLM**
264
 
@@ -286,56 +317,48 @@ You'll find a good cost analysis of model providers [here](https://huggingface.c
286
 
287
  You can also make the choice to use tiny specialized LLM judges. With often a couple billion parameters, they can run locally on most recent consumer hardware, while being trained from scratch or fine-tuned using instruction data. You often need to follow their specific prompt formats.
288
 
289
- Some existing models:
290
- - Flow-Judge-v0.1 ([weights](https://huggingface.co/collections/flowaicom/flow-judge-v01-66e6af5fc3b3a128bde07dec)), 3.8B parameters, a Phi-3.5-mini-instruct fine-tuned on a synthetic preference dataset
291
- - Prometheus ([weights](https://huggingface.co/prometheus-eval/prometheus-13b-v1.0), [paper](https://arxiv.org/abs/2310.08491)), 13B parameters, a model trained from scratch on synthetic preference dataset. A 7B parameter [v2](https://huggingface.co/prometheus-eval/prometheus-7b-v2.0) also exists, a Mistral-7B-Instruct-v0.2 fine-tune on a bigger synthetic preference dataset, with added weight merging
292
- - JudgeLM ([paper](https://arxiv.org/abs/2310.17631)), 7B to 33B parameters, models trained from scratch on synthetic preference datasets generated with a variety of models.
293
 
294
  **Training your own**
295
- You can also make the choice to train or fine-tune your own LLM-as-judge. (I would avoid doing this, unless you are on a very niche topic).
296
 
297
- You first need to gather preference data for your task of interest, which can come
298
  - From existing [human preference datasets](https://www.kaggle.com/competitions/lmsys-chatbot-arena)
299
  - From model generated preference data (which you can generate following the above tiny-model judges papers data sections, or get directly, for example from the Prometheus [preference](https://huggingface.co/datasets/prometheus-eval/Preference-Collection) and [feedback](https://huggingface.co/datasets/prometheus-eval/Feedback-Collection) collections).
300
 
301
- Then you need to decide whether to start from a small model to train from scratch, or from an existing model, that you can
302
- - distill into a new smaller model
303
- - quantize.
304
- - then fine-tune (using peft or adapter weights if the model is big and your training compute low) using the above data
305
- - apparently [starting from a reward model works better than from an instruct model](https://x.com/dk21/status/1826292289930674590)
306
 
307
  #### Designing your evaluation prompt
308
 
309
  Once you've selected your model, you need to define what is the best possible prompt for your task.
310
 
311
- Some general guidelines I've come across online when designing the prompt itself are:
312
- - Provide a clear description of the task at hand:
313
- - `Your task is to do X`.
314
- - `You will be provided with Y`.
315
- - Provide clear instructions on the evaluation criteria, including a detailed scoring system if needed:
316
- - `You should evaluate property Z on a scale of 1 - 5, where 1 means ...`
317
- - `You should evaluate if property Z is present in the sample Y. Property Z is present if ...`
318
- - Provide some additional "reasoning" evaluation steps:
319
- - `To judge this task, you must first make sure to read sample Y carefully to identify ..., then ...`
320
- - Specify the desired output format (adding fields will help consistency)
321
- - `Your answer should be provided in JSON, with the following format {"Score": Your score, "Reasoning": The reasoning which led you to this score}`
322
-
323
- <Note title="Core prompt design principles" emoji="📝" variant="info">
324
-
325
- **Essential elements for effective judge prompts:**
326
- - **Clear task description**: Specify exactly what the judge needs to do
327
- - **Detailed criteria**: Provide explicit scoring scales with clear definitions
328
- - **Reasoning steps**: Guide the judge through the evaluation process
329
- - **Structured output**: Use JSON format for consistency and parsability
330
 
 
 
 
 
 
 
 
 
 
331
  </Note>
332
 
333
  You can and should take inspiration from [MixEval](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended/mix_eval/judge_prompts.pyy) or [MTBench](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended/mt_bench/judge_prompt_templates.py) prompt templates.
334
 
335
- Other tidbits:
336
- - Pairwise comparison [correlates better with human preference](https://arxiv.org/abs/2403.16950) than scoring, and is more robust generally.
337
- - If you really want a score, use an integer scale make sure you provide a detailed explanation for what [each score represents](https://x.com/seungonekim/status/1749289437165769177), or an additive prompt (`provide 1 point for this characteristic of the answer, 1 additional point if ...` etc)
338
- - Using one prompt per capability to score tends to give better and more robust results
 
 
 
339
 
340
  You can also improve accuracy using the following, possibly more costly, techniques:
341
  - **Few shot examples**: like in many other tasks, if you provide examples it can help its reasoning. However, this adds to your context length.
@@ -347,22 +370,14 @@ You can also improve accuracy using the following, possibly more costly, techniq
347
  - You can also experiment with using one model with variations on temperature
348
  - Surprisingly, the community has found that adding stakes to the prompts (`answer correctly and you'll get a kitten`) can increase correctness. Your mileage may vary on this one, adapt to your needs.
349
 
350
- <Note title="High-stakes evaluation requires rigor" emoji="⚠️" variant="warning">
351
-
352
- For production or critical use cases, use methodologies transferred from the humanities:
353
- - Compute inter-annotator agreement metrics
354
- - Use proper survey design methodology to mitigate bias
355
- </Note>
356
-
357
- However, most people don't really want a reproducible and high quality unbiased eval, and will be happy with quick and dirty evaluation through OK-ish prompts. (Which is an OK situation to be in! Just depends on the consequences attached).
358
 
359
  #### Evaluating your evaluator
360
 
361
  Before using a judge-LLM in production or at scale, you want to evaluate its quality for your task, to make sure its scores are actually relevant and useful for you.
362
 
363
-
364
  <Note>
365
- This will be easier to do if it predicts binary outputs, because you'll be able to interpretable classification metrics (accuracy/recall/precision). If it predicts scores on a scale, it will be much harder to estimate the quality of the correlation with a reference.*
366
  </Note>
367
 
368
  So, once you have selected your model judge and its prompt, you'll need to do the following.
@@ -398,24 +413,39 @@ You need to decide what your threshold for acceptance is. Depending on how hard
398
  **Mitigating well known biases of LLM as judges**
399
 
400
  <Note title="Known LLM judge biases and mitigations" emoji="⚠️" variant="warning">
401
- - **Lack of internal consistency**: a judge might give you different judgments if you prompt it several times (if the temperature is not 0)
402
- - You can mitigate this by doing self-consistency prompting of your judge, prompting it multiple times and keeping the majority output
403
- - **Self-preference**: they tend to [favor their own outputs](https://arxiv.org/abs/2404.13076) when scoring answers
404
- - You can mitigate this by using a jury
405
- - **Blindness to input perturbation**: models are bad at identifying [perturbated input](https://arxiv.org/abs/2406.13439) and tangentially [bad at providing consistent score ranges](https://twitter.com/aparnadhinak/status/1748368364395721128) (extended experiments on this [here](https://github.com/LeonEricsson/llmjudge/blob/main/README.md)). For example, if asked to rank text quality on text where noise has been added on a consistent scale, the grades predicted do not reflect this scale.
406
- - You can mitigate this by
407
- - asking the model to explain its reasoning [before providing a score](https://twitter.com/seungonekim/status/1749289437165769177)
408
- - providing a coherent grading scale in the prompt.
409
- - **Position-bias**: they tend to [favor specific answer positions](https://arxiv.org/abs/2306.05685). For example, when presented with pairwise comparisons, Claude and GPT3.5 tend to quite systematically prefer the first choice, or the second choice
410
- - You can mitigate this by
411
- - switching answer positions randomly
412
- - computing the log-probabilities of all possible choices to get a normalized answer
413
- - **Verbosity-bias** (or length-bias): they tend to like more verbose answers
414
- - You can mitigate this by [accounting for the answer difference in length](https://arxiv.org/abs/2404.04475)
415
- - **Debatable consistency [with human answers](https://arxiv.org/abs/2308.15812):**
416
- - However, it's also [debatable if non-expert humans are a good baseline for absolutely all evaluations](https://arxiv.org/abs/2202.06935). For some specific domains (medical, legal, mathematics, etc), relying on non-expert human annotators is as bad a baseline as using an LLM directly.
417
- - **Format bias**: they tend to fail to evaluate accurately if the prompt format [is too far away](https://arxiv.org/abs/2310.17631) from what it's been trained with. For example, a model trained to do pairwise comparison with an added reference answer will fail if said answer is not provided, and failures will also occur the other way around.
418
- - You can mitigate this by paying attention to the training prompt format (if the model was instruction tuned) and ensuring you follow it.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
419
  </Note>
420
 
421
  **Picking correct tasks for an LLM judge**
 
8
 
9
  ### Dataset
10
 
11
+ #### Using existing data
12
+ You can use existing datasets as they are and change the prompting or metrics associated with them (as has been done for older evaluations to adapt them to new prompting methods), but you can also aggregate datasets.
13
+
14
+ Dataset aggregation is a good approach when you want to evaluate a specific capability that isn't well covered by a single benchmark. Rather than starting from scratch, you can combine samples from multiple existing datasets to create a targeted evaluation suite. That's, for example, what the authors of the "Measuring AGI" paper did recently to try to create a new "AGI evaluation" dataset.
15
+
16
+ When aggregating datasets, pay attention to whether
17
+ - they contain redundant data (most mathematics datasets are rewrites or aggregations of the same initial problems)
18
+ - you need balanced representation across sources (you might not want one dataset to dominate and skew your evaluation) - this will also determine whether to aggregate scores across all samples or per subset
19
+ - formats and difficulty levels are compatible (typically, if creating a unified dataset, beware of mixing up samples requiring sampling or not).
20
+
21
+ <Sidenote>Examples: MMLU, Big-Bench (hundreds of diverse tasks), and HELM (combines multiple existing benchmarks for holistic evaluation)</Sidenote>
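A minimal sketch of what such an aggregation can look like with the `datasets` library, keeping a `source` column so you can report scores per subset; the repository ids and column names are placeholders to adapt to your own sources.

```python
from datasets import load_dataset, concatenate_datasets

SOURCES = {
    # subset name: (hub repo id, question column, answer column) - placeholders
    "math_a": ("org/math-benchmark-a", "problem", "solution"),
    "math_b": ("org/math-benchmark-b", "question", "answer"),
}

subsets = []
for name, (repo, q_col, a_col) in SOURCES.items():
    ds = load_dataset(repo, split="test")
    # Map every source to a shared schema before concatenating
    ds = ds.map(
        lambda ex: {"question": ex[q_col], "answer": ex[a_col], "source": name},
        remove_columns=ds.column_names,
    )
    subsets.append(ds)

eval_set = concatenate_datasets(subsets)
print(eval_set.to_pandas().groupby("source").size())  # check subset balance
```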
22
 
23
  #### Creating a dataset manually
24
 
25
  <UsingHumanAnnotators />
26
 
27
  #### Creating a dataset synthetically
28
+ **Using rule-based techniques**
29
+
30
+ If your task allows, this is a very good way to get a virtually infinite supply of samples and avoid contamination! For some examples, you can look at [NPHardEval](https://arxiv.org/abs/2312.14890), [DyVal](https://arxiv.org/abs/2309.17167), [MuSR](https://arxiv.org/abs/2310.16049), [BabiQA](https://arxiv.org/abs/1502.05698), [ZebraLogic](https://arxiv.org/pdf/2502.01100), IFEval, or GSMTemplate among others.
31
+
32
+ Tasks which usually fit this paradigm test mathematical, logical, or coding abilities.
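As an illustration, here is a toy, template-style generator (entirely made up, not taken from any of the benchmarks above): every call yields a fresh arithmetic problem with a known answer, so samples are effectively unlimited and can't have leaked into training data.

```python
import random

def make_sample(rng: random.Random) -> dict:
    # Template with randomized entities and numbers; the rule itself computes the answer
    name = rng.choice(["Ava", "Noah", "Lina", "Omar"])
    start = rng.randint(10, 99)
    bought = rng.randint(2, 20)
    given = rng.randint(1, 9)
    question = (
        f"{name} has {start} marbles, buys {bought} more, "
        f"then gives {given} away. How many marbles are left?"
    )
    return {"question": question, "answer": str(start + bought - given)}

rng = random.Random(42)  # fixed seed so the evaluation set is reproducible
dataset = [make_sample(rng) for _ in range(1_000)]
```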
33
+
34
+ **Creating synthetic data with models**
35
 
36
+ If you want to create synthetic data, you usually start from a number of seed documents that will act as your ground truth. These can be internal and specific to your use cases, or available on the web and of high quality (like Wikipedia, Stack Overflow, ...). You'll then likely need to chunk your data into self-contained units of meaning.
37
+
38
+ You'll then likely want a model to design questions from your data. For this, you will need to select a frontier model, and design a very good prompt asking the model to create use-case relevant questions from the provided data. It's better if you ask the model to provide the source on which it based its question.
39
+
40
+ You can also use seed prompts as examples to provide to an external model for it to write the prompt for your model to generate new questions, if you want to go full synthetic ^^
41
+
42
+ Once this is done, you can run an automatic validation step by using a model from a different family as a judge over your ground truth + questions + answers.
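To make the pipeline above concrete, here is a rough sketch of the chunking and question-generation steps. The chunking is naive, the prompt is only a starting point, and the client and model name (the `openai` package with "gpt-4o") are placeholders for whichever frontier model you actually use.

```python
from openai import OpenAI

def chunk(text: str, max_chars: int = 2000) -> list[str]:
    # Naive paragraph-based chunking; swap in something smarter if needed
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            current = ""
        current += para + "\n\n"
    if current:
        chunks.append(current)
    return chunks

QUESTION_PROMPT = """You are writing evaluation questions for {use_case}.
Based only on the source text below, write one question a domain expert could
answer, the expected answer, and quote the exact passage supporting it.

Source text:
{chunk}

Answer in JSON with keys "question", "answer" and "supporting_passage"."""

client = OpenAI()

def generate_question(chunk_text: str, use_case: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder for your frontier model of choice
        messages=[{"role": "user", "content": QUESTION_PROMPT.format(use_case=use_case, chunk=chunk_text)}],
    )
    return response.choices[0].message.content
```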
43
+
44
+ <Note title="Always make sure that you're checking your data" emoji="⚠️" variant="warning">
45
+ No matter how tempting it is to do everything automatically, you should always check your data at every step to make sure your evaluations are of high quality. Evaluation is the name of the game, and you need extremely good data.
46
+ </Note>
47
 
48
  #### Choosing a prompt
49
  The prompt is going to define:
 
122
 
123
  If you are looking at **log-probabilities**, your metrics are going to be easy: you'll likely want to look at a variant of accuracy (how often the most likely choice is the best choice). It's important to normalize it by length (either character, token, or pmi). You could also look at perplexity, recall, or f1 score.
124
 
125
+ If you're looking at generative evaluations, you'll need to decide if you compare generations as they are, or first normalize them with something. Then, you'll need to select what to use to score your prediction, and this is where it gets trickyyy, so let's jump to the next chapter specifically on this!
126
+
127
+
128
+ ## The hardest part of evaluation: Scoring free form text
129
+
130
+ ### Automatically
131
+
132
+ #### Metrics
133
+ Most ways to automatically compare a string of text to a reference are match based.
134
+
135
+ The easiest but least flexible match-based metrics are **exact matches** of token sequences. <Sidenote> Be aware that "exact match" is used as a catch-all name, and also includes "fuzzy matches" of strings: compared with normalization, on subsets of tokens (prefix only for example), etc. </Sidenote>. While simple and unambiguous, they provide no partial credit - a prediction that's correct except for one word scores the same as one that's completely wrong.
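A minimal sketch of what these look like in practice; the normalization choices here (lowercasing, stripping punctuation and extra whitespace) are illustrative, not a standard.

```python
import re
import string

def normalize(text: str) -> str:
    text = text.lower().strip()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text)

def exact_match(prediction: str, reference: str) -> bool:
    return prediction == reference

def quasi_exact_match(prediction: str, reference: str) -> bool:
    # "Fuzzy" variant: compare after normalization
    return normalize(prediction) == normalize(reference)

def prefix_match(prediction: str, reference: str) -> bool:
    # Useful when models keep generating after having given the answer
    return normalize(prediction).startswith(normalize(reference))
```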
136
+
137
+ The translation and summarisation fields have introduced automatic metrics which compare similarity through overlap of n-grams in sequences. **BLEU** (Bilingual Evaluation Understudy) measures n-gram overlap with reference translations and remains widely used despite having a length bias toward shorter translations and correlating poorly with humans at the sentence level (it notably won't work well for predictions which are semantically equivalent but written in a different fashion than the reference). **ROUGE** does a similar thing but focuses more on recall-oriented n-gram overlap.
138
+ Lastly, you'll also find model-based metrics using embedding distances for similarity like **BLEURT** (it uses BERT-based learned representations trained on human judgments from WMT, providing better semantic understanding than n-gram methods, but requiring a model download and task-specific fine-tuning for optimal performance).
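If you want to try these, the `evaluate` library wraps most of them behind a shared API (assuming the relevant metric backends are installed); the sentences below are only there to show the call shape.

```python
import evaluate

predictions = ["the cat sat on the mat"]
references = ["a cat was sitting on the mat"]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

# BLEU supports several references per prediction, hence the nested list
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(rouge.compute(predictions=predictions, references=references))
```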
139
+
140
+ Once you have an accuracy score per sample, you can aggregate it across your whole set in several ways. In general, people average their results, but you can do more complex things depending on your needs. (Some metrics already come with an aggregation, like CorpusBLEU).
141
 
142
+ If your score is **binary**, look at the **precision** (critical when false positives are costly), **recall** (critical when missing positives is costly), **F1 score** (balances precision and recall, good for imbalanced data), or **MCC** (Matthews Correlation Coefficient, which works well with imbalanced datasets by considering all confusion matrix elements).
143
+ If your score is **continuous**, you can use **mean squared error** (penalizes large errors but heavily weights outliers), **mean absolute error** (more balanced than MSE), or if you assume your data should follow a specific linear regression model, you can look at measures like the **R²** or correlation coefficients like **Pearson** (for linear relationships, assumes normality) or **Spearman** (for monotonic relationships without normality assumptions).
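A quick sketch of computing these with scikit-learn and scipy (the values below are made up):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import (
    f1_score, matthews_corrcoef, mean_absolute_error, precision_score, recall_score,
)

# Binary case: per-sample pass/fail against gold labels
gold = np.array([1, 0, 1, 1, 0, 1])
pred = np.array([1, 0, 0, 1, 1, 1])
print("precision", precision_score(gold, pred))
print("recall   ", recall_score(gold, pred))
print("f1       ", f1_score(gold, pred))
print("mcc      ", matthews_corrcoef(gold, pred))

# Continuous case: predicted scores against target scores
target = np.array([0.2, 0.8, 0.5, 0.9])
scores = np.array([0.3, 0.7, 0.4, 0.95])
print("mae      ", mean_absolute_error(target, scores))
print("pearson  ", pearsonr(target, scores)[0])
print("spearman ", spearmanr(target, scores)[0])
```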
144
 
145
+ More generally, when picking your metric and its aggregation, you need to keep in mind what your task is really about. For some domains (ex: medical, chatbots with public interaction), you don't want to measure the average performance, but need a way to evaluate the **worst performance** you'll get (on medical quality of output, on toxicity, etc).
146
+ <Sidenote>
147
+ To go further, take a look at this [blog](https://ehudreiter.com/2024/07/10/challenges-in-evaluating-llms/). You'll also find a complete list of metrics and their uses in [this organisation](https://huggingface.co/evaluate-metric).
148
+ </Sidenote>
149
+
150
+
151
+ <Note title="Pros and cons of using automated metrics">
152
+ Automated benchmarks have the following advantages:
153
+ - **Consistency and reproducibility**: You can run the same automated benchmark 10 times on the same model and you'll get the same results (barring variations in hardware or inherent model randomness). This means that you can easily create fair rankings of models for a given task.
154
+ - **Scale at limited cost**: They are one of the cheapest ways to evaluate models at the moment.
155
+ - **Understandability**: Most automated metrics are very understandable.
156
+
157
+ However, they also have a **reduced use on more complex tasks**: an automatic metric requires a perfect, unique, and unambiguous reference/gold answer, which you only get for tasks where performance is easy to define and assess (for example, toxicity classification, or knowledge questions with a single answer). More complex capabilities, on the other hand, are harder to decompose into a single and simple answer.
158
+ </Note>
159
 
160
+ #### Normalization
161
 
162
+ Normalization means changing a string of characters to have it fit a specific reference format. For example, when comparing a model prediction to a reference, you usually don't want to penalize extra spacing in the prediction, or added punctuation or capitalisation. That's why you normalize your prediction.
163
+
164
+ They are vital for specific tasks, such as math evaluations, where you want to extract an equation from a longer prediction, and compare it to a reference.
165
  In the below table, we make a list of some issues we saw happening when extracting predictions from model outputs using SymPy naively for the MATH dataset, and how Math-Verify, a specific math parser, solved these.
166
 
167
  | 📄 Example | ❗️Issue | ✅ Math-Verify | 🛑 Naive Approach |
 
172
  | \(23\) | Failed extraction due to latex borders | `23` | None |
173
  | \((- \infty, -14) \cup (-3, \infty)\). | Failed extraction due to interval | Union(Interval.open(-oo, -14), Interval.open(-3, oo)) | None |
174
  | 100\% | Failed extraction due to invalid symbol | `1` | None |
 
175
  | 1/3 == 0.333333 | No rounding support | True | False |
176
  | sqrt(1/2)*7 == sqrt(0.5)*7 | No numerical evaluation support | True | False |
177
 
 
179
  Look at [this blog](https://huggingface.co/blog/math_verify_leaderboard) for more details!
180
  </Sidenote>
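In code, using Math-Verify boils down to parsing both sides and comparing them (a small sketch, assuming the `math-verify` package is installed):

```python
from math_verify import parse, verify

gold = parse("$\\frac{1}{3}$")
prediction = parse("The final answer is $0.333333$")  # extraction from a longer model output

print(verify(gold, prediction))  # should be True thanks to rounding-aware comparison
```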
181
 
182
+ Normalizations can easily [be unfair if not designed well](https://huggingface.co/blog/open-llm-leaderboard-drop), but overall they still help provide signal at the task level.
 
183
 
184
+ They are also important for evaluation of predictions generated with chain of thought, or reasoning, as you'll need to remove the reasoning trace (which is not part of the final answer) from the output to get the actual answer.
185
 
186
+ #### Adding sampling
187
 
188
+ When models generate outputs, sampling multiple times and aggregating results can provide a more robust signal than a single greedy generation.
189
+ This is particularly important for complex reasoning tasks where models may arrive at correct answers through different paths.
190
 
191
+ Common sampling-based metrics are:
192
+ - **pass@k over n**: Given n generated samples, estimates the probability that at least one of k samples passes the test. <Sidenote> You'll find two functions for this metric: computed as: $\text{pass}@k = \mathbb{E}[\text{at least 1 correct among k samples}]$, or computed with an unbiased estimator with: $\text{pass}@k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$ where c is the number of correct samples among n total samples. </Sidenote>
193
+ - **maj@n** (majority voting): Sample n generations and take the most frequent answer. This helps filter out spurious outputs and works particularly well when the model's correct reasoning path is more consistent than its errors. Commonly used for math and reasoning tasks.
194
+ - **cot@n** (chain-of-thought sampling): Sample n reasoning traces and evaluate them. Can be combined with majority voting or a pass@k (sample n reasoning chains, extract final answers, take majority or a threshold).
195
+ - **avg@n** (stable average score): Average the scores across n samples. It's a more stable estimator of performance than using "best" or "most common" case.
196
 
197
+ When you use sampling evaluations, make sure to always report all sampling parameters (temperature, top-p, k value) as they significantly affect results.
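For reference, the unbiased pass@k estimator from the sidenote above, plus a tiny maj@n helper, can be sketched as follows (both operate on already-scored or already-extracted samples):

```python
from collections import Counter
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples generated, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

def maj_at_n(answers: list[str]) -> str:
    """maj@n: most frequent extracted answer among n samples."""
    return Counter(answers).most_common(1)[0][0]

print(pass_at_k(n=16, c=3, k=4))          # probability that at least 1 of 4 draws is correct
print(maj_at_n(["12", "12", "7", "12"]))  # "12"
```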
 
198
 
199
+ <Note title="When can you use sampling and when shouldn't you?">
200
+ **For training evaluation/ablations**: Generally avoid sampling metrics as they're expensive and add variance. Stick to greedy decoding with a fixed seed.
201
+ **For post-training evaluation**: Sampling metrics can reveal capabilities that greedy decoding misses (especially for more complex tasks requiring reasoning, math or code).
202
+ **At inference**: ✅ These metrics help estimate how much improvement you can get from sampling multiple times at inference. It's particularly cool when you want to study how far you can push small models with test time compute.
 
 
 
203
 
204
+ However, keep in mind that sampling k times multiplies your evaluation cost by k. For expensive models or large datasets, this adds up very quickly!
205
+ </Note>
206
 
207
+ #### Using functional testing
208
+ Instead of comparing generated text to a reference through fuzzy string matching, functional testing evaluates whether outputs satisfy specific verifiable constraints. This approach is extremely promising because it's more flexible and allows "infinite" regeneration of test cases through rule-based generation (which reduces overfitting).
 
 
209
 
210
+ **IFEval and IFBench** are excellent examples of this approach for instruction following evaluation. Rather than asking "does this text match a reference answer?", they ask "does this text satisfy formatting constraints given in the instructions?"
 
 
 
 
 
211
 
212
+ For instance, instructions might specify:
213
+ - *"Include exactly 3 bullet points"* verify the output contains exactly 3 bullets
214
+ - *"Capitalize only the first sentence"* parse and check capitalization patterns
215
+ - *"Use the word 'algorithm' at least twice"* count word occurrences
216
+ - *"Your response must be in JSON format with keys 'answer' and 'reasoning'"* validate JSON structure
 
217
 
218
+ Each constraint can be checked with a specific rule-based verifier, making these evaluations unambiguous, interpretable, fast, and considerably less costly than using models as judges.
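To give an idea, toy verifiers for some of the constraints above could look like this (real suites such as IFEval implement many more, with much stricter parsing):

```python
import json
import re

def has_n_bullets(text: str, n: int) -> bool:
    return len(re.findall(r"^\s*[-*] ", text, flags=re.MULTILINE)) == n

def uses_word_at_least(text: str, word: str, n: int) -> bool:
    return len(re.findall(rf"\b{re.escape(word)}\b", text, flags=re.IGNORECASE)) >= n

def is_json_with_keys(text: str, keys: set[str]) -> bool:
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and keys <= parsed.keys()

output = '{"answer": "42", "reasoning": "..."}'
print(is_json_with_keys(output, {"answer", "reasoning"}))  # True
```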
 
219
 
220
+ This functional approach works particularly well for instruction following, but requires creativity to extend to other text properties. The key is identifying aspects of text that can be verified programmatically rather than through semantic comparison.
 
 
 
221
 
222
+ <Sidenote>
223
+ Functional testing is inspired by code evaluation, where functional testing through unit tests is standard practice (checking if generated code produces correct outputs for given inputs).
224
+ </Sidenote>
225
 
 
226
 
227
  ### With humans
228
  Human evaluation is simply asking humans to score predictions.
229
 
230
+ Human evaluation is very interesting, because of its **flexibility** (if you define clearly enough what you are evaluating, you can get scores for about anything!), **inherent lack of contamination** (if humans write new questions to test your system, they should not be present in your training data, hopefully), and **good correlation with human preference** for obvious reasons.
231
 
232
  <Sidenote>
233
+ However, when doing evaluation with humans, you need to make sure your annotators are diverse enough that your results generalize.
234
  </Sidenote>
235
 
236
+ There are 3 main ways to do evaluation with paid annotators. If **you don't have a dataset**, but want to explore a set of capabilities, you provide humans with a task and scoring guidelines (e.g. *Try to make both of these models output toxic language; a model gets 0 if it was toxic, 1 if it was not.*), and access to one (or several) model(s) that they can interact with, then ask them to provide their scores and reasoning. If **you already have a dataset** (eg: a set of *prompts that you want your model to never answer*, for example for safety purposes), you preprompt your model with them, and provide the prompt, output and scoring guidelines to humans. If **you already have a dataset and scores**, you can ask humans to review your evaluation method by doing [error annotation](https://ehudreiter.com/2022/06/01/error-annotations-to-evaluate/) (*it can also be used as a scoring system in the above category*). It's a very important step of testing a new evaluation system, but it technically falls under evaluating an evaluation, so it's slightly out of scope here.
 
 
 
 
 
 
 
 
 
237
 
238
  Pros of systematic human evaluations, especially with paid annotators, are that you're **getting high quality and private data** adapted to your use case (especially if you rely on in house annotators), which are mostly **explainable** (scores obtained by the models will be explainable by the humans who gave them).
239
  However, it's more costly (especially as you'll most likely need rounds of annotations to adapt your guidelines) and does not scale well.
 
247
  it's hard to enforce a consistent grading from many community members using broad guidelines, especially since annotators preferences tend to be [culturally bound](https://arxiv.org/abs/2404.16019v1). One can hope that these effect is smoothed over by the sheer scale of the votes, through a "wisdom of the crowd" effect (see Galton's wikipedia page).
248
  </Sidenote>
249
 
250
+ Overall, however, human evaluation has a number of well-known biases:
251
+ - **First impressions bias**: Human evaluators tend to estimate the quality of answers [based on first impressions](https://arxiv.org/abs/2309.16349), instead of actual factuality or faithfulness.
252
+ - **Tone bias**: Crowdsourced annotators are notably very sensitive to tone, and underestimate the number of factual or logical errors in an assertive answer. In other terms, if a model says wrong things in a confident tone, human evaluators are much less likely to notice it, which could skew ratings towards the more assertive models. (Expert annotators are less likely to fall prey to these biases.)
253
+ - **Self-preference bias**: Humans are [most likely to prefer answers which appeal to their views or align with their opinions or errors](https://arxiv.org/abs/2310.13548), rather than answers which are factually correct.
254
+ - **Identity bias**: People with different identities tend to have different values, and rate model answers very differently (for example on [toxicity](https://arxiv.org/abs/2205.00501))
255
 
256
  ### With judge models
257
  Judge models are simply **neural network used to evaluate the output of other neural networks**. In most cases, they evaluate text generations.
 
261
  Model as judges allow to score text on complex and nuanced properties.
262
  For example, an exact match between a prediction and reference can allow you to test if a model predicted the correct fact or number, but assessing more open-ended empirical capabilities (like fluency, poetry quality, or faithfulness to an input) requires more complex evaluators.
263
 
 
 
264
  They are used on 3 main tasks:
265
  - *Scoring a model generation*, on a provided scale, to assess a property of the text (fluency, toxicity, coherence, persuasiveness, etc).
266
  - *Pairwise scoring*: comparing a pair model outputs to pick the best text with respect to a given property
267
  - *Computing the similarity* between a model output and a reference
268
 
269
+ <Sidenote> In this document, I'll focus on the LLMs + prompt approach for now, but you should definitely check out how classifier judges work, as I think it can be fairly robust and well adapted to a number of use cases, and the recently introduced and promising reward model as judge approach (introduced in [this tech report](https://research.nvidia.com/publication/2024-06_nemotron-4-340b), and on which we have a small page [here](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/model-as-a-judge/what-about-reward-models.md)) </Sidenote>
270
 
271
  #### Pros and cons of using judge-LLMs
272
  People in favor of judge LLMs have been claiming they provide better:
273
  - **Objectivity** when compared to humans: They automate empirical judgments in an objective and reproducible manner (theoretically - in my opinion, they add more subtle bias than they are worth)
274
  - **Scale and reproducibility**: They are more scalable than human annotators, which allows to reproduce scoring on large amounts of data (if you control for temperature).
275
  - **Cost**: They are cheap to instantiate, as they don't require to train a new model, and can just rely on good prompting and an existing high quality LLM. They are also cheaper than paying actual human annotators (capitalism...).
 
276
 
277
+ In my opinion, using LLM judges correctly is extremely tricky, and it's **easy to be deceived for critical use cases**:
278
+ - LLM as judges seem objective, but they have many **hidden biases** that can be harder to detect than the ones in humans, since we're not as actively looking for them (see below). Besides, there are ways to reduce human bias by designing survey questions in specific and statistically robust ways (which has been studied in sociology for about a century), where LLM-prompting is not as robust yet. Using LLMs to evaluate LLMs has been compared to creating an echo-chamber effect, by reinforcing biases subtly.
279
  - They are indeed scalable, but contribute to creating massive amounts of data which themselves need to be examined to ensure their quality (for example, you can improve the quality of LLM-judges by asking them to generate a thinking trace, or reasoning around their data, which makes even more new artificial data to analyse)
280
  - They are indeed cheap to instantiate, but paying actual expert human annotators is likely to give you qualitatively better results for your specific use cases.
281
 
282
+ This section is therefore a bit long, because you need to be well aware of the limitations of using models as judges: a lot of people are blindly jumping into using them because they seem easier than actually working with humans or designing new metrics, but then end up with uninterpretable data with tricky biases to extract.
 
 
 
 
 
 
 
 
 
283
 
284
+ <Note title="Getting started with an LLM judge">
285
+ If you want to give it a go, I suggest first reading this [very good guide](https://huggingface.co/learn/cookbook/en/llm_judge) on how to set up your first LLM as judge!
286
 
 
287
  You can also try the [distilabel](https://distilabel.argilla.io/latest/) library, which allows you to generate synthetic data and update it using LLMs. They have a nice [tutorial](https://distilabel.argilla.io/latest/sections/pipeline_samples/papers/ultrafeedback/) applying the methodology of the [Ultrafeedback paper](https://arxiv.org/abs/2310.01377) as well as a [tutorial on benchmarking](https://distilabel.argilla.io/latest/sections/pipeline_samples/examples/benchmarking_with_distilabel/) implementing the Arena Hard benchmark.
288
+ </Note>
289
 
290
  #### Getting a Judge-Model
291
 
292
+ When using an existing LLM, you can go for [generalist, high-capability models](https://arxiv.org/abs/2306.05685v4), [small specialist models](https://arxiv.org/abs/2405.01535) trained specifically to discriminate from preference data, or train your own.
293
 
294
  **Using a generalist LLM**
295
 
 
317
 
318
  You can also make the choice to use tiny specialized LLM judges. With often a couple billion parameters, they can run locally on most recent consumer hardware, while being trained from scratch or fine-tuned using instruction data. You often need to follow their specific prompt formats.
319
 
320
+ Some existing models as of 2024 were Flow-Judge-v0.1 ([weights](https://huggingface.co/collections/flowaicom/flow-judge-v01-66e6af5fc3b3a128bde07dec)), 3.8B parameters, a Phi-3.5-mini-instruct fine-tuned on a synthetic preference dataset; Prometheus ([weights](https://huggingface.co/prometheus-eval/prometheus-13b-v1.0), [paper](https://arxiv.org/abs/2310.08491)), 13B parameters, a model trained from scratch on a synthetic preference dataset; and JudgeLM ([paper](https://arxiv.org/abs/2310.17631)), 7B to 33B parameters, models trained from scratch on synthetic preference datasets generated with a variety of models. Newer alternatives surely exist!
 
 
 
321
 
322
  **Training your own**
323
+ You can also make the choice to train or fine-tune your own LLM-as-judge. (I would avoid doing this, unless you are working on a very niche domain).
324
 
325
+ If you go in that direction, you'll first need to gather preference data for your task of interest, which can come
326
  - From existing [human preference datasets](https://www.kaggle.com/competitions/lmsys-chatbot-arena)
327
  - From model generated preference data (which you can generate following the above tiny-model judges papers data sections, or get directly, for example from the Prometheus [preference](https://huggingface.co/datasets/prometheus-eval/Preference-Collection) and [feedback](https://huggingface.co/datasets/prometheus-eval/Feedback-Collection) collections).
328
 
329
+ Then you need to decide whether to start from a small model to train from scratch, or from an existing model, which you can distill into a new smaller model or quantize, then fine-tune (using PEFT or adapter weights if the model is big and your training compute is low) on the above data.
330
+
331
+ <Sidenote> Apparently [starting from a reward model works better than from an instruct model](https://x.com/dk21/status/1826292289930674590) </Sidenote>
 
 
332
 
333
  #### Designing your evaluation prompt
334
 
335
  Once you've selected your model, you need to define what is the best possible prompt for your task.
336
 
337
+ <Note title="Prompt design guidelines" emoji="📝" variant="info">
338
+ Provide a clear description of the task at hand:
339
+ - `Your task is to do X`.
340
+ - `You will be provided with Y`.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
341
 
342
+ Provide clear instructions on the evaluation criteria, including a detailed scoring system if needed:
343
+ - `You should evaluate property Z on a scale of 1 - 5, where 1 means ...`
344
+ - `You should evaluate if property Z is present in the sample Y. Property Z is present if ...`
345
+
346
+ Provide some additional "reasoning" evaluation steps:
347
+ - `To judge this task, you must first make sure to read sample Y carefully to identify ..., then ...`
348
+
349
+ Specify the desired output format (adding fields will help consistency)
350
+ - `Your answer should be provided in JSON, with the following format {"Score": Your score, "Reasoning": The reasoning which led you to this score}`
351
  </Note>
352
 
353
  You can and should take inspiration from [MixEval](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended/mix_eval/judge_prompts.pyy) or [MTBench](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended/mt_bench/judge_prompt_templates.py) prompt templates.
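To make this concrete, here is one possible way of assembling these elements into a single template (the task, rubric and JSON fields are placeholders to adapt to your own use case):

```python
JUDGE_PROMPT = """Your task is to evaluate the fluency of a model answer.
You will be provided with a user question and the model's answer.

You should evaluate fluency on a scale of 1 - 5, where:
1 means the answer is unreadable, 3 means it is understandable but awkward,
and 5 means it is natural and fluent.

To judge this task, you must first read the answer carefully to identify any
grammatical errors or awkward phrasing, then decide on the score.

Question:
{question}

Answer:
{answer}

Your answer should be provided in JSON, with the following format:
{{"Reasoning": "the reasoning which led you to this score", "Score": your score}}"""

prompt = JUDGE_PROMPT.format(question="...", answer="...")
```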
354
 
355
+ <Note title="To remember when doing model as judge" emoji="⚠️" variant="warning">
356
+ Pairwise comparison [correlates better with human preference](https://arxiv.org/abs/2403.16950) than scoring, and is more robust generally.
357
+
358
+ If you really want a score, use an integer scale and make sure you provide a detailed explanation for what [each score represents](https://x.com/seungonekim/status/1749289437165769177), or an additive prompt (`provide 1 point for this characteristic of the answer, 1 additional point if ...` etc)
359
+
360
+ Using one prompt per capability to score tends to give better and more robust results.
361
+ </Note>
362
 
363
  You can also improve accuracy using the following, possibly more costly, techniques:
364
  - **Few shot examples**: like in many other tasks, if you provide examples it can help its reasoning. However, this adds to your context length.
 
370
  - You can also experiment with using one model with variations on temperature
371
  - Surprisingly, the community has found that adding stakes to the prompts (`answer correctly and you'll get a kitten`) can increase correctness. Your mileage may vary on this one, adapt to your needs.
372
 
373
+ If you are working on critical tasks (medical domain for example), make sure to use methodologies transferred from the humanities: 1) compute inter-annotator agreement metrics to make sure your evaluators are as unbiased as possible, and 2) use proper survey design methodology when creating your scoring grid to mitigate bias. However, most people don't really want a reproducible and high quality unbiased eval, and will be happy with quick and dirty evaluation through OK-ish prompts. (Which is an OK situation to be in! Just depends on the consequences attached).
374
 
375
  #### Evaluating your evaluator
376
 
377
  Before using a judge-LLM in production or at scale, you want to evaluate its quality for your task, to make sure its scores are actually relevant and useful for you.
378
 
 
379
  <Note>
380
+ This will be easier to do if it predicts binary outputs, because you'll be able to use interpretable classification metrics (accuracy/recall/precision); a minimal sketch follows this note. If it predicts scores on a scale, it will be much harder to estimate the quality of the correlation with a reference: models are notoriously bad at predicting on a scale.
381
  </Note>
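For binary judge outputs, a minimal sketch of such a comparison against a small human-labeled reference set could look like this (the labels are toy placeholders):

```python
# Comparing binary judge verdicts to a human-labeled reference set.
from sklearn.metrics import accuracy_score, precision_score, recall_score

human = [1, 0, 1, 1, 0, 1, 0, 0]   # toy reference annotations (1 = answer judged correct by a human)
judge = [1, 0, 0, 1, 0, 1, 1, 0]   # verdicts from your LLM judge on the same samples

print("accuracy :", accuracy_score(human, judge))
print("precision:", precision_score(human, judge))
print("recall   :", recall_score(human, judge))
```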
382
 
383
  So, once you have selected your model judge and its prompt, you'll need to do the following.
 
413
  **Mitigating well-known biases of LLMs as judges**
414
 
415
  <Note title="Known LLM judge biases and mitigations" emoji="⚠️" variant="warning">
416
+ **Lack of internal consistency**:
417
+
418
+ A judge might give you different judgments if you prompt it several times (if the temperature is not 0)
419
+ ➡️ You can mitigate this by doing self-consistency prompting of your judge, prompting it multiple times and keeping the majority output
420
+
421
+ **Self-preference**
422
+
423
+ Models tend to [favor their own outputs](https://arxiv.org/abs/2404.13076) when scoring answers
424
+ ➡️ You can mitigate this by using a jury of several different models
425
+
426
+ **Blindness to input perturbation**
427
+
428
+ Models are bad at identifying [perturbed input](https://arxiv.org/abs/2406.13439) and, tangentially, [bad at providing consistent score ranges](https://twitter.com/aparnadhinak/status/1748368364395721128) (extended experiments on this [here](https://github.com/LeonEricsson/llmjudge/blob/main/README.md)). For example, when asked to grade the quality of texts to which noise has been added in consistent increments, the predicted grades do not reflect that scale.
429
+
430
+ Mitigations:
431
+ ➡️ asking the model to explain its reasoning [before providing a score](https://twitter.com/seungonekim/status/1749289437165769177)
432
+ ➡️ or providing a coherent grading scale in the prompt.
433
+
434
+ **Position-bias**.
435
+ Models tend to [favor specific answer positions](https://arxiv.org/abs/2306.05685). For example, when presented with pairwise comparisons, Claude and GPT-3.5 tend to quite systematically prefer either the first or the second choice. Mitigations:
436
+ ➡️ switching answer positions randomly (a minimal sketch of this mitigation follows this note)
437
+ ➡️ computing the log-probabilities of all possible choices to get a normalized answer
438
+
439
+ **Verbosity-bias** (or length-bias)
440
+ Models tend to like more verbose answers
441
+ ➡️ You can mitigate this by [accounting for the difference in answer length](https://arxiv.org/abs/2404.04475)
442
+
443
+ **Debatable consistency [with human answers](https://arxiv.org/abs/2308.15812)**: judge verdicts only partially agree with human judgments, so treat them as a noisy proxy rather than as ground truth.
444
+ <Sidenote> However, it's also [debatable if non-expert humans are a good baseline for absolutely all evaluations](https://arxiv.org/abs/2202.06935). For some specific domains (medical, legal, mathematics, etc), relying on non-expert human annotators is as bad a baseline as using an LLM directly.</Sidenote>
445
+
446
+ **Format bias**
447
+ Models tend to fail to evaluate accurately if the prompt format [is too far away](https://arxiv.org/abs/2310.17631) from what they have been trained with. For example, a model trained to do pairwise comparison with an added reference answer will fail if said answer is not provided, and failures will also occur the other way around.
448
+ ➡️ You can mitigate this by paying attention to the training prompt format (if the model was instruction tuned) and ensuring you follow it.
449
  </Note>
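As an illustration of the position-bias mitigation above, here is a minimal sketch that asks the judge for the same pairwise comparison in both answer orders and only keeps consistent verdicts. `query_judge` and the prompt template are hypothetical placeholders for your own inference call and prompt:

```python
from typing import Callable

PAIRWISE_TEMPLATE = (
    "Question: {question}\nAnswer A: {a}\nAnswer B: {b}\n"
    "Which answer is better? Reply with A, B or tie."
)

def debiased_pairwise(question: str, ans_1: str, ans_2: str, query_judge: Callable[[str], str]) -> str:
    """Ask the same comparison in both orders; keep the verdict only if it is order-consistent."""
    first = query_judge(PAIRWISE_TEMPLATE.format(question=question, a=ans_1, b=ans_2))
    second = query_judge(PAIRWISE_TEMPLATE.format(question=question, a=ans_2, b=ans_1))
    # Map the second verdict back to the original answer order
    swapped = {"A": "B", "B": "A", "tie": "tie"}.get(second)
    if swapped is not None and first == swapped:
        return {"A": "answer_1", "B": "answer_2", "tie": "tie"}[first]
    return "inconsistent"  # position bias (or unparsable verdict): treat as a tie or re-query
```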
450
 
451
  **Picking correct tasks for an LLM judge**
app/src/content/chapters/general-knowledge/2025-evaluations-for-useful-models.mdx CHANGED
@@ -11,10 +11,12 @@ You can evaluate **specific capabilities** on their own - it's usually quite int
11
 
12
  Reasoning and commonsense datasets are often “historic” datasets, built in the age of BERT and embedding models, before the LLM craze. They were quite challenging at the time (especially because they were often adversarially built against the models of the time), but now they are 1) too easy, 2) contaminated/saturated, and should only be used for ablations or as pretraining evaluations. The bigger datasets also sometimes contain errors or low-quality questions, as they tend to have been built through Amazon Mechanical Turk in order to scale up fast and at low cost (what is now done by using LLMs to generate evaluation questions).
13
 
14
- [ARC]([https://arxiv.org/abs/1803.05457](https://arxiv.org/abs/1803.05457)) (2018) (not to confuse with ARC-AGI) is a grade school science MCQA dataset built from human tests. The choices were selected adversarially for word co-occurence systems at the time. It has several subsets, the higher quality `challenge` one is still in use today for pretraining. [WinoGrande]([https://arxiv.org/pdf/1907.10641](https://arxiv.org/pdf/1907.10641)) (2019) is a crowdsourced (mechanical turk + validation) pronoun resolution/fill in the blank dataset, using adversarial pairs of items to trick models. Both these datasets have been quite hard for models until 2022 to 2023.
15
 
16
  A number of historic datasets look specifically at reasoning requiring some sort of commonsense understanding and grounding. [HellaSwag](https://arxiv.org/abs/1905.07830) (2019) requires LLMs to select the correct next sentence from a list of adversarial choices, where the text comes from captions in ActivityNet and from tutorials in WikiHow (it's the follow-up of a dataset called SWAG). As most sentences come from tutorials or descriptions of activities, they often require physical commonsense grounding to solve. In the same vein, [CommonsenseQA](https://arxiv.org/abs/1811.00937) (2018) is a dataset of commonsense MCQA built from ConceptNet - annotators write questions, then use conceptually close distractors as options. [PIQA](https://arxiv.org/abs/1911.11641) (2019) looks specifically at physical commonsense questions (created from examples from [Instructables.com](http://Instructables.com), again with adversarial choices from semantic perturbations or rewriting). [OpenBookQA](https://arxiv.org/abs/1809.02789) (2018) provides open-book facts to help answer MCQA questions - however, these questions also require latent commonsense knowledge.
17
 
 
 
18
  #### Knowledge
19
  The main evaluation dataset for knowledge has been [MMLU](https://arxiv.org/abs/2009.03300) (2020). It reached saturation/contamination, and after more in-depth examination, a number of issues were identified: incomplete questions referring to absent documents, incorrect ground truths, ambiguous questions, and blatant US-centrism in the topics chosen. It was therefore cleaned in [MMLU-Redux](https://arxiv.org/abs/2406.04127) (2024), extended with more complex questions and more answer options in [**MMLU-Pro**](https://arxiv.org/abs/2406.01574) (2024, the main replacement used by the community at the moment), and translated/annotated for cultural bias in [Global-MMLU](https://arxiv.org/abs/2412.03304) (2024). These are used mostly for pretraining evaluations and ablations.
20
 
 
11
 
12
  Reasoning and commonsense datasets are often “historic” datasets, built in the age of BERT and embedding models, before the LLM craze. They were quite challenging at the time (especially because they were often adversarially built against the models of the time), but now they are 1) too easy, 2) contaminated/saturated, and should only be used for ablations or as pretraining evaluations. The bigger datasets also sometimes contain errors or low-quality questions, as they tend to have been built through Amazon Mechanical Turk in order to scale up fast and at low cost (what is now done by using LLMs to generate evaluation questions).
13
 
14
+ [ARC](https://arxiv.org/abs/1803.05457) (2018) (not to be confused with ARC-AGI) is a grade-school science MCQA dataset built from human tests. The choices were selected adversarially against the word co-occurrence systems of the time. It has several subsets; the higher-quality `challenge` one is still in use today for pretraining evaluation. [WinoGrande](https://arxiv.org/pdf/1907.10641) (2019) is a crowdsourced (Mechanical Turk + validation) pronoun resolution/fill-in-the-blank dataset, using adversarial pairs of items to trick models. Both of these datasets remained quite hard for models until 2022-2023.
15
 
16
  A number of historic datasets look specifically at reasoning requiring some sort of commonsense understanding and grounding. [HellaSwag](https://arxiv.org/abs/1905.07830) (2019) requires LLMs to select the correct next sentence from a list of adversarial choices, where the text comes from captions in ActivityNet and from tutorials in WikiHow (it's the follow-up of a dataset called SWAG). As most sentences come from tutorials or descriptions of activities, they often require physical commonsense grounding to solve. In the same vein, [CommonsenseQA](https://arxiv.org/abs/1811.00937) (2018) is a dataset of commonsense MCQA built from ConceptNet - annotators write questions, then use conceptually close distractors as options. [PIQA](https://arxiv.org/abs/1911.11641) (2019) looks specifically at physical commonsense questions (created from examples from [Instructables.com](http://Instructables.com), again with adversarial choices from semantic perturbations or rewriting). [OpenBookQA](https://arxiv.org/abs/1809.02789) (2018) provides open-book facts to help answer MCQA questions - however, these questions also require latent commonsense knowledge.
17
 
18
+ A more recent cool reasoning dataset is [Zebra Logic](https://arxiv.org/abs/2502.01100), which uses logic puzzles to test model reasoning capabilities. Its generation method allows for an infinite supply of puzzles, so there is little risk of contamination.
19
+
20
  #### Knowledge
21
  The main evaluation dataset for knowledge has been [MMLU](https://arxiv.org/abs/2009.03300) (2020). It reached saturation/contamination, and after more in-depth examination, a number of issues were identified: incomplete questions referring to absent documents, incorrect ground truths, ambiguous questions, and blatant US-centrism in the topics chosen. It was therefore cleaned in [MMLU-Redux](https://arxiv.org/abs/2406.04127) (2024), extended with more complex questions and more answer options in [**MMLU-Pro**](https://arxiv.org/abs/2406.01574) (2024, the main replacement used by the community at the moment), and translated/annotated for cultural bias in [Global-MMLU](https://arxiv.org/abs/2412.03304) (2024). These are used mostly for pretraining evaluations and ablations.
22