Clémentine committed on
Commit
c768f1d
·
1 Parent(s): 8537f75
app/src/content/article.mdx CHANGED
@@ -99,10 +99,6 @@ Best (but rarest) metrics are functional or based on rule based verifiers (thoug
99
  <DesigningAutomaticEvaluation />
100
 
101
 
102
- https://x.com/Kangwook_Lee/status/1993438649963164121
103
-
104
-
105
-
106
 
107
  <TroubleshootingInference />
108
 
 
99
  <DesigningAutomaticEvaluation />
100
 
101
 
 
 
 
 
102
 
103
  <TroubleshootingInference />
104
 
app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx CHANGED
@@ -8,18 +8,42 @@ import UsingHumanAnnotators from "../human-evaluation/using-human-annotators.mdx
8
 
9
  ### Dataset
10
 
11
- #### Using existing data
12
- - Use existing datasets, and assemble them differently
13
- You can aggregate existing data from different sources, evaluating a relevant capability for your task. A number of evaluation datasets are for example constructed from aggregating human evaluation datasets (such as MATH, LSAT, etc). In this case, follow the steps above.
 
 
 
 
 
 
 
 
14
 
15
  #### Creating a dataset manually
16
 
17
  <UsingHumanAnnotators />
18
 
19
  #### Creating a dataset synthetically
 
 
 
 
 
 
 
20
 
21
- - **Using synthetic data from models**: On this, you can check the very cool [Cosmopedia](https://huggingface.co/blog/cosmopedia) blog by cool HF colleagues! It's mostly studying how to create a synthetic training dataset, but similar techniques can be used for evaluation. Make sure to manually check/filter/inspect your dataset afterwards (following the above steps).
22
- - **Using rule-based techniques**: If your task allows, this is a very good way to get a virtually infinite supply of samples and avoid contamination! For some examples, you can look at [NPHardEval](https://arxiv.org/abs/2312.14890), [DyVal](https://arxiv.org/abs/2309.17167), [MuSR](https://arxiv.org/abs/2310.16049), [BabiQA](https://arxiv.org/abs/1502.05698), etc.
 
 
 
 
 
 
 
 
 
23
 
24
  #### Choosing a prompt
25
  The prompt is going to define:
@@ -98,14 +122,46 @@ However, nowadays most evaluations are generative: using generations (QA, questi
98
 
99
  If you are looking at **log-probabilities**, your metrics are going to be easy: you'll likely want to look at a variant of accuracy (how often the most likely choice is the best choice). It's important to normalize it by length (either character, token, or pmi). You could also look at perplexity, recall, or f1 score.
100
 
101
- If you're looking at generative evaluations, you'll need to decide if you compare generations as they are, or first normalize them with something.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
102
 
103
- <Note title="Normalization">
 
104
 
105
- Normalizations can easily [be unfair if not designed well](https://huggingface.co/blog/open-llm-leaderboard-drop), but overall they still provide signal at the task level.
 
 
 
 
 
 
 
 
 
 
 
 
 
106
 
107
- They are vital for specific tasks, such as math evaluations, where you want to extract your specific result from formatted outputs.
108
 
 
 
 
109
  In the below table, we make a list of some issues we saw happening when extracting predictions from model outputs using SymPy naively for the MATH dataset, and how Math-Verify, a specific math parser, solved these.
110
 
111
  | 📄 Example | ❗️Issue | ✅ Math-Verify | 🛑 Naive Approach |
@@ -116,7 +172,6 @@ In the below table, we make a list of some issues we saw happening when extracti
116
  | \(23\) | Failed extraction due to latex borders | `23` | None |
117
  | \((- \infty, -14) \cup (-3, \infty)\). | Failed extraction due to interval | Union(Interval.open(-oo, -14), Interval.open(-3, oo)) | None |
118
  | 100\% | Failed extraction due to invalid symbol | `1` | None |
119
- | \begin{pmatrix}\frac{1}{50}&\frac{7}{50}\\frac{7}{50}&\frac{49}{50}\end{pmatrix} | Failed extraction due to Matrix | Matrix([[1/50, 7/50], [7/50, 49/50]]) | None |
120
  | 1/3 == 0.333333 | No rounding support | True | False |
121
  | sqrt(1/2)*7 == sqrt(0.5)*7 | No numerical evaluation support | True | False |
122
 
@@ -124,78 +179,61 @@ In the below table, we make a list of some issues we saw happening when extracti
124
  Look at [this blog](https://huggingface.co/blog/math_verify_leaderboard) for more details!
125
  </Sidenote>
126
 
127
- They will also be important if you want to evaluate with added mechanisms for accuracy, such as Chain of Thought, as you'll need to remove the reasoning trace from the actual result
128
- </Note>
129
 
130
- Then, you'll need to select what to use to score your prediction, and this is where it gets trickyyy, so let's jump to the next chapter specifically on this!
131
 
 
132
 
133
- ## The hardest part of evaluation: Scoring free form text
 
134
 
135
- ### Automatically
 
 
 
 
136
 
137
- #### Metrics
138
- Most ways to automatically compare a string of text to a reference are match based.
139
 
140
- The easiest but least flexible match based metrics are exact matches of token sequences (with or without normalization, of full sentences or prefix only, etc).
141
- The translation and summarisation fields have also introduced automatic metrics which compare n-grams in sequences, like BLEU (& it's variants, like GLEU, SacreBLEU, etc), METEOR, ROUGE, chrF.
142
- They also introduced static model based metrics, usually based on embedding distances of sequences for similarity, like BLEURT, MAUVE, COMET.
143
-
144
- Once you have an accuracy score per sample, you can aggregate it across your whole set in several ways. In general, people average their results, but you can do more complex things depending on your needs.
145
- Look at the F1 score, precision, recall, or MCC, if your score is binary.
146
- If your score is continuous, you can want to use a mean squared error, mean absolute error, look at the R2 or at correlation coefficients (Pearson or Spearman).
147
 
 
 
148
 
149
- More generally, when picking your metric, you need to keep in mind what your task is really about. For some domains (ex: medical, chatbots with public interaction), you don't want to measure the average performance, but need a way to evaluate the **worst performance** you'll get (on medical quality of output, on toxicity, etc).
150
- <Sidenote>
151
- To go further, take a look at this [blog](https://ehudreiter.com/2024/07/10/challenges-in-evaluating-llms/).
152
- </Sidenote>
153
 
154
- <Note title="Pros and cons of using automated metrics">
155
- Automated benchmarks have the following advantages:
156
- - **Consistency and reproducibility**: You can run the same automated benchmark 10 times on the same model and you'll get the same results (baring variations in hardware or inherent model randomness). This means that you can easily create fair rankings of models for a given task.
157
- - **Scale at limited cost**: They are one of the cheapest way to evaluate models at the moment.
158
- - **Understandability**: Most automated metrics are very understandable.
159
- *Eg: an exact match will tell you if the generated text matches perfectly with the reference, and an accuracy score will tell you in how many cases the selected choice was the correct one (this will be a bit less the case for metrics such as `BLEU` or `ROUGE` for example).*
160
 
161
- However, they also present the following limitations:
162
- - **Reduced use on more complex tasks**: Automated benchmarks are working well for tasks where performance is easy to define and assess (for example, classification). More complex capabilities, on the other hand, are harder to decompose into well-defined and precise tasks.
163
- *Eg: what does "good at math" mean? Is it being good at arithmetic? - at logic? - able to reason on new mathematical concepts?*
164
- This led to the use of more **generalist** evaluations, which no longer decompose capabilities in sub-tasks, but assuming that general performance will be a **good proxy** for what we aim to measure.
165
- - **Contamination**: Once a dataset is published publicly in plain text, it will end up in model training datasets. This means that you have no guarantee when scoring a model that it has not parsed the evaluation data before.
166
- </Note>
167
 
168
- #### Using functional testing
169
- In the field of code, you want to evaluate generated programs not only on their semantics, but on their actual function. A good way to do so is therefore to check if code generated to follow a prompt passes correctly a suite of unit-tests designed to fit the task.
170
 
171
- This functionality approach is extremely promising, as it
172
- - allows to generate test cases more easily (in many cases, you can generate rule-based test cases)
173
- - therefore reducing overfitting
174
- - tests models on specific active capabilities
175
 
176
- It's however an approach which requires creativity to be translated to text!
 
 
177
 
178
- A good example of this are IFEval and IFBench, an evaluation benchmark which tests if models can follow instructions. It works by creating a number of formatting instructions (*Add this number of bullet points. Capitalize only one sentence.* etc), and strictly testing if the format is followed. More work is clearly needed to extend this idea to other features of text to analyze!
179
 
180
  ### With humans
181
  Human evaluation is simply asking humans to score predictions.
182
 
183
- Human evaluation is very interesting, because it's **flexibility** (if you define clearly enough what you are evaluating, you can get scores for about anything!), **uncontaminated** (If you ask humans to write new questions to test your system, they should not be present in your training data (hopefully)), and correlates well with human preference for obvious reasons.
184
 
185
  <Sidenote>
186
- However, when doing evaluation with humans, you need to make sure your annotators are diverse enough that your results generalizes.*
187
  </Sidenote>
188
 
189
- However, it also present a number of biases:
190
- - **First impressions bias**: Human evaluators tend to estimate the quality of answers [based on first impressions](https://arxiv.org/abs/2309.16349), instead of actual factuality or faithfulness.
191
- - **Tone bias**: Crowdsourced annotators are notably very sensitive to tone, and underestimate the number of factual or logical errors in an assertive answer. In other terms, if a model says wrong things in a confident tone, human evaluators are much less likely to notice it, which could skew ratings towards the more assertive models. (Expert annotators are less likely to fall prey to these biases.)
192
- - **Self-preference bias**: Humans are [most likely to prefer answers which appeal to their views or align with their opinions or errors](https://arxiv.org/abs/2310.13548), rather than answers which are factually correct.
193
- - **Identity bias**: People with different identities tend to have different values, and rate model answers very differently (for example on [toxicity](https://arxiv.org/abs/2205.00501))
194
-
195
- There are 3 main ways to do evaluation with paid annotators:
196
- - If **you don't have a dataset**, but want to explore a set of capabilities, you provide humans with a task and scoring guidelines (e.g *Try to make both these model output toxic language; a model gets 0 if it was toxic, 1 if it was not.*), and access to one (or several) model(s) that they can interact with, then ask them to provide their scores and reasoning.
197
- - If **you already have a dataset** (eg: a set of *prompts that you want your model to never answer*, for example for safety purposes), you preprompt your model with them, and provide the prompt, output and scoring guidelines to humans.
198
- - If **you already have a dataset and scores**, you can ask humans to review your evaluation method by doing [error annotation](https://ehudreiter.com/2022/06/01/error-annotations-to-evaluate/) (*it can also be used as a scoring system in the above category*). It's a very important step of testing new evaluation system, but it technically falls under evaluating an evaluation, so it's slightly out of scope here.
199
 
200
  Pros of systematic human evaluations, especially with paid annotators, are that you're **getting high quality and private data** adapted to your use case (especially if you rely on in house annotators), which are mostly **explainable** (scores obtained by the models will be explainable by the humans who gave them).
201
  However, it's more costly (especially as you'll most likely need rounds of annotations to adapt your guidelines) and does not scale well.
@@ -209,7 +247,11 @@ Pros of casual human evaluations are that they are cheap, scale better and allow
209
  it's hard to enforce a consistent grading from many community members using broad guidelines, especially since annotators preferences tend to be [culturally bound](https://arxiv.org/abs/2404.16019v1). One can hope that these effect is smoothed over by the sheer scale of the votes, through a "wisdom of the crowd" effect (see Galton's wikipedia page).
210
  </Sidenote>
211
 
212
-
 
 
 
 
213
 
214
  ### With judge models
215
  Judge models are simply **neural network used to evaluate the output of other neural networks**. In most cases, they evaluate text generations.
@@ -219,46 +261,35 @@ Judge models range from small specialized classifiers (think "spam filter", but
219
  Model as judges allow to score text on complex and nuanced properties.
220
  For example, an exact match between a prediction and reference can allow you to test if a model predicted the correct fact or number, but assessing more open-ended empirical capabilities (like fluency, poetry quality, or faithfulness to an input) requires more complex evaluators.
221
 
222
- That's where models as judges come into play.
223
-
224
  They are used on 3 main tasks:
225
  - *Scoring a model generation*, on a provided scale, to assess a property of the text (fluency, toxicity, coherence, persuasiveness, etc).
226
  - *Pairwise scoring*: comparing a pair model outputs to pick the best text with respect to a given property
227
  - *Computing the similarity* between a model output and a reference
228
 
229
- *Note: In this document, I'll focus on the LLMs + prompt approach for now, but you should definitely check out how classifier judges work, as I think it can be fairly robust and well adapted to a number of use cases, and the recently introduced and promising reward model as judge approach (introduced in [this tech report](https://research.nvidia.com/publication/2024-06_nemotron-4-340b), and on which we have a small page [here](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/model-as-a-judge/what-about-reward-models.md))*
230
 
231
  #### Pros and cons of using judge-LLMs
232
  People in favor of judge LLMs have been claiming they provide better:
233
  - **Objectivity** when compared to humans: They automate empirical judgments in an objective and reproducible manner (theoretically - in my opinion, they add more subtle bias than they are worth)
234
  - **Scale and reproducibility**: They are more scalable than human annotators, which allows to reproduce scoring on large amounts of data (if you control for temperature).
235
  - **Cost**: They are cheap to instantiate, as they don't require to train a new model, and can just rely on good prompting and an existing high quality LLM. They are also cheaper than paying actual human annotators (capitalism...).
236
- - **Alignment with human judgments**: They are somehow correlated with human judgments.
237
 
238
- In my opinion, using LLM judges correctly is extremely tricky, and it's easy to be deceived for critical use cases:
239
- - LLM as judges seem objective, but they have many **hidden biases** that can be harder to detect than the ones in humans, since we're not as actively looking for them (see [model-as-a-judge/Tips and tricks]). Besides, there are ways to reduce human bias by designing survey questions in specific and statistically robust ways (which has been studied in sociology for about a century), where LLM-prompting is not as robust yet. Using LLMs to evaluate LLMs has been compared to creating an echo-chamber effect, by reinforcing biases subtly.
240
  - They are indeed scalable, but contribute to creating massive amounts of data which themselves need to be examined to ensure their quality (for example, you can improve the quality of LLM-judges by asking them to generate a thinking trace, or reasoning around their data, which makes even more new artificial data to analyse)
241
  - They are indeed cheap to instantiate, but paying actual expert human annotators is likely to give you qualitatively better results for your specific use cases.
242
 
243
- <Note title="Critical limitations of LLM judges" emoji="⚠️" variant="warning">
244
-
245
- Using LLM judges is extremely tricky:
246
- - **Hidden biases**: Harder to detect than human biases; creates echo-chamber effects
247
- - **Data overload**: Generates massive synthetic data needing quality examination
248
- - **False objectivity**: Seems objective but reinforces subtle biases
249
- - **Expert humans better**: For critical use cases, expert annotators provide higher quality
250
-
251
- See [Tips and tricks](./tips-and-tricks) for bias mitigation strategies.
252
- </Note>
253
 
254
- This section is a bit long, because you need to be well aware of their limitations: a lot of people are blindly jumping into using model judges because they seem easier, but then end up with uninsterpretable data with tricky bias to extract.
 
255
 
256
- If you want to give it a go, I suggest first reading this [very good guide](https://huggingface.co/learn/cookbook/en/llm_judge) (⭐) by Aymeric Roucher on how to setup your first LLM as judge!
257
  You can also try the [distilabel](https://distilabel.argilla.io/latest/) library, which allows you to generate synthetic data and update it using LLMs. They have a nice [tutorial](https://distilabel.argilla.io/latest/sections/pipeline_samples/papers/ultrafeedback/) applying the methodology of the [Ultrafeedback paper](https://arxiv.org/abs/2310.01377) as well as a [tutorial on benchmarking](https://distilabel.argilla.io/latest/sections/pipeline_samples/examples/benchmarking_with_distilabel/) implementing the Arena Hard benchmark.
 
258
 
259
  #### Getting a Judge-Model
260
 
261
- When using an existing LLM, you can go for [generalist, high capability models](https://arxiv.org/abs/2306.05685v4), using [small specialist models](https://arxiv.org/abs/2405.01535) trained specifically to discriminate from preference data, or training your own.
262
 
263
  **Using a generalist LLM**
264
 
@@ -286,56 +317,48 @@ You'll find a good cost analysis of model providers [here](https://huggingface.c
286
 
287
  You can also make the choice to use tiny specialized LLM judges. With often a couple billion parameters, they can run locally on most recent consumer hardware, while being trained from scratch or fine-tuned using instruction data. You often need to follow their specific prompt formats.
288
 
289
- Some existing models:
290
- - Flow-Judge-v0.1 ([weights](https://huggingface.co/collections/flowaicom/flow-judge-v01-66e6af5fc3b3a128bde07dec)), 3.8B parameters, a Phi-3.5-mini-instruct fine-tuned on a synthetic preference dataset
291
- - Prometheus ([weights](https://huggingface.co/prometheus-eval/prometheus-13b-v1.0), [paper](https://arxiv.org/abs/2310.08491)), 13B parameters, a model trained from scratch on synthetic preference dataset. A 7B parameter [v2](https://huggingface.co/prometheus-eval/prometheus-7b-v2.0) also exists, a Mistral-7B-Instruct-v0.2 fine-tune on a bigger synthetic preference dataset, with added weight merging
292
- - JudgeLM ([paper](https://arxiv.org/abs/2310.17631)), 7B to 33B parameters, models trained from scratch on synthetic preference datasets generated with a variety of models.
293
 
294
  **Training your own**
295
- You can also make the choice to train or fine-tune your own LLM-as-judge. (I would avoid doing this, unless you are on a very niche topic).
296
 
297
- You first need to gather preference data for your task of interest, which can come
298
  - From existing [human preference datasets](https://www.kaggle.com/competitions/lmsys-chatbot-arena)
299
  - From model generated preference data (which you can generate following the above tiny-model judges papers data sections, or get directly, for example from the Prometheus [preference](https://huggingface.co/datasets/prometheus-eval/Preference-Collection) and [feedback](https://huggingface.co/datasets/prometheus-eval/Feedback-Collection) collections).
300
 
301
- Then you need to decide whether to start from a small model to train from scratch, or from an existing model, that you can
302
- - distill into a new smaller model
303
- - quantize.
304
- - then fine-tune (using peft or adapter weights if the model is big and your training compute low) using the above data
305
- - apparently [starting from a reward model works better than from an instruct model](https://x.com/dk21/status/1826292289930674590)
306
 
307
  #### Designing your evaluation prompt
308
 
309
  Once you've selected your model, you need to define what is the best possible prompt for your task.
310
 
311
- Some general guidelines I've come across online when designing the prompt itself are:
312
- - Provide a clear description of the task at hand:
313
- - `Your task is to do X`.
314
- - `You will be provided with Y`.
315
- - Provide clear instructions on the evaluation criteria, including a detailed scoring system if needed:
316
- - `You should evaluate property Z on a scale of 1 - 5, where 1 means ...`
317
- - `You should evaluate if property Z is present in the sample Y. Property Z is present if ...`
318
- - Provide some additional "reasoning" evaluation steps:
319
- - `To judge this task, you must first make sure to read sample Y carefully to identify ..., then ...`
320
- - Specify the desired output format (adding fields will help consistency)
321
- - `Your answer should be provided in JSON, with the following format {"Score": Your score, "Reasoning": The reasoning which led you to this score}`
322
-
323
- <Note title="Core prompt design principles" emoji="📝" variant="info">
324
-
325
- **Essential elements for effective judge prompts:**
326
- - **Clear task description**: Specify exactly what the judge needs to do
327
- - **Detailed criteria**: Provide explicit scoring scales with clear definitions
328
- - **Reasoning steps**: Guide the judge through the evaluation process
329
- - **Structured output**: Use JSON format for consistency and parsability
330
 
 
 
 
 
 
 
 
 
 
331
  </Note>
332
 
333
  You can and should take inspiration from [MixEval](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended/mix_eval/judge_prompts.pyy) or [MTBench](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended/mt_bench/judge_prompt_templates.py) prompt templates.
334
 
335
- Other tidbits:
336
- - Pairwise comparison [correlates better with human preference](https://arxiv.org/abs/2403.16950) than scoring, and is more robust generally.
337
- - If you really want a score, use an integer scale make sure you provide a detailed explanation for what [each score represents](https://x.com/seungonekim/status/1749289437165769177), or an additive prompt (`provide 1 point for this characteristic of the answer, 1 additional point if ...` etc)
338
- - Using one prompt per capability to score tends to give better and more robust results
 
 
 
339
 
340
  You can also improve accuracy using the following, possibly more costly, techniques:
341
  - **Few shot examples**: like in many other tasks, if you provide examples it can help its reasoning. However, this adds to your context length.
@@ -347,22 +370,14 @@ You can also improve accuracy using the following, possibly more costly, techniq
347
  - You can also experiment with using one model with variations on temperature
348
  - Surprisingly, the community has found that adding stakes to the prompts (`answer correctly and you'll get a kitten`) can increase correctness. Your mileage may vary on this one, adapt to your needs.
349
 
350
- <Note title="High-stakes evaluation requires rigor" emoji="⚠️" variant="warning">
351
-
352
- For production or critical use cases, use methodologies transferred from the humanities:
353
- - Compute inter-annotator agreement metrics
354
- - Use proper survey design methodology to mitigate bias
355
- </Note>
356
-
357
- However, most people don't really want a reproducible and high quality unbiased eval, and will be happy with quick and dirty evaluation through OK-ish prompts. (Which is an OK situation to be in! Just depends on the consequences attached).
358
 
359
  #### Evaluating your evaluator
360
 
361
  Before using a judge-LLM in production or at scale, you want to evaluate its quality for your task, to make sure its scores are actually relevant and useful for you.
362
 
363
-
364
  <Note>
365
- This will be easier to do if it predicts binary outputs, because you'll be able to interpretable classification metrics (accuracy/recall/precision). If it predicts scores on a scale, it will be much harder to estimate the quality of the correlation with a reference.*
366
  </Note>
367
 
368
  So, once you have selected your model judge and its prompt, you'll need to do the following.
@@ -398,24 +413,39 @@ You need to decide what your threshold for acceptance is. Depending on how hard
398
  **Mitigating well known biases of LLM as judges**
399
 
400
  <Note title="Known LLM judge biases and mitigations" emoji="⚠️" variant="warning">
401
- - **Lack of internal consistency**: a judge might give you different judgments if you prompt it several times (if the temperature is not 0)
402
- - You can mitigate this by doing self-consistency prompting of your judge, prompting it multiple times and keeping the majority output
403
- - **Self-preference**: they tend to [favor their own outputs](https://arxiv.org/abs/2404.13076) when scoring answers
404
- - You can mitigate this by using a jury
405
- - **Blindness to input perturbation**: models are bad at identifying [perturbated input](https://arxiv.org/abs/2406.13439) and tangentially [bad at providing consistent score ranges](https://twitter.com/aparnadhinak/status/1748368364395721128) (extended experiments on this [here](https://github.com/LeonEricsson/llmjudge/blob/main/README.md)). For example, if asked to rank text quality on text where noise has been added on a consistent scale, the grades predicted do not reflect this scale.
406
- - You can mitigate this by
407
- - asking the model to explain its reasoning [before providing a score](https://twitter.com/seungonekim/status/1749289437165769177)
408
- - providing a coherent grading scale in the prompt.
409
- - **Position-bias**: they tend to [favor specific answer positions](https://arxiv.org/abs/2306.05685). For example, when presented with pairwise comparisons, Claude and GPT3.5 tend to quite systematically prefer the first choice, or the second choice
410
- - You can mitigate this by
411
- - switching answer positions randomly
412
- - computing the log-probabilities of all possible choices to get a normalized answer
413
- - **Verbosity-bias** (or length-bias): they tend to like more verbose answers
414
- - You can mitigate this by [accounting for the answer difference in length](https://arxiv.org/abs/2404.04475)
415
- - **Debatable consistency [with human answers](https://arxiv.org/abs/2308.15812):**
416
- - However, it's also [debatable if non-expert humans are a good baseline for absolutely all evaluations](https://arxiv.org/abs/2202.06935). For some specific domains (medical, legal, mathematics, etc), relying on non-expert human annotators is as bad a baseline as using an LLM directly.
417
- - **Format bias**: they tend to fail to evaluate accurately if the prompt format [is too far away](https://arxiv.org/abs/2310.17631) from what it's been trained with. For example, a model trained to do pairwise comparison with an added reference answer will fail if said answer is not provided, and failures will also occur the other way around.
418
- - You can mitigate this by paying attention to the training prompt format (if the model was instruction tuned) and ensuring you follow it.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
419
  </Note>
420
 
421
  **Picking correct tasks for an LLM judge**
 
8
 
9
  ### Dataset
10
 
11
+ #### Using existing data
12
+ You can use existing datasets as they are and change the prompting or metrics associated with them (as has been done for older evaluations to adapt them to new prompting methods), but you can also aggregate datasets.
13
+
14
+ Dataset aggregation is a good approach when you want to evaluate a specific capability that isn't well covered by a single benchmark. Rather than starting from scratch, you can combine samples from multiple existing datasets to create a targeted evaluation suite. That's, for example, what the authors of the "Measuring AGI" paper did recently to try to create a new "AGI evaluation" dataset.
15
+
16
+ When aggregating datasets, pay attention to whether
17
+ - they contain redundant data (most mathematics datasets are rewrites or aggregations of the same initial problems)
18
+ - you need balanced representation across sources (you might not want one dataset to dominate and skew your evaluation) - this will also determine whether to aggregate scores across all samples or per subset
19
+ - formats and difficulty levels are compatible (typically, if creating a unified dataset, beware of mixing up samples requiring sampling or not).
20
+
21
+ <Sidenote>Examples: MMLU, Big-Bench (hundreds of diverse tasks), and HELM (combines multiple existing benchmarks for holistic evaluation)</Sidenote>
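A minimal sketch of what such an aggregation can look like with the `datasets` library, keeping a `source` column so you can report scores per subset; the repository ids and column names are placeholders to adapt to your own sources.

```python
from datasets import load_dataset, concatenate_datasets

SOURCES = {
    # subset name: (hub repo id, question column, answer column) - placeholders
    "math_a": ("org/math-benchmark-a", "problem", "solution"),
    "math_b": ("org/math-benchmark-b", "question", "answer"),
}

subsets = []
for name, (repo, q_col, a_col) in SOURCES.items():
    ds = load_dataset(repo, split="test")
    # Map every source to a shared schema before concatenating
    ds = ds.map(
        lambda ex: {"question": ex[q_col], "answer": ex[a_col], "source": name},
        remove_columns=ds.column_names,
    )
    subsets.append(ds)

eval_set = concatenate_datasets(subsets)
print(eval_set.to_pandas().groupby("source").size())  # check subset balance
```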
22
 
23
  #### Creating a dataset manually
24
 
25
  <UsingHumanAnnotators />
26
 
27
  #### Creating a dataset synthetically
28
+ **Using rule-based techniques**
29
+
30
+ If your task allows, this is a very good way to get a virtually infinite supply of samples and avoid contamination! For some examples, you can look at [NPHardEval](https://arxiv.org/abs/2312.14890), [DyVal](https://arxiv.org/abs/2309.17167), [MuSR](https://arxiv.org/abs/2310.16049), [BabiQA](https://arxiv.org/abs/1502.05698), [ZebraLogic](https://arxiv.org/pdf/2502.01100), IFEval, or GSMTemplate among others.
31
+
32
+ Tasks which usually fit this paradigm test mathematical, logical, or coding abilities.
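As an illustration, here is a toy, template-style generator (entirely made up, not taken from any of the benchmarks above): every call yields a fresh arithmetic problem with a known answer, so samples are effectively unlimited and can't have leaked into training data.

```python
import random

def make_sample(rng: random.Random) -> dict:
    # Template with randomized entities and numbers; the rule itself computes the answer
    name = rng.choice(["Ava", "Noah", "Lina", "Omar"])
    start = rng.randint(10, 99)
    bought = rng.randint(2, 20)
    given = rng.randint(1, 9)
    question = (
        f"{name} has {start} marbles, buys {bought} more, "
        f"then gives {given} away. How many marbles are left?"
    )
    return {"question": question, "answer": str(start + bought - given)}

rng = random.Random(42)  # fixed seed so the evaluation set is reproducible
dataset = [make_sample(rng) for _ in range(1_000)]
```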
33
+
34
+ **Creating synthetic data with models**
35
 
36
+ If you want to create synthetic data, you usually start from a number of seed documents that will act as your ground truth. These can be internal and specific to your use cases, or available on the web and of high quality (like Wikipedia, Stack Overflow, ...). You'll then likely need to chunk your data into self-contained units of meaning.
37
+
38
+ You'll then likely want a model to design questions from your data. For this, you will need to select a frontier model, and design a very good prompt asking the model to create use-case relevant questions from the provided data. It's better if you ask the model to provide the source on which it based its question.
39
+
40
+ You can also use seed prompts as examples to provide to an external model for it to write the prompt for your model to generate new questions, if you want to go full synthetic ^^
41
+
42
+ Once this is done, you can run an automatic validation step by using a model from a different family as a judge over your ground truth + questions + answers.
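To make the pipeline above concrete, here is a rough sketch of the chunking and question-generation steps. The chunking is naive, the prompt is only a starting point, and the client and model name (the `openai` package with "gpt-4o") are placeholders for whichever frontier model you actually use.

```python
from openai import OpenAI

def chunk(text: str, max_chars: int = 2000) -> list[str]:
    # Naive paragraph-based chunking; swap in something smarter if needed
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            current = ""
        current += para + "\n\n"
    if current:
        chunks.append(current)
    return chunks

QUESTION_PROMPT = """You are writing evaluation questions for {use_case}.
Based only on the source text below, write one question a domain expert could
answer, the expected answer, and quote the exact passage supporting it.

Source text:
{chunk}

Answer in JSON with keys "question", "answer" and "supporting_passage"."""

client = OpenAI()

def generate_question(chunk_text: str, use_case: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder for your frontier model of choice
        messages=[{"role": "user", "content": QUESTION_PROMPT.format(use_case=use_case, chunk=chunk_text)}],
    )
    return response.choices[0].message.content
```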
43
+
44
+ <Note title="Always make sure that you're checking your data" emoji="⚠️" variant="warning">
45
+ No matter how tempting it is to do everything automatically, you should always check your data at every step to make sure your evaluations are of high quality. Evaluation is the name of the game, and you need extremely good data.
46
+ </Note>
47
 
48
  #### Choosing a prompt
49
  The prompt is going to define:
 
122
 
123
  If you are looking at **log-probabilities**, your metrics are going to be easy: you'll likely want to look at a variant of accuracy (how often the most likely choice is the best choice). It's important to normalize it by length (either character, token, or pmi). You could also look at perplexity, recall, or f1 score.
124
 
125
+ If you're looking at generative evaluations, you'll need to decide if you compare generations as they are, or first normalize them with something. Then, you'll need to select what to use to score your prediction, and this is where it gets trickyyy, so let's jump to the next chapter specifically on this!
126
+
127
+
128
+ ## The hardest part of evaluation: Scoring free form text
129
+
130
+ ### Automatically
131
+
132
+ #### Metrics
133
+ Most ways to automatically compare a string of text to a reference are match based.
134
+
135
+ The easiest but least flexible match-based metrics are **exact matches** of token sequences. <Sidenote> Be aware that "exact match" is used as a catch-all name, and also includes "fuzzy matches" of strings: compared with normalization, on subsets of tokens (prefix only for example), etc. </Sidenote>. While simple and unambiguous, they provide no partial credit - a prediction that's correct except for one word scores the same as one that's completely wrong.
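A minimal sketch of what these look like in practice; the normalization choices here (lowercasing, stripping punctuation and extra whitespace) are illustrative, not a standard.

```python
import re
import string

def normalize(text: str) -> str:
    text = text.lower().strip()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text)

def exact_match(prediction: str, reference: str) -> bool:
    return prediction == reference

def quasi_exact_match(prediction: str, reference: str) -> bool:
    # "Fuzzy" variant: compare after normalization
    return normalize(prediction) == normalize(reference)

def prefix_match(prediction: str, reference: str) -> bool:
    # Useful when models keep generating after having given the answer
    return normalize(prediction).startswith(normalize(reference))
```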
136
+
137
+ The translation and summarisation fields have introduced automatic metrics which compare similarity through overlap of n-grams in sequences. **BLEU** (Bilingual Evaluation Understudy) measures n-gram overlap with reference translations and remains widely used despite having a length bias toward shorter translations and correlating poorly with humans at the sentence level (it notably won't work well for predictions which are semantically equivalent but written in a different fashion than the reference). **ROUGE** does a similar thing but focuses more on recall-oriented n-gram overlap.
138
+ Lastly, you'll also find model-based metrics using embedding distances for similarity like **BLEURT** (it uses BERT-based learned representations trained on human judgments from WMT, providing better semantic understanding than n-gram methods, but requiring a model download and task-specific fine-tuning for optimal performance).
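If you want to try these, the `evaluate` library wraps most of them behind a shared API (assuming the relevant metric backends are installed); the sentences below are only there to show the call shape.

```python
import evaluate

predictions = ["the cat sat on the mat"]
references = ["a cat was sitting on the mat"]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

# BLEU supports several references per prediction, hence the nested list
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(rouge.compute(predictions=predictions, references=references))
```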
139
+
140
+ Once you have an accuracy score per sample, you can aggregate it across your whole set in several ways. In general, people average their results, but you can do more complex things depending on your needs. (Some metrics already come with an aggregation, like CorpusBLEU).
141
 
142
+ If your score is **binary**, look at the **precision** (critical when false positives are costly), **recall** (critical when missing positives is costly), **F1 score** (balances precision and recall, good for imbalanced data), or **MCC** (Matthews Correlation Coefficient, which works well with imbalanced datasets by considering all confusion matrix elements).
143
+ If your score is **continuous**, you can use **mean squared error** (penalizes large errors but heavily weights outliers), **mean absolute error** (more balanced than MSE), or if you assume your data should follow a specific linear regression model, you can look at measures like the **R²** or correlation coefficients like **Pearson** (for linear relationships, assumes normality) or **Spearman** (for monotonic relationships without normality assumptions).
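A quick sketch of computing these with scikit-learn and scipy (the values below are made up):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import (
    f1_score, matthews_corrcoef, mean_absolute_error, precision_score, recall_score,
)

# Binary case: per-sample pass/fail against gold labels
gold = np.array([1, 0, 1, 1, 0, 1])
pred = np.array([1, 0, 0, 1, 1, 1])
print("precision", precision_score(gold, pred))
print("recall   ", recall_score(gold, pred))
print("f1       ", f1_score(gold, pred))
print("mcc      ", matthews_corrcoef(gold, pred))

# Continuous case: predicted scores against target scores
target = np.array([0.2, 0.8, 0.5, 0.9])
scores = np.array([0.3, 0.7, 0.4, 0.95])
print("mae      ", mean_absolute_error(target, scores))
print("pearson  ", pearsonr(target, scores)[0])
print("spearman ", spearmanr(target, scores)[0])
```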
144
 
145
+ More generally, when picking your metric and its aggregation, you need to keep in mind what your task is really about. For some domains (ex: medical, chatbots with public interaction), you don't want to measure the average performance, but need a way to evaluate the **worst performance** you'll get (on medical quality of output, on toxicity, etc).
146
+ <Sidenote>
147
+ To go further, take a look at this [blog](https://ehudreiter.com/2024/07/10/challenges-in-evaluating-llms/). You'll also find a complete list of metrics and their uses in [this organisation](https://huggingface.co/evaluate-metric).
148
+ </Sidenote>
149
+
150
+
151
+ <Note title="Pros and cons of using automated metrics">
152
+ Automated benchmarks have the following advantages:
153
+ - **Consistency and reproducibility**: You can run the same automated benchmark 10 times on the same model and you'll get the same results (barring variations in hardware or inherent model randomness). This means that you can easily create fair rankings of models for a given task.
154
+ - **Scale at limited cost**: They are one of the cheapest ways to evaluate models at the moment.
155
+ - **Understandability**: Most automated metrics are very understandable.
156
+
157
+ However, they also have a **reduced use on more complex tasks**: an automatic metric requires a perfect, unique, and unambiguous reference/gold answer, which you only get for tasks where performance is easy to define and assess (for example, toxicity classification, or knowledge questions with a single answer). More complex capabilities, on the other hand, are harder to decompose into a single and simple answer.
158
+ </Note>
159
 
160
+ #### Normalization
161
 
162
+ Normalization means changing a string of characters to have it fit a specific reference format. For example, when comparing a model prediction to a reference, you usually don't want to penalize extra spacing in the prediction, or added punctuation or capitalisation. That's why you normalize your prediction.
163
+
164
+ They are vital for specific tasks, such as math evaluations, where you want to extract an equation from a longer prediction, and compare it to a reference.
165
  In the below table, we make a list of some issues we saw happening when extracting predictions from model outputs using SymPy naively for the MATH dataset, and how Math-Verify, a specific math parser, solved these.
166
 
167
  | 📄 Example | ❗️Issue | ✅ Math-Verify | 🛑 Naive Approach |
 
172
  | \(23\) | Failed extraction due to latex borders | `23` | None |
173
  | \((- \infty, -14) \cup (-3, \infty)\). | Failed extraction due to interval | Union(Interval.open(-oo, -14), Interval.open(-3, oo)) | None |
174
  | 100\% | Failed extraction due to invalid symbol | `1` | None |
 
175
  | 1/3 == 0.333333 | No rounding support | True | False |
176
  | sqrt(1/2)*7 == sqrt(0.5)*7 | No numerical evaluation support | True | False |
177
 
 
179
  Look at [this blog](https://huggingface.co/blog/math_verify_leaderboard) for more details!
180
  </Sidenote>
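In code, using Math-Verify boils down to parsing both sides and comparing them (a small sketch, assuming the `math-verify` package is installed):

```python
from math_verify import parse, verify

gold = parse("$\\frac{1}{3}$")
prediction = parse("The final answer is $0.333333$")  # extraction from a longer model output

print(verify(gold, prediction))  # should be True thanks to rounding-aware comparison
```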
181
 
182
+ Normalizations can easily [be unfair if not designed well](https://huggingface.co/blog/open-llm-leaderboard-drop), but overall they still help provide signal at the task level.
 
183
 
184
+ They are also important for evaluation of predictions generated with chain of thought, or reasoning, as you'll need to remove the reasoning trace (which is not part of the final answer) from the output to get the actual answer.
185
 
186
+ #### Adding sampling
187
 
188
+ When models generate outputs, sampling multiple times and aggregating results can provide a more robust signal than a single greedy generation.
189
+ This is particularly important for complex reasoning tasks where models may arrive at correct answers through different paths.
190
 
191
+ Common sampling-based metrics are:
192
+ - **pass@k over n**: Given n generated samples, estimates the probability that at least one of k samples passes the test. <Sidenote> You'll find two functions for this metric: computed as: $\text{pass}@k = \mathbb{E}[\text{at least 1 correct among k samples}]$, or computed with an unbiased estimator with: $\text{pass}@k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$ where c is the number of correct samples among n total samples. </Sidenote>
193
+ - **maj@n** (majority voting): Sample n generations and take the most frequent answer. This helps filter out spurious outputs and works particularly well when the model's correct reasoning path is more consistent than its errors. Commonly used for math and reasoning tasks.
194
+ - **cot@n** (chain-of-thought sampling): Sample n reasoning traces and evaluate them. Can be combined with majority voting or a pass@k (sample n reasoning chains, extract final answers, take majority or a threshold).
195
+ - **avg@n** (stable average score): Average the scores across n samples. It's a more stable estimator of performance than using "best" or "most common" case.
196
 
197
+ When you use sampling evaluations, make sure to always report all sampling parameters (temperature, top-p, k value) as they significantly affect results.
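For reference, the unbiased pass@k estimator from the sidenote above, plus a tiny maj@n helper, can be sketched as follows (both operate on already-scored or already-extracted samples):

```python
from collections import Counter
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples generated, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

def maj_at_n(answers: list[str]) -> str:
    """maj@n: most frequent extracted answer among n samples."""
    return Counter(answers).most_common(1)[0][0]

print(pass_at_k(n=16, c=3, k=4))          # probability that at least 1 of 4 draws is correct
print(maj_at_n(["12", "12", "7", "12"]))  # "12"
```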
 
198
 
199
+ <Note title="When can you use sampling and when shouldn't you?">
200
+ **For training evaluation/ablations**: Generally avoid sampling metrics as they're expensive and add variance. Stick to greedy decoding with a fixed seed.
201
+ **For post-training evaluation**: Sampling metrics can reveal capabilities that greedy decoding misses (especially for more complex tasks requiring reasoning, math or code).
202
+ **At inference**: ✅ These metrics help estimate how much improvement you can get from sampling multiple times at inference. It's particularly cool when you want to study how far you can push small models with test time compute.
 
 
 
203
 
204
+ However, keep in mind that sampling k times multiplies your evaluation cost by k. For expensive models or large datasets, this adds up very quickly!
205
+ </Note>
206
 
207
+ #### Using functional testing
208
+ Instead of comparing generated text to a reference through fuzzy string matching, functional testing evaluates whether outputs satisfy specific verifiable constraints. This approach is extremely promising because it's more flexible and allows "infinite" regeneration of test cases through rule-based generation (which reduces overfitting).
 
 
209
 
210
+ **IFEval and IFBench** are excellent examples of this approach for instruction following evaluation. Rather than asking "does this text match a reference answer?", they ask "does this text satisfy formatting constraints given in the instructions?"
 
 
 
 
 
211
 
212
+ For instance, instructions might specify:
213
+ - *"Include exactly 3 bullet points"* verify the output contains exactly 3 bullets
214
+ - *"Capitalize only the first sentence"* parse and check capitalization patterns
215
+ - *"Use the word 'algorithm' at least twice"* count word occurrences
216
+ - *"Your response must be in JSON format with keys 'answer' and 'reasoning'"* validate JSON structure
 
217
 
218
+ Each constraint can be checked with a specific rule-based verifier, making these evaluations unambiguous, interpretable, fast, and considerably less costly than using models as judges.
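To give an idea, toy verifiers for some of the constraints above could look like this (real suites such as IFEval implement many more, with much stricter parsing):

```python
import json
import re

def has_n_bullets(text: str, n: int) -> bool:
    return len(re.findall(r"^\s*[-*] ", text, flags=re.MULTILINE)) == n

def uses_word_at_least(text: str, word: str, n: int) -> bool:
    return len(re.findall(rf"\b{re.escape(word)}\b", text, flags=re.IGNORECASE)) >= n

def is_json_with_keys(text: str, keys: set[str]) -> bool:
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and keys <= parsed.keys()

output = '{"answer": "42", "reasoning": "..."}'
print(is_json_with_keys(output, {"answer", "reasoning"}))  # True
```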
 
219
 
220
+ This functional approach works particularly well for instruction following, but requires creativity to extend to other text properties. The key is identifying aspects of text that can be verified programmatically rather than through semantic comparison.
 
 
 
221
 
222
+ <Sidenote>
223
+ Functional testing is inspired by code evaluation, where functional testing through unit tests is standard practice (checking if generated code produces correct outputs for given inputs).
224
+ </Sidenote>
225
 
 
226
 
227
  ### With humans
228
  Human evaluation is simply asking humans to score predictions.
229
 
230
+ Human evaluation is very interesting, because of its **flexibility** (if you define clearly enough what you are evaluating, you can get scores for about anything!), **inherent lack of contamination** (if humans write new questions to test your system, they should not be present in your training data, hopefully), and **good correlation with human preference** for obvious reasons.
231
 
232
  <Sidenote>
233
+ However, when doing evaluation with humans, you need to make sure your annotators are diverse enough that your results generalize.
234
  </Sidenote>
235
 
236
+ There are 3 main ways to do evaluation with paid annotators. If **you don't have a dataset**, but want to explore a set of capabilities, you provide humans with a task and scoring guidelines (e.g. *Try to make both of these models output toxic language; a model gets 0 if it was toxic, 1 if it was not.*), and access to one (or several) model(s) that they can interact with, then ask them to provide their scores and reasoning. If **you already have a dataset** (eg: a set of *prompts that you want your model to never answer*, for example for safety purposes), you preprompt your model with them, and provide the prompt, output and scoring guidelines to humans. If **you already have a dataset and scores**, you can ask humans to review your evaluation method by doing [error annotation](https://ehudreiter.com/2022/06/01/error-annotations-to-evaluate/) (*it can also be used as a scoring system in the above category*). It's a very important step of testing a new evaluation system, but it technically falls under evaluating an evaluation, so it's slightly out of scope here.
 
 
 
 
 
 
 
 
 
237
 
238
  Pros of systematic human evaluations, especially with paid annotators, are that you're **getting high quality and private data** adapted to your use case (especially if you rely on in house annotators), which are mostly **explainable** (scores obtained by the models will be explainable by the humans who gave them).
239
  However, it's more costly (especially as you'll most likely need rounds of annotations to adapt your guidelines) and does not scale well.
 
247
  it's hard to enforce a consistent grading from many community members using broad guidelines, especially since annotators preferences tend to be [culturally bound](https://arxiv.org/abs/2404.16019v1). One can hope that these effect is smoothed over by the sheer scale of the votes, through a "wisdom of the crowd" effect (see Galton's wikipedia page).
248
  </Sidenote>
249
 
250
+ Overall, however, human evaluation has a number of well-known biases:
251
+ - **First impressions bias**: Human evaluators tend to estimate the quality of answers [based on first impressions](https://arxiv.org/abs/2309.16349), instead of actual factuality or faithfulness.
252
+ - **Tone bias**: Crowdsourced annotators are notably very sensitive to tone, and underestimate the number of factual or logical errors in an assertive answer. In other terms, if a model says wrong things in a confident tone, human evaluators are much less likely to notice it, which could skew ratings towards the more assertive models. (Expert annotators are less likely to fall prey to these biases.)
253
+ - **Self-preference bias**: Humans are [most likely to prefer answers which appeal to their views or align with their opinions or errors](https://arxiv.org/abs/2310.13548), rather than answers which are factually correct.
254
+ - **Identity bias**: People with different identities tend to have different values, and rate model answers very differently (for example on [toxicity](https://arxiv.org/abs/2205.00501))
255
 
256
  ### With judge models
257
  Judge models are simply **neural network used to evaluate the output of other neural networks**. In most cases, they evaluate text generations.
 
261
  Model as judges allow to score text on complex and nuanced properties.
262
  For example, an exact match between a prediction and reference can allow you to test if a model predicted the correct fact or number, but assessing more open-ended empirical capabilities (like fluency, poetry quality, or faithfulness to an input) requires more complex evaluators.
263
 
 
 
264
  They are used on 3 main tasks:
265
  - *Scoring a model generation*, on a provided scale, to assess a property of the text (fluency, toxicity, coherence, persuasiveness, etc).
266
  - *Pairwise scoring*: comparing a pair model outputs to pick the best text with respect to a given property
267
  - *Computing the similarity* between a model output and a reference
268
 
269
+ <Sidenote> In this document, I'll focus on the LLMs + prompt approach for now, but you should definitely check out how classifier judges work, as I think it can be fairly robust and well adapted to a number of use cases, and the recently introduced and promising reward model as judge approach (introduced in [this tech report](https://research.nvidia.com/publication/2024-06_nemotron-4-340b), and on which we have a small page [here](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/model-as-a-judge/what-about-reward-models.md)) </Sidenote>
270
 
271
  #### Pros and cons of using judge-LLMs
272
  People in favor of judge LLMs have been claiming they provide better:
273
  - **Objectivity** when compared to humans: They automate empirical judgments in an objective and reproducible manner (theoretically - in my opinion, they add more subtle bias than they are worth)
274
  - **Scale and reproducibility**: They are more scalable than human annotators, which allows to reproduce scoring on large amounts of data (if you control for temperature).
275
  - **Cost**: They are cheap to instantiate, as they don't require to train a new model, and can just rely on good prompting and an existing high quality LLM. They are also cheaper than paying actual human annotators (capitalism...).
 
276
 
277
+ In my opinion, using LLM judges correctly is extremely tricky, and it's **easy to be deceived for critical use cases**:
278
+ - LLM as judges seem objective, but they have many **hidden biases** that can be harder to detect than the ones in humans, since we're not as actively looking for them (see below). Besides, there are ways to reduce human bias by designing survey questions in specific and statistically robust ways (which has been studied in sociology for about a century), where LLM-prompting is not as robust yet. Using LLMs to evaluate LLMs has been compared to creating an echo-chamber effect, by reinforcing biases subtly.
279
  - They are indeed scalable, but contribute to creating massive amounts of data which themselves need to be examined to ensure their quality (for example, you can improve the quality of LLM-judges by asking them to generate a thinking trace, or reasoning around their data, which makes even more new artificial data to analyse)
280
  - They are indeed cheap to instantiate, but paying actual expert human annotators is likely to give you qualitatively better results for your specific use cases.
281
 
282
+ This section is therefore a bit long, because you need to be well aware of the limitations of using models as judges: a lot of people are blindly jumping into using them because they seem easier than actually working with humans or designing new metrics, but then end up with uninterpretable data with tricky biases to extract.
 
 
 
 
 
 
 
 
 
283
 
284
+ <Note title="Getting started with an LLM judge">
285
+ If you want to give it a go, I suggest first reading this [very good guide](https://huggingface.co/learn/cookbook/en/llm_judge) on how to set up your first LLM as judge!
286
 
 
287
  You can also try the [distilabel](https://distilabel.argilla.io/latest/) library, which allows you to generate synthetic data and update it using LLMs. They have a nice [tutorial](https://distilabel.argilla.io/latest/sections/pipeline_samples/papers/ultrafeedback/) applying the methodology of the [Ultrafeedback paper](https://arxiv.org/abs/2310.01377) as well as a [tutorial on benchmarking](https://distilabel.argilla.io/latest/sections/pipeline_samples/examples/benchmarking_with_distilabel/) implementing the Arena Hard benchmark.
288
+ </Note>
289
 
290
  #### Getting a Judge-Model
291
 
292
+ When using an existing LLM, you can go for [generalist, high-capability models](https://arxiv.org/abs/2306.05685v4), [small specialist models](https://arxiv.org/abs/2405.01535) trained specifically to discriminate from preference data, or train your own.
293
 
294
  **Using a generalist LLM**
295
 
 
317
 
318
  You can also make the choice to use tiny specialized LLM judges. With often a couple billion parameters, they can run locally on most recent consumer hardware, while being trained from scratch or fine-tuned using instruction data. You often need to follow their specific prompt formats.
319
 
320
+ Some existing models as of 2024 were Flow-Judge-v0.1 ([weights](https://huggingface.co/collections/flowaicom/flow-judge-v01-66e6af5fc3b3a128bde07dec)), 3.8B parameters, a Phi-3.5-mini-instruct fine-tuned on a synthetic preference dataset; Prometheus ([weights](https://huggingface.co/prometheus-eval/prometheus-13b-v1.0), [paper](https://arxiv.org/abs/2310.08491)), 13B parameters, a model trained from scratch on a synthetic preference dataset; and JudgeLM ([paper](https://arxiv.org/abs/2310.17631)), 7B to 33B parameters, models trained from scratch on synthetic preference datasets generated with a variety of models. Newer alternatives surely exist!
 
 
 
321
 
322
  **Training your own**
323
+ You can also make the choice to train or fine-tune your own LLM-as-judge. (I would avoid doing this, unless you are working on a very niche domain).
324
 
325
+ If you go in that direction, you'll first need to gather preference data for your task of interest, which can come
326
  - From existing [human preference datasets](https://www.kaggle.com/competitions/lmsys-chatbot-arena)
327
  - From model generated preference data (which you can generate following the above tiny-model judges papers data sections, or get directly, for example from the Prometheus [preference](https://huggingface.co/datasets/prometheus-eval/Preference-Collection) and [feedback](https://huggingface.co/datasets/prometheus-eval/Feedback-Collection) collections).
328
 
329
+ Then you need to decide whether to start from a small model to train from scratch, or from an existing model, which you can distill into a new smaller model or quantize, then fine-tune (using PEFT or adapter weights if the model is big and your training compute is low) on the above data.
330
+
331
+ <Sidenote> Apparently [starting from a reward model works better than from an instruct model](https://x.com/dk21/status/1826292289930674590) </Sidenote>
 
 
332
 
333
  #### Designing your evaluation prompt
334
 
335
  Once you've selected your model, you need to define what is the best possible prompt for your task.
336
 
337
+ <Note title="Prompt design guidelines" emoji="📝" variant="info">
338
+ Provide a clear description of the task at hand:
339
+ - `Your task is to do X`.
340
+ - `You will be provided with Y`.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
341
 
342
+ Provide clear instructions on the evaluation criteria, including a detailed scoring system if needed:
343
+ - `You should evaluate property Z on a scale of 1 - 5, where 1 means ...`
344
+ - `You should evaluate if property Z is present in the sample Y. Property Z is present if ...`
345
+
346
+ Provide some additional "reasoning" evaluation steps:
347
+ - `To judge this task, you must first make sure to read sample Y carefully to identify ..., then ...`
348
+
349
+ Specify the desired output format (adding fields will help consistency)
350
+ - `Your answer should be provided in JSON, with the following format {"Score": Your score, "Reasoning": The reasoning which led you to this score}`
351
  </Note>
352
 
353
  You can and should take inspiration from [MixEval](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended/mix_eval/judge_prompts.pyy) or [MTBench](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended/mt_bench/judge_prompt_templates.py) prompt templates.
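To make this concrete, here is one possible way of assembling these elements into a single template (the task, rubric and JSON fields are placeholders to adapt to your own use case):

```python
JUDGE_PROMPT = """Your task is to evaluate the fluency of a model answer.
You will be provided with a user question and the model's answer.

You should evaluate fluency on a scale of 1 - 5, where:
1 means the answer is unreadable, 3 means it is understandable but awkward,
and 5 means it is natural and fluent.

To judge this task, you must first read the answer carefully to identify any
grammatical errors or awkward phrasing, then decide on the score.

Question:
{question}

Answer:
{answer}

Your answer should be provided in JSON, with the following format:
{{"Reasoning": "the reasoning which led you to this score", "Score": your score}}"""

prompt = JUDGE_PROMPT.format(question="...", answer="...")
```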
354
 
355
+ <Note title="To remember when doing model as judge" emoji="⚠️" variant="warning">
356
+ Pairwise comparison [correlates better with human preference](https://arxiv.org/abs/2403.16950) than scoring, and is more robust generally.
357
+
358
+ If you really want a score, use an integer scale and make sure you provide a detailed explanation for what [each score represents](https://x.com/seungonekim/status/1749289437165769177), or an additive prompt (`provide 1 point for this characteristic of the answer, 1 additional point if ...` etc)
359
+
360
+ Using one prompt per capability to score tends to give better and more robust results.
361
+ </Note>
362
 
363
  You can also improve accuracy using the following, possibly more costly, techniques:
364
  - **Few shot examples**: like in many other tasks, if you provide examples it can help its reasoning. However, this adds to your context length.
 
370
  - You can also experiment with using one model with variations on temperature
371
  - Surprisingly, the community has found that adding stakes to the prompts (`answer correctly and you'll get a kitten`) can increase correctness. Your mileage may vary on this one, adapt to your needs.
372
 
373
+ If you are working on critical tasks (medical domain for example), make sure to use methodologies transferred from the humanities: 1) compute inter-annotator agreement metrics to make sure your evaluators are as unbiased as possible, and 2) use proper survey design methodology when creating your scoring grid to mitigate bias. However, most people don't really want a reproducible and high quality unbiased eval, and will be happy with quick and dirty evaluation through OK-ish prompts. (Which is an OK situation to be in! Just depends on the consequences attached).
374
 
375
  #### Evaluating your evaluator
376
 
377
  Before using a judge-LLM in production or at scale, you want to evaluate its quality for your task, to make sure its scores are actually relevant and useful for you.
378
 
 
379
  <Note>
380
+ This will be easier to do if it predicts binary outputs, because you'll be able to use interpretable classification metrics (accuracy/recall/precision); a minimal sketch follows this note. If it predicts scores on a scale, it will be much harder to estimate the quality of the correlation with a reference: models are notoriously bad at predicting on a scale.
381
  </Note>
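For binary judge outputs, a minimal sketch of such a comparison against a small human-labeled reference set could look like this (the labels are toy placeholders):

```python
# Comparing binary judge verdicts to a human-labeled reference set.
from sklearn.metrics import accuracy_score, precision_score, recall_score

human = [1, 0, 1, 1, 0, 1, 0, 0]   # toy reference annotations (1 = answer judged correct by a human)
judge = [1, 0, 0, 1, 0, 1, 1, 0]   # verdicts from your LLM judge on the same samples

print("accuracy :", accuracy_score(human, judge))
print("precision:", precision_score(human, judge))
print("recall   :", recall_score(human, judge))
```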
382
 
383
  So, once you have selected your model judge and its prompt, you'll need to do the following.
 
413
  **Mitigating well-known biases of LLMs as judges**
414
 
415
  <Note title="Known LLM judge biases and mitigations" emoji="⚠️" variant="warning">
416
+ **Lack of internal consistency**:
417
+
418
+ A judge might give you different judgments if you prompt it several times (if the temperature is not 0)
419
+ ➡️ You can mitigate this by doing self-consistency prompting of your judge, prompting it multiple times and keeping the majority output
420
+
421
+ **Self-preference**
422
+
423
+ Models tend to [favor their own outputs](https://arxiv.org/abs/2404.13076) when scoring answers
424
+ ➡️ You can mitigate this by using a jury of several different models
425
+
426
+ **Blindness to input perturbation**
427
+
428
+ Models are bad at identifying [perturbed input](https://arxiv.org/abs/2406.13439) and, tangentially, [bad at providing consistent score ranges](https://twitter.com/aparnadhinak/status/1748368364395721128) (extended experiments on this [here](https://github.com/LeonEricsson/llmjudge/blob/main/README.md)). For example, when asked to grade the quality of texts to which noise has been added in consistent increments, the predicted grades do not reflect that scale.
429
+
430
+ Mitigations:
431
+ ➡️ asking the model to explain its reasoning [before providing a score](https://twitter.com/seungonekim/status/1749289437165769177)
432
+ ➡️ or providing a coherent grading scale in the prompt.
433
+
434
+ **Position-bias**.
435
+ Models tend to [favor specific answer positions](https://arxiv.org/abs/2306.05685). For example, when presented with pairwise comparisons, Claude and GPT-3.5 tend to quite systematically prefer either the first or the second choice. Mitigations:
436
+ ➡️ switching answer positions randomly (a minimal sketch of this mitigation follows this note)
437
+ ➡️ computing the log-probabilities of all possible choices to get a normalized answer
438
+
439
+ **Verbosity-bias** (or length-bias)
440
+ Models tend to like more verbose answers
441
+ ➡️ You can mitigate this by [accounting for the difference in answer length](https://arxiv.org/abs/2404.04475)
442
+
443
+ **Debatable consistency [with human answers](https://arxiv.org/abs/2308.15812)**: judge verdicts only partially agree with human judgments, so treat them as a noisy proxy rather than as ground truth.
444
+ <Sidenote> However, it's also [debatable if non-expert humans are a good baseline for absolutely all evaluations](https://arxiv.org/abs/2202.06935). For some specific domains (medical, legal, mathematics, etc), relying on non-expert human annotators is as bad a baseline as using an LLM directly.</Sidenote>
445
+
446
+ **Format bias**
447
+ Models tend to fail to evaluate accurately if the prompt format [is too far away](https://arxiv.org/abs/2310.17631) from what they have been trained with. For example, a model trained to do pairwise comparison with an added reference answer will fail if said answer is not provided, and failures will also occur the other way around.
448
+ ➡️ You can mitigate this by paying attention to the training prompt format (if the model was instruction tuned) and ensuring you follow it.
449
  </Note>
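As an illustration of the position-bias mitigation above, here is a minimal sketch that asks the judge for the same pairwise comparison in both answer orders and only keeps consistent verdicts. `query_judge` and the prompt template are hypothetical placeholders for your own inference call and prompt:

```python
from typing import Callable

PAIRWISE_TEMPLATE = (
    "Question: {question}\nAnswer A: {a}\nAnswer B: {b}\n"
    "Which answer is better? Reply with A, B or tie."
)

def debiased_pairwise(question: str, ans_1: str, ans_2: str, query_judge: Callable[[str], str]) -> str:
    """Ask the same comparison in both orders; keep the verdict only if it is order-consistent."""
    first = query_judge(PAIRWISE_TEMPLATE.format(question=question, a=ans_1, b=ans_2))
    second = query_judge(PAIRWISE_TEMPLATE.format(question=question, a=ans_2, b=ans_1))
    # Map the second verdict back to the original answer order
    swapped = {"A": "B", "B": "A", "tie": "tie"}.get(second)
    if swapped is not None and first == swapped:
        return {"A": "answer_1", "B": "answer_2", "tie": "tie"}[first]
    return "inconsistent"  # position bias (or unparsable verdict): treat as a tie or re-query
```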
450
 
451
  **Picking correct tasks for an LLM judge**
app/src/content/chapters/general-knowledge/2025-evaluations-for-useful-models.mdx CHANGED
@@ -11,10 +11,12 @@ You can evaluate **specific capabilities** on their own - it's usually quite int
11
 
12
  Reasoning and commonsense datasets are often “historic” datasets, built in the age of BERT and embedding models, before the LLM craze. They were quite challenging at the time (especially because they were often adversarially built against the models of the time), but now they are 1) too easy, 2) contaminated/saturated, and should only be used for ablations or as pretraining evaluations. The bigger datasets also sometimes contain errors or low-quality questions, as they tend to have been built through Amazon Mechanical Turk in order to scale up fast and at low cost (what is now done by using LLMs to generate evaluation questions).
13
 
14
- [ARC]([https://arxiv.org/abs/1803.05457](https://arxiv.org/abs/1803.05457)) (2018) (not to confuse with ARC-AGI) is a grade school science MCQA dataset built from human tests. The choices were selected adversarially for word co-occurence systems at the time. It has several subsets, the higher quality `challenge` one is still in use today for pretraining. [WinoGrande]([https://arxiv.org/pdf/1907.10641](https://arxiv.org/pdf/1907.10641)) (2019) is a crowdsourced (mechanical turk + validation) pronoun resolution/fill in the blank dataset, using adversarial pairs of items to trick models. Both these datasets have been quite hard for models until 2022 to 2023.
15
 
16
  A number of historic datasets look specifically at reasoning requiring some sort of commonsense understanding and grounding. [HellaSwag](https://arxiv.org/abs/1905.07830) (2019) requires LLMs to select the correct next sentence from a list of adversarial choices, where the text comes from captions in ActivityNet and from tutorials in WikiHow (it's the follow-up of a dataset called SWAG). As most sentences come from tutorials or descriptions of activities, they often require physical commonsense grounding to solve. In the same vein, [CommonsenseQA](https://arxiv.org/abs/1811.00937) (2018) is a dataset of commonsense MCQA built from ConceptNet - annotators write questions, then use conceptually close distractors as options. [PIQA](https://arxiv.org/abs/1911.11641) (2019) looks specifically at physical commonsense questions (created from examples from [Instructables.com](http://Instructables.com), again with adversarial choices from semantic perturbations or rewriting). [OpenBookQA](https://arxiv.org/abs/1809.02789) (2018) provides open-book facts to help answer MCQA questions - however, these questions also require latent commonsense knowledge.
17
 
 
 
18
  #### Knowledge
19
  The main evaluation dataset for knowledge has been [MMLU](https://arxiv.org/abs/2009.03300) (2020). It reached saturation/contamination, and after more in-depth examination, a number of issues were identified: incomplete questions referring to absent documents, incorrect ground truths, ambiguous questions, and blatant US-centrism in the topics chosen. It was therefore cleaned in [MMLU-Redux](https://arxiv.org/abs/2406.04127) (2024), extended with more complex questions and more answer options in [**MMLU-Pro**](https://arxiv.org/abs/2406.01574) (2024, the main replacement used by the community at the moment), and translated/annotated for cultural bias in [Global-MMLU](https://arxiv.org/abs/2412.03304) (2024). These are used mostly for pretraining evaluations and ablations.
20
 
 
11
 
12
  Reasoning and commonsense datasets are often “historic” datasets, built in the age of BERT and embedding models, before the LLM craze. They were quite challenging at the time (especially because they were often adversarially built against the models of the time), but now they are 1) too easy, 2) contaminated/saturated, and should only be used for ablations or as pretraining evaluations. The bigger datasets also sometimes contain errors or low-quality questions, as they tend to have been built through Amazon Mechanical Turk in order to scale up fast and at low cost (what is now done by using LLMs to generate evaluation questions).
13
 
14
+ [ARC](https://arxiv.org/abs/1803.05457) (2018) (not to be confused with ARC-AGI) is a grade-school science MCQA dataset built from human tests. The choices were selected adversarially against the word co-occurrence systems of the time. It has several subsets; the higher-quality `challenge` one is still in use today for pretraining evaluation. [WinoGrande](https://arxiv.org/pdf/1907.10641) (2019) is a crowdsourced (Mechanical Turk + validation) pronoun resolution/fill-in-the-blank dataset, using adversarial pairs of items to trick models. Both of these datasets remained quite hard for models until 2022-2023.
15
 
16
  A number of historic datasets look specifically at reasoning requiring some sort of commonsense understanding and grounding. [HellaSwag](https://arxiv.org/abs/1905.07830) (2019) requires LLMs to select the correct next sentence from a list of adversarial choices, where the text comes from captions in ActivityNet and from tutorials in WikiHow (it's the follow-up of a dataset called SWAG). As most sentences come from tutorials or descriptions of activities, they often require physical commonsense grounding to solve. In the same vein, [CommonsenseQA](https://arxiv.org/abs/1811.00937) (2018) is a dataset of commonsense MCQA built from ConceptNet - annotators write questions, then use conceptually close distractors as options. [PIQA](https://arxiv.org/abs/1911.11641) (2019) looks specifically at physical commonsense questions (created from examples from [Instructables.com](http://Instructables.com), again with adversarial choices from semantic perturbations or rewriting). [OpenBookQA](https://arxiv.org/abs/1809.02789) (2018) provides open-book facts to help answer MCQA questions - however, these questions also require latent commonsense knowledge.
17
 
18
+ A more recent cool reasoning dataset is [Zebra Logic](https://arxiv.org/abs/2502.01100), which uses logic puzzles to test model reasoning capabilities. Its generation method allows for an infinite supply of puzzles, so there is little risk of contamination.
19
+
20
  #### Knowledge
21
  The main evaluation dataset for knowledge has been [MMLU](https://arxiv.org/abs/2009.03300) (2020). It reached saturation/contamination, and after more in-depth examination, a number of issues were identified: incomplete questions referring to absent documents, incorrect ground truths, ambiguous questions, and blatant US-centrism in the topics chosen. It was therefore cleaned in [MMLU-Redux](https://arxiv.org/abs/2406.04127) (2024), extended with more complex questions and more answer options in [**MMLU-Pro**](https://arxiv.org/abs/2406.01574) (2024, the main replacement used by the community at the moment), and translated/annotated for cultural bias in [Global-MMLU](https://arxiv.org/abs/2412.03304) (2024). These are used mostly for pretraining evaluations and ablations.
22