Clémentine committed
Commit 0f38400 · 1 Parent(s): 0039d7e

update eval

app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx CHANGED
@@ -109,6 +109,8 @@ Scoring free-form text is tricky because there are typically many different ways
 
 ### Automatically
 
 #### Metrics
 Most ways to automatically compare a string of text to a reference are match based.
 
@@ -214,19 +216,14 @@ Human evaluation is very interesting, because of its **flexibility** (if you def
 However, when doing evaluation with humans, you need to make sure your annotators are diverse enough that your results generalize.
 </Sidenote>
 
- There are 3 main ways to do evaluation with paid annotators. If **you don't have a dataset**, but want to explore a set of capabilities, you provide humans with a task and scoring guidelines (e.g. *Try to make both these models output toxic language; a model gets 0 if it was toxic, 1 if it was not.*), and access to one (or several) model(s) that they can interact with, then ask them to provide their scores and reasoning. If **you already have a dataset** (e.g. a set of *prompts that you want your model to never answer*, for example for safety purposes), you preprompt your model with them, and provide the prompt, output and scoring guidelines to humans. If **you already have a dataset and scores**, you can ask humans to review your evaluation method by doing [error annotation](https://ehudreiter.com/2022/06/01/error-annotations-to-evaluate/) (*it can also be used as a scoring system in the above category*). It's a very important step of testing a new evaluation system, but it technically falls under evaluating an evaluation, so it's slightly out of scope here.
 
- Pros of systematic human evaluations, especially with paid annotators, are that you're **getting high quality and private data** adapted to your use case (especially if you rely on in-house annotators), which are mostly **explainable** (scores obtained by the models will be explainable by the humans who gave them).
- However, it's more costly (especially as you'll most likely need rounds of annotations to adapt your guidelines) and does not scale well.
 
- Two other approaches exist to do human-based evaluation in a more casual way:
- - **Vibe-checks**: manual evaluations done by individuals to get an overall feeling of how well models perform on many use cases (from coding to the quality of smut written). Often shared on Twitter and Reddit, results mostly constitute anecdotal evidence, and tend to be highly sensitive to confirmation bias (in other words, people tend to find what they look for). However, they can be a [good starting point for your own use cases](https://olshansky.substack.com/p/vibe-checks-are-all-you-need).
- - **Arenas**: crowdsourced human evaluation to rank models. A well-known example of this is the [LMSYS chatbot arena](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard), where community users are asked to chat with models until they find one better than the other. Votes are then aggregated into an Elo ranking (a ranking of matches) to select which model is "the best".
 
- Pros of casual human evaluations are that they are cheap, scale better, and allow you to discover fun edge cases, since you leverage users' creativity in a mostly unbounded manner. However, the obvious issues are that they are easy to game externally, the **high subjectivity** is hard to mitigate, and they are usually not representative of the broader population, since young western men are over-represented on the tech side of the internet (both in terms of topics explored and overall rankings).
- <Sidenote>
- It's hard to enforce consistent grading from many community members using broad guidelines, especially since annotator preferences tend to be [culturally bound](https://arxiv.org/abs/2404.16019v1). One can hope that this effect is smoothed over by the sheer scale of the votes, through a "wisdom of the crowd" effect (see Galton's Wikipedia page).
- </Sidenote>
 
 Overall, however, human evaluation has a number of well known biases:
  - **First impressions bias**: Human evaluators tend to estimate the quality of answers [based on first impressions](https://arxiv.org/abs/2309.16349), instead of actual factuality or faithfulness.
@@ -523,9 +520,11 @@ However, some recent [research](https://arxiv.org/abs/2408.02442) seems to show
 - [Interleaved generation](https://github.com/guidance-ai/guidance?tab=readme-ov-file#guidance-acceleration), another method to constrain generations for some specific output formats
 </Note>
 
- ### Calibration and confidence
 
 When reporting evaluation results, it's critical to include **confidence intervals** alongside point estimates.
 
- These confidence intervals can be obtained from standard deviations over the scores or [bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)) - for automatic metrics, this is relatively trivial - for model judges, a [recent paper](https://arxiv.org/pdf/2511.21140) suggested bias correction with estimators. For human based evaluations, you should report agreement.
 
 
 ### Automatically
 
+ When there is a ground truth, however, you can use automatic metrics; let's see how.
+
 #### Metrics
 Most ways to automatically compare a string of text to a reference are match based.
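
To make "match based" concrete, here is a minimal sketch of two common comparisons, normalized exact match and token-level F1; the normalization steps and function names are illustrative assumptions, not the chapter's own implementation.

```python
# Minimal sketch of match-based scoring: exact match and token-level F1
# against a gold reference. Normalization choices here are illustrative.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)            # drop articles
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())                          # squash whitespace

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = normalize(prediction).split(), normalize(reference).split()
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))     # 1.0
print(token_f1("It is in Paris, France", "Paris"))         # partial credit
```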
 
 
 However, when doing evaluation with humans, you need to make sure your annotators are diverse enough that your results generalize.
 </Sidenote>
 
+ Human evaluations can be casual, with vibe-checks or arenas, or systematic (as mentioned in the intro).
+
+ Vibe-checks are a particularly [good starting point for your own use cases](https://olshansky.substack.com/p/vibe-checks-are-all-you-need), as you'll be testing the model on what's relevant to you. Pros of casual human evaluations are that they are cheap and allow you to discover fun edge cases, since you leverage users' creativity in a mostly unbounded manner. However, they can be prone to blind spots.
+
+ Once you want to scale to more systematic evaluation with paid annotators, you'll find that there are 3 main ways to do so. If **you don't have a dataset**, but want to explore a set of capabilities, you provide humans with a task and scoring guidelines (e.g. *Try to make both these models output toxic language; a model gets 0 if it was toxic, 1 if it was not.*), and access to one (or several) model(s) that they can interact with, then ask them to provide their scores and reasoning. If **you already have a dataset** (e.g. a set of *prompts that you want your model to never answer*, for example for safety purposes), you preprompt your model with them, and provide the prompt, output and scoring guidelines to humans. If **you already have a dataset and scores**, you can ask humans to review your evaluation method by doing [error annotation](https://ehudreiter.com/2022/06/01/error-annotations-to-evaluate/) (*it can also be used as a scoring system in the above category*). It's a very important step of testing a new evaluation system, but it technically falls under evaluating an evaluation, so it's slightly out of scope here.
+
+ Pros of systematic human evaluations, especially with paid annotators, are that you're **getting high quality and private data** adapted to your use case (especially if you rely on in-house annotators), which are mostly **explainable** (scores obtained by the models will be explainable by the humans who gave them).
+ However, it's more costly (especially as you'll most likely need rounds of annotations to adapt your guidelines) and does not scale well.
 
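The arenas mentioned above aggregate pairwise votes into an Elo ranking (a ranking of matches). As a minimal sketch of what such an aggregation can look like, here is a generic Elo update with hypothetical model names; it is an illustration, not the exact scheme any particular arena uses.

```python
# Minimal sketch of turning pairwise "A beats B" votes into Elo scores.
# Generic Elo update rule, for illustration only; real arenas typically
# apply more robust aggregation on top of the raw votes.
from collections import defaultdict

def update_elo(ratings, winner, loser, k=32):
    # Expected win probability of the current winner given the rating gap
    expected = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += k * (1 - expected)
    ratings[loser] -= k * (1 - expected)

ratings = defaultdict(lambda: 1000.0)  # hypothetical models all start equal
votes = [("model-a", "model-b"), ("model-b", "model-c"), ("model-a", "model-c")]
for winner, loser in votes:
    update_elo(ratings, winner, loser)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))  # highest-rated first
```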
 
 Overall, however, human evaluation has a number of well known biases:
  - **First impressions bias**: Human evaluators tend to estimate the quality of answers [based on first impressions](https://arxiv.org/abs/2309.16349), instead of actual factuality or faithfulness.
 
 - [Interleaved generation](https://github.com/guidance-ai/guidance?tab=readme-ov-file#guidance-acceleration), another method to constrain generations for some specific output formats
 </Note>
 
+ ### Confidence and score reporting
 
 When reporting evaluation results, it's critical to include **confidence intervals** alongside point estimates.
 
+ These confidence intervals can be obtained from the raw scores using standard deviations or [bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)): for automatic metrics, this is relatively trivial; for model judges, a [recent paper](https://arxiv.org/pdf/2511.21140) suggested bias correction with estimators. For human-based evaluations, you should report inter-annotator agreement.
+
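To make the options above concrete, here is a minimal sketch, using only the standard library, of a bootstrap confidence interval over per-sample automatic scores, plus Cohen's kappa as one common way to report agreement between two annotators (kappa is my choice of example here, since the text only says "agreement"); the toy data, resample count and confidence level are illustrative assumptions.

```python
# Minimal sketch: (1) bootstrap confidence interval over per-sample scores,
# (2) Cohen's kappa between two annotators as one way to report agreement.
# Toy data, resample count and confidence level are illustrative choices.
import random
import statistics
from collections import Counter

def bootstrap_ci(scores, n_resamples=1000, confidence=0.95, seed=0):
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores)))  # resample with replacement
        for _ in range(n_resamples)
    )
    lower = means[int((1 - confidence) / 2 * n_resamples)]
    upper = means[int((1 + confidence) / 2 * n_resamples) - 1]
    return statistics.mean(scores), (lower, upper)

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2
    return (observed - expected) / (1 - expected)

scores = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]            # toy per-sample metric results
acc, (low, high) = bootstrap_ci(scores)
print(f"accuracy = {acc:.2f}, 95% CI ~ [{low:.2f}, {high:.2f}]")

ann_a = ["ok", "bad", "ok", "ok", "bad", "ok"]      # toy human annotations
ann_b = ["ok", "bad", "bad", "ok", "bad", "ok"]
print(f"Cohen's kappa = {cohens_kappa(ann_a, ann_b):.2f}")
```
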
+ You can also compute these with prompt variations, by asking the same questions in slightly different ways, or by re-running on the same samples with different prompt formats.
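
A minimal sketch of reporting that prompt-format sensitivity; the prompt templates and scores below are toy placeholders standing in for real evaluation runs.

```python
# Minimal sketch of reporting variability across prompt variations: run the
# same evaluation with several prompt formats and report mean and std.
# The scores below are toy placeholders for real evaluation runs.
import statistics

scores_per_prompt_format = {
    "Question: {q}\nAnswer:": 0.71,
    "Q: {q}\nA:": 0.68,
    "{q}\nGive only the final answer:": 0.74,
}

scores = list(scores_per_prompt_format.values())
print(f"score = {statistics.mean(scores):.3f} +/- {statistics.stdev(scores):.3f} "
      f"across {len(scores)} prompt formats")
```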