Clémentine committed
Commit 0f38400 · 1 Parent(s): 0039d7e

update eval

app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx CHANGED
@@ -109,6 +109,8 @@ Scoring free-form text is tricky because there are typically many different ways
 
 ### Automatically
 
 #### Metrics
 Most ways to automatically compare a string of text to a reference are match based.
 
@@ -214,19 +216,14 @@ Human evaluation is very interesting, because of its **flexibility** (if you def
 However, when doing evaluation with humans, you need to make sure your annotators are diverse enough that your results generalize.
 </Sidenote>
 
- There are 3 main ways to do evaluation with paid annotators. If **you don't have a dataset**, but want to explore a set of capabilities, you provide humans with a task and scoring guidelines (e.g. *Try to make both these models output toxic language; a model gets 0 if it was toxic, 1 if it was not.*), and access to one (or several) model(s) that they can interact with, then ask them to provide their scores and reasoning. If **you already have a dataset** (e.g. a set of *prompts that you want your model to never answer*, for example for safety purposes), you preprompt your model with them, and provide the prompt, output and scoring guidelines to humans. If **you already have a dataset and scores**, you can ask humans to review your evaluation method by doing [error annotation](https://ehudreiter.com/2022/06/01/error-annotations-to-evaluate/) (*it can also be used as a scoring system in the above category*). It's a very important step of testing a new evaluation system, but it technically falls under evaluating an evaluation, so it's slightly out of scope here.
 
- Pros of systematic human evaluations, especially with paid annotators, are that you're **getting high quality and private data** adapted to your use case (especially if you rely on in-house annotators), which are mostly **explainable** (scores obtained by the models will be explainable by the humans who gave them).
- However, it's more costly (especially as you'll most likely need rounds of annotations to adapt your guidelines) and does not scale well.
 
- Two other approaches exist to do human-based evaluation in a more casual way:
- - **Vibe-checks**: manual evaluations done by individuals to get an overall feeling of how well models perform on many use cases (from coding to the quality of smut written). Often shared on Twitter and Reddit, results mostly constitute anecdotal evidence, and tend to be highly sensitive to confirmation bias (in other words, people tend to find what they look for). However, they can be a [good starting point for your own use cases](https://olshansky.substack.com/p/vibe-checks-are-all-you-need).
- - **Arenas**: crowdsourced human evaluation to rank models. A well-known example of this is the [LMSYS chatbot arena](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard), where community users are asked to chat with models until they find one better than the other. Votes are then aggregated into an Elo ranking (a ranking of matches) to select which model is "the best".
 
- Pros of casual human evaluations are that they are cheap, scale better, and allow you to discover fun edge cases, since you leverage users' creativity in a mostly unbounded manner. However, the obvious issues are that they are easy to game externally, the **high subjectivity** is hard to mitigate, and they are usually not representative of the broader population, since young western men are over-represented on the tech side of the internet (both in terms of topics explored and overall rankings).
- <Sidenote>
- It's hard to enforce consistent grading from many community members using broad guidelines, especially since annotator preferences tend to be [culturally bound](https://arxiv.org/abs/2404.16019v1). One can hope that this effect is smoothed over by the sheer scale of the votes, through a "wisdom of the crowd" effect (see Galton's Wikipedia page).
- </Sidenote>
 
 Overall, however, human evaluation has a number of well known biases:
  - **First impressions bias**: Human evaluators tend to estimate the quality of answers [based on first impressions](https://arxiv.org/abs/2309.16349), instead of actual factuality or faithfulness.
@@ -523,9 +520,11 @@ However, some recent [research](https://arxiv.org/abs/2408.02442) seems to show
 - [Interleaved generation](https://github.com/guidance-ai/guidance?tab=readme-ov-file#guidance-acceleration), another method to constrain generations for some specific output formats
 </Note>
 
- ### Calibration and confidence
 
 When reporting evaluation results, it's critical to include **confidence intervals** alongside point estimates.
 
- These confidence intervals can be obtained from standard deviations over the scores or [bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)) - for automatic metrics, this is relatively trivial - for model judges, a [recent paper](https://arxiv.org/pdf/2511.21140) suggested bias correction with estimators. For human based evaluations, you should report agreement.
 
 
 ### Automatically
 
+ When there is a ground truth, however, you can use automatic metrics; let's see how.
+
 #### Metrics
 Most ways to automatically compare a string of text to a reference are match based.
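
To make "match based" concrete, here is a minimal sketch of two common comparisons, normalized exact match and token-level F1; the normalization steps and function names are illustrative assumptions, not the chapter's own implementation.

```python
# Minimal sketch of match-based scoring: exact match and token-level F1
# against a gold reference. Normalization choices here are illustrative.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)            # drop articles
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())                          # squash whitespace

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = normalize(prediction).split(), normalize(reference).split()
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))     # 1.0
print(token_f1("It is in Paris, France", "Paris"))         # partial credit
```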
 
 
 However, when doing evaluation with humans, you need to make sure your annotators are diverse enough that your results generalize.
 </Sidenote>
 
+ Human evaluations can be casual, with vibe-checks or arenas, or systematic (as mentioned in the intro).
+
+ Vibe-checks are a particularly [good starting point for your own use cases](https://olshansky.substack.com/p/vibe-checks-are-all-you-need), as you'll be testing the model on what's relevant to you. Pros of casual human evaluations are that they are cheap and allow you to discover fun edge cases, since you leverage users' creativity in a mostly unbounded manner. However, they can be prone to blind spots.
+
+ Once you want to scale to more systematic evaluation with paid annotators, you'll find that there are 3 main ways to do so. If **you don't have a dataset**, but want to explore a set of capabilities, you provide humans with a task and scoring guidelines (e.g. *Try to make both these models output toxic language; a model gets 0 if it was toxic, 1 if it was not.*), and access to one (or several) model(s) that they can interact with, then ask them to provide their scores and reasoning. If **you already have a dataset** (e.g. a set of *prompts that you want your model to never answer*, for example for safety purposes), you preprompt your model with them, and provide the prompt, output and scoring guidelines to humans. If **you already have a dataset and scores**, you can ask humans to review your evaluation method by doing [error annotation](https://ehudreiter.com/2022/06/01/error-annotations-to-evaluate/) (*it can also be used as a scoring system in the above category*). It's a very important step of testing a new evaluation system, but it technically falls under evaluating an evaluation, so it's slightly out of scope here.
+
+ Pros of systematic human evaluations, especially with paid annotators, are that you're **getting high quality and private data** adapted to your use case (especially if you rely on in-house annotators), which are mostly **explainable** (scores obtained by the models will be explainable by the humans who gave them).
+ However, it's more costly (especially as you'll most likely need rounds of annotations to adapt your guidelines) and does not scale well.
 
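The arenas mentioned above aggregate pairwise votes into an Elo ranking (a ranking of matches). As a minimal sketch of what such an aggregation can look like, here is a generic Elo update with hypothetical model names; it is an illustration, not the exact scheme any particular arena uses.

```python
# Minimal sketch of turning pairwise "A beats B" votes into Elo scores.
# Generic Elo update rule, for illustration only; real arenas typically
# apply more robust aggregation on top of the raw votes.
from collections import defaultdict

def update_elo(ratings, winner, loser, k=32):
    # Expected win probability of the current winner given the rating gap
    expected = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += k * (1 - expected)
    ratings[loser] -= k * (1 - expected)

ratings = defaultdict(lambda: 1000.0)  # hypothetical models all start equal
votes = [("model-a", "model-b"), ("model-b", "model-c"), ("model-a", "model-c")]
for winner, loser in votes:
    update_elo(ratings, winner, loser)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))  # highest-rated first
```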
 
 Overall, however, human evaluation has a number of well known biases:
  - **First impressions bias**: Human evaluators tend to estimate the quality of answers [based on first impressions](https://arxiv.org/abs/2309.16349), instead of actual factuality or faithfulness.
 
 - [Interleaved generation](https://github.com/guidance-ai/guidance?tab=readme-ov-file#guidance-acceleration), another method to constrain generations for some specific output formats
 </Note>
 
+ ### Confidence and score reporting
 
 When reporting evaluation results, it's critical to include **confidence intervals** alongside point estimates.
 
+ These confidence intervals can be obtained from the raw scores using standard deviations or [bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)): for automatic metrics, this is relatively trivial; for model judges, a [recent paper](https://arxiv.org/pdf/2511.21140) suggested bias correction with estimators. For human-based evaluations, you should report inter-annotator agreement.
+
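To make the options above concrete, here is a minimal sketch, using only the standard library, of a bootstrap confidence interval over per-sample automatic scores, plus Cohen's kappa as one common way to report agreement between two annotators (kappa is my choice of example here, since the text only says "agreement"); the toy data, resample count and confidence level are illustrative assumptions.

```python
# Minimal sketch: (1) bootstrap confidence interval over per-sample scores,
# (2) Cohen's kappa between two annotators as one way to report agreement.
# Toy data, resample count and confidence level are illustrative choices.
import random
import statistics
from collections import Counter

def bootstrap_ci(scores, n_resamples=1000, confidence=0.95, seed=0):
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores)))  # resample with replacement
        for _ in range(n_resamples)
    )
    lower = means[int((1 - confidence) / 2 * n_resamples)]
    upper = means[int((1 + confidence) / 2 * n_resamples) - 1]
    return statistics.mean(scores), (lower, upper)

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2
    return (observed - expected) / (1 - expected)

scores = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]            # toy per-sample metric results
acc, (low, high) = bootstrap_ci(scores)
print(f"accuracy = {acc:.2f}, 95% CI ~ [{low:.2f}, {high:.2f}]")

ann_a = ["ok", "bad", "ok", "ok", "bad", "ok"]      # toy human annotations
ann_b = ["ok", "bad", "bad", "ok", "bad", "ok"]
print(f"Cohen's kappa = {cohens_kappa(ann_a, ann_b):.2f}")
```
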
+ You can also compute these with prompt variations, by asking the same questions in slightly different ways, or by re-running on the same samples with different prompt formats.
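
A minimal sketch of reporting that prompt-format sensitivity; the prompt templates and scores below are toy placeholders standing in for real evaluation runs.

```python
# Minimal sketch of reporting variability across prompt variations: run the
# same evaluation with several prompt formats and report mean and std.
# The scores below are toy placeholders for real evaluation runs.
import statistics

scores_per_prompt_format = {
    "Question: {q}\nAnswer:": 0.71,
    "Q: {q}\nA:": 0.68,
    "{q}\nGive only the final answer:": 0.74,
}

scores = list(scores_per_prompt_format.values())
print(f"score = {statistics.mean(scores):.3f} +/- {statistics.stdev(scores):.3f} "
      f"across {len(scores)} prompt formats")
```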