Clémentine committed · Commit 8bcaee3 · 1 Parent(s): c768f1d
README.md CHANGED

@@ -1,6 +1,6 @@
 ---
-title: '
-short_desc: '
+title: 'Evaluation Guidebook'
+short_desc: 'How to properly evaluate LLMs in the modern age'
 emoji: 📝
 colorFrom: blue
 colorTo: indigo
app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx CHANGED

@@ -122,8 +122,7 @@ However, nowadays most evaluations are generative: using generations (QA, questi
 
 If you are looking at **log-probabilities**, your metrics are going to be easy: you'll likely want to look at a variant of accuracy (how often the most likely choice is the best choice). It's important to normalize it by length (either character, token, or PMI). You could also look at perplexity, recall, or F1 score.
 
-If you're looking at generative evaluations,
-
+If you're looking at generative evaluations, this is where it gets tricky, so the next chapter is dedicated to exactly this!
 
 ## The hardest part of evaluation: Scoring free form text
 
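To make the length normalization mentioned in this hunk concrete, here is a minimal sketch of character-length-normalized log-probability accuracy. The function name and data layout are illustrative assumptions, not code from the guidebook.

```python
# Minimal sketch: accuracy over multiple-choice options scored by
# log-probability, normalized by character length so longer options
# are not penalized. The data layout here is an illustrative assumption.
def char_norm_accuracy(examples: list[dict]) -> float:
    correct = 0
    for ex in examples:
        # Score each option by its total logprob divided by its character length.
        scores = {
            opt: logprob / char_len
            for opt, (logprob, char_len) in ex["options"].items()
        }
        correct += max(scores, key=scores.get) == ex["gold"]
    return correct / len(examples)

examples = [
    {"options": {"A": (-4.2, 7), "B": (-9.1, 12)}, "gold": "A"},
    {"options": {"A": (-8.0, 5), "B": (-6.5, 10)}, "gold": "B"},
]
print(char_norm_accuracy(examples))  # 1.0 on this toy data
```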
@@ -183,7 +182,7 @@ Normalizations can easily [be unfair if not designed well](https://huggingface.c
 
 They are also important for evaluating predictions generated with chain of thought, or reasoning, as you'll need to remove the reasoning trace (which is not part of the final answer) from the output to get the actual answer.
 
-####
+#### Sampling
 
 When models generate outputs, sampling multiple times and aggregating results can provide a more robust signal than a single greedy generation.
 This is particularly important for complex reasoning tasks where models may arrive at correct answers through different paths.
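As a concrete illustration of aggregating sampled generations, here is a minimal maj@k sketch (majority vote over k samples); `extract_answer` is a stand-in assumption for whatever answer parsing your task needs.

```python
from collections import Counter

# Minimal sketch of maj@k: sample k generations per prompt, parse the final
# answer out of each, and keep the most frequent answer as the prediction.
def majority_vote(generations: list[str], extract_answer) -> str:
    answers = [extract_answer(g) for g in generations]
    return Counter(answers).most_common(1)[0][0]

samples = ["... so the answer is 42", "... giving 41", "... therefore 42"]
print(majority_vote(samples, lambda g: g.split()[-1]))  # "42"
```

A pass@k aggregation would instead check whether any of the k parsed answers matches the reference.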
@@ -204,7 +203,7 @@ When you use sampling evaluations, make sure to always report all sampling param
 However, keep in mind that sampling k times multiplies your evaluation cost by k. For expensive models or large datasets, this adds up very quickly!
 </Note>
 
-####
+#### Functional scorers
 Instead of comparing generated text to a reference through fuzzy string matching, functional testing evaluates whether outputs satisfy specific verifiable constraints. This approach is extremely promising because it's more flexible and allows "infinite" updates of the test cases through rule-based generation (which reduces overfitting).
 
 **IFEval and IFBench** are excellent examples of this approach for instruction-following evaluation. Rather than asking "does this text match a reference answer?", they ask "does this text satisfy formatting constraints given in the instructions?"
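To show what a functional scorer can look like in practice, here is a minimal sketch in the spirit of IFEval; the specific constraints below are hypothetical examples, not IFEval's actual checks.

```python
# Minimal sketch of a functional scorer: verify checkable constraints on the
# output instead of matching a reference string. Constraints are hypothetical.
def check_constraints(output: str) -> dict[str, bool]:
    lines = output.splitlines()
    return {
        "has_three_bullets": sum(line.startswith("- ") for line in lines) >= 3,
        "under_100_words": len(output.split()) < 100,
        "ends_with_period": output.strip().endswith("."),
    }

checks = check_constraints("- one\n- two\n- three.")
print(sum(checks.values()) / len(checks))  # fraction of constraints satisfied
```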
@@ -223,7 +222,6 @@ This functional approach works particularly well for instruction following, but
 Functional testing is inspired by code evaluation, where functional testing through unit tests is standard practice (checking if generated code produces correct outputs for given inputs).
 </Sidenote>
 
-
 ### With humans
 Human evaluation is simply asking humans to score predictions.
 
@@ -511,3 +509,10 @@ On the other hand they:
 - For reward models that rate single prompts and completions, you can cache the scores of many reference models and easily see how a new model performs.
 - Tracking win rates or probabilities over training, e.g. as in [this](https://arxiv.org/abs/2410.11677v1) paper, can allow you to detect model degradation and select optimal checkpoints.
 </Note>
+
+### Calibration and confidence
+
+When reporting evaluation results, it's critical to include **confidence intervals** alongside point estimates.
+
+These confidence intervals can be obtained from standard deviations over the scores or [bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)). For automatic metrics this is relatively trivial; for model judges, a [recent paper](https://arxiv.org/pdf/2511.21140) suggested bias correction with estimators. For human-based evaluations, you should report inter-annotator agreement.
+
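For the automatic-metric case, a percentile bootstrap over per-example scores takes only a few lines. This is a minimal sketch under the assumption that scores are per-example 0/1 accuracies.

```python
import random

# Minimal sketch: percentile-bootstrap 95% confidence interval over
# per-example scores (assumed here to be 0/1 accuracies).
def bootstrap_ci(scores: list[float], n_resamples: int = 10_000, seed: int = 0):
    rng = random.Random(seed)
    # Resample the score list with replacement and record each resampled mean.
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    return means[int(0.025 * n_resamples)], means[int(0.975 * n_resamples)]

scores = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # toy per-example accuracies
low, high = bootstrap_ci(scores)
print(f"mean={sum(scores) / len(scores):.2f}, 95% CI=({low:.2f}, {high:.2f})")
```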