Clémentine committed · Commit 8bcaee3 · 1 Parent(s): c768f1d
README.md CHANGED

@@ -1,6 +1,6 @@
 ---
-title: '
-short_desc: '
+title: 'Evaluation Guidebook'
+short_desc: 'How to properly evaluate LLMs in the modern age'
 emoji: 📝
 colorFrom: blue
 colorTo: indigo
app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx CHANGED

@@ -122,8 +122,7 @@ However, nowadays most evaluations are generative: using generations (QA, questi
 
 If you are looking at **log-probabilities**, your metrics are going to be easy: you'll likely want to look at a variant of accuracy (how often the most likely choice is the best choice). It's important to normalize it by length (either character, token, or PMI). You could also look at perplexity, recall, or F1 score.
 
-If you're looking at generative evaluations,
-
+If you're looking at generative evaluations, this is where it gets tricky, so the next chapter is dedicated to exactly this!
 
 ## The hardest part of evaluation: Scoring free form text
 
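To make the length normalization mentioned in this hunk concrete, here is a minimal sketch of character-length-normalized log-probability accuracy. The function name and data layout are illustrative assumptions, not code from the guidebook.

```python
# Minimal sketch: accuracy over multiple-choice options scored by
# log-probability, normalized by character length so longer options
# are not penalized. The data layout here is an illustrative assumption.
def char_norm_accuracy(examples: list[dict]) -> float:
    correct = 0
    for ex in examples:
        # Score each option by its total logprob divided by its character length.
        scores = {
            opt: logprob / char_len
            for opt, (logprob, char_len) in ex["options"].items()
        }
        correct += max(scores, key=scores.get) == ex["gold"]
    return correct / len(examples)

examples = [
    {"options": {"A": (-4.2, 7), "B": (-9.1, 12)}, "gold": "A"},
    {"options": {"A": (-8.0, 5), "B": (-6.5, 10)}, "gold": "B"},
]
print(char_norm_accuracy(examples))  # 1.0 on this toy data
```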
@@ -183,7 +182,7 @@ Normalizations can easily [be unfair if not designed well](https://huggingface.c
 
 They are also important for evaluating predictions generated with chain of thought, or reasoning, as you'll need to remove the reasoning trace (which is not part of the final answer) from the output to get the actual answer.
 
-####
+#### Sampling
 
 When models generate outputs, sampling multiple times and aggregating results can provide a more robust signal than a single greedy generation.
 This is particularly important for complex reasoning tasks where models may arrive at correct answers through different paths.
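As a concrete illustration of aggregating sampled generations, here is a minimal maj@k sketch (majority vote over k samples); `extract_answer` is a stand-in assumption for whatever answer parsing your task needs.

```python
from collections import Counter

# Minimal sketch of maj@k: sample k generations per prompt, parse the final
# answer out of each, and keep the most frequent answer as the prediction.
def majority_vote(generations: list[str], extract_answer) -> str:
    answers = [extract_answer(g) for g in generations]
    return Counter(answers).most_common(1)[0][0]

samples = ["... so the answer is 42", "... giving 41", "... therefore 42"]
print(majority_vote(samples, lambda g: g.split()[-1]))  # "42"
```

A pass@k aggregation would instead check whether any of the k parsed answers matches the reference.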
@@ -204,7 +203,7 @@ When you use sampling evaluations, make sure to always report all sampling param
 However, keep in mind that sampling k times multiplies your evaluation cost by k. For expensive models or large datasets, this adds up very quickly!
 </Note>
 
-####
+#### Functional scorers
 Instead of comparing generated text to a reference through fuzzy string matching, functional testing evaluates whether outputs satisfy specific verifiable constraints. This approach is extremely promising because it's more flexible and allows "infinite" updates of the test cases through rule-based generation (which reduces overfitting).
 
 **IFEval and IFBench** are excellent examples of this approach for instruction-following evaluation. Rather than asking "does this text match a reference answer?", they ask "does this text satisfy formatting constraints given in the instructions?"
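To show what a functional scorer can look like in practice, here is a minimal sketch in the spirit of IFEval; the specific constraints below are hypothetical examples, not IFEval's actual checks.

```python
# Minimal sketch of a functional scorer: verify checkable constraints on the
# output instead of matching a reference string. Constraints are hypothetical.
def check_constraints(output: str) -> dict[str, bool]:
    lines = output.splitlines()
    return {
        "has_three_bullets": sum(line.startswith("- ") for line in lines) >= 3,
        "under_100_words": len(output.split()) < 100,
        "ends_with_period": output.strip().endswith("."),
    }

checks = check_constraints("- one\n- two\n- three.")
print(sum(checks.values()) / len(checks))  # fraction of constraints satisfied
```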
@@ -223,7 +222,6 @@ This functional approach works particularly well for instruction following, but
 Functional testing is inspired by code evaluation, where functional testing through unit tests is standard practice (checking if generated code produces correct outputs for given inputs).
 </Sidenote>
 
-
 ### With humans
 Human evaluation is simply asking humans to score predictions.
 
@@ -511,3 +509,10 @@ On the other hand they:
 - For reward models that rate single prompts and completions, you can cache the scores of many reference models and easily see how a new model performs.
 - Tracking win rates or probabilities over training, e.g. as in [this](https://arxiv.org/abs/2410.11677v1) paper, can allow you to detect model degradation and select optimal checkpoints.
 </Note>
+
+### Calibration and confidence
+
+When reporting evaluation results, it's critical to include **confidence intervals** alongside point estimates.
+
+These confidence intervals can be obtained from standard deviations over the scores or [bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)). For automatic metrics this is relatively trivial; for model judges, a [recent paper](https://arxiv.org/pdf/2511.21140) suggested bias correction with estimators. For human-based evaluations, you should report inter-annotator agreement.
+
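For the automatic-metric case, a percentile bootstrap over per-example scores takes only a few lines. This is a minimal sketch under the assumption that scores are per-example 0/1 accuracies.

```python
import random

# Minimal sketch: percentile-bootstrap 95% confidence interval over
# per-example scores (assumed here to be 0/1 accuracies).
def bootstrap_ci(scores: list[float], n_resamples: int = 10_000, seed: int = 0):
    rng = random.Random(seed)
    # Resample the score list with replacement and record each resampled mean.
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    return means[int(0.025 * n_resamples)], means[int(0.975 * n_resamples)]

scores = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # toy per-example accuracies
low, high = bootstrap_ci(scores)
print(f"mean={sum(scores) / len(scores):.2f}, 95% CI=({low:.2f}, {high:.2f})")
```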