Clémentine committed on
Commit 8bcaee3 · 1 Parent(s): c768f1d
README.md CHANGED
@@ -1,6 +1,6 @@
  ---
- title: 'Bringing paper to life: A modern template for scientific writing'
- short_desc: 'A practical journey behind training SOTA LLMs'
+ title: 'Evaluation Guidebook'
+ short_desc: 'How to properly evaluate LLMs in the modern age'
  emoji: 📝
  colorFrom: blue
  colorTo: indigo
app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx CHANGED
@@ -122,8 +122,7 @@ However, nowadays most evaluations are generative: using generations (QA, questi

  If you are looking at **log-probabilities**, your metrics are going to be easy: you'll likely want to look at a variant of accuracy (how often the most likely choice is the best choice). It's important to normalize it by length (either character, token, or pmi). You could also look at perplexity, recall, or f1 score.

- If you're looking at generative evaluations, you'll need to decide if you compare generations as they are, or first normalize them with something. Then, you'll need to select what to use to score your prediction, and this is where it gets tricky, so let's jump to the next chapter specifically on this!
-
+ If you're looking at generative evaluations, this is where it gets tricky, so the next chapter is specifically on this!

  ## The hardest part of evaluation: Scoring free form text

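To make the log-probability setup above concrete, here is a minimal sketch of length-normalized accuracy over multiple-choice continuations; the input format and helper names are hypothetical, assuming you already have per-token log-probabilities for each candidate:

```python
# Minimal sketch: accuracy over log-probability multiple choice,
# normalized by character length (input format is hypothetical).

def pick_choice(candidates: list[dict]) -> int:
    """Return the index of the candidate with the highest
    character-length-normalized sum of token log-probabilities.

    Each candidate looks like: {"text": "Paris", "token_logprobs": [-0.2, -0.1]}
    """
    scores = [
        sum(c["token_logprobs"]) / max(len(c["text"]), 1)
        for c in candidates
    ]
    return max(range(len(candidates)), key=lambda i: scores[i])

def accuracy(samples: list[dict]) -> float:
    """samples: [{"candidates": [...], "gold_index": 1}, ...]"""
    correct = sum(pick_choice(s["candidates"]) == s["gold_index"] for s in samples)
    return correct / len(samples)
```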
@@ -183,7 +182,7 @@ Normalizations can easily [be unfair if not designed well](https://huggingface.c

  They are also important for evaluation of predictions generated with chain of thought, or reasoning, as you'll need to remove the reasoning trace (which is not part of the final answer) from the output to get the actual answer.

- #### Adding sampling
+ #### Sampling

  When models generate outputs, sampling multiple times and aggregating results can provide a more robust signal than a single greedy generation.
  This is particularly important for complex reasoning tasks where models may arrive at correct answers through different paths.
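As a sketch of one possible aggregation, majority voting over k sampled answers could look like this; the `generate` callable stands in for your own sampling call and is an assumption, not a specific library API:

```python
from collections import Counter
from typing import Callable

def majority_vote(prompt: str, generate: Callable[[str], str], k: int = 8) -> str:
    """Sample k answers for one prompt (temperature > 0) and keep the most frequent one.

    Assumes `generate` already returns a normalized/extracted final answer string.
    """
    answers = [generate(prompt) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```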
@@ -204,7 +203,7 @@ When you use sampling evaluations, make sure to always report all sampling param
  However, keep in mind that sampling k times multiplies your evaluation cost by k. For expensive models or large datasets, this adds up very quickly!
  </Note>

- #### Using functional testing
+ #### Functional scorers
  Instead of comparing generated text to a reference through fuzzy string matching, functional testing evaluates whether outputs satisfy specific verifiable constraints. This approach is extremely promising because it's more flexible and allows "infinite" updates of the test case through rule-based generation (which reduces overfitting).

  **IFEval and IFBench** are excellent examples of this approach for instruction following evaluation. Rather than asking "does this text match a reference answer?", they ask "does this text satisfy formatting constraints given in the instructions?"
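For illustration only, a functional scorer in this spirit might check formatting constraints directly; the constraint below ("exactly three lowercase bullet points") is a made-up example, not IFEval's actual implementation:

```python
def satisfies_constraints(output: str) -> bool:
    """Hypothetical check: the answer must contain exactly three bullet points,
    all fully lowercase. Pass/fail is decided by rules, not string similarity."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    bullets = [line for line in lines if line.startswith("- ")]
    return len(bullets) == 3 and all(line == line.lower() for line in bullets)
```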
@@ -223,7 +222,6 @@ This functional approach works particularly well for instruction following, but
  Functional testing is inspired by code evaluation, where functional testing through unit tests is standard practice (checking if generated code produces correct outputs for given inputs).
  </Sidenote>

-
  ### With humans
  Human evaluation is simply asking humans to score predictions.

@@ -511,3 +509,10 @@ On the other hand they:
  - For reward models that rate single prompts and completions, you can cache the scores of many reference models and easily see how a new model performs.
  - Tracking of win rates or probabilities over training, e.g. as in [this](https://arxiv.org/abs/2410.11677v1) paper, can allow you to detect model degradation and select optimal checkpoints.
  </Note>
+
+ ### Calibration and confidence
+
+ When reporting evaluation results, it's critical to include **confidence intervals** alongside point estimates.
+
+ These confidence intervals can be obtained from standard deviations over the scores or from [bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)). For automatic metrics this is relatively trivial; for model judges, a [recent paper](https://arxiv.org/pdf/2511.21140) suggested bias correction with estimators. For human-based evaluations, you should report agreement.
+
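For the automatic-metric case, a percentile bootstrap over per-sample scores is one simple way to get such an interval; this is a minimal sketch, with the 95% level and 1,000 resamples as arbitrary choices:

```python
import random

def bootstrap_ci(scores: list[float], n_resamples: int = 1000, alpha: float = 0.05):
    """Percentile bootstrap confidence interval for the mean of per-sample scores."""
    means = []
    for _ in range(n_resamples):
        resample = random.choices(scores, k=len(scores))  # resample with replacement
        means.append(sum(resample) / len(resample))
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```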