Clémentine committed on
Commit 517d5ef · 1 Parent(s): 5c8d875

comments on contamination + saturation

app/src/content/article.mdx CHANGED
@@ -47,7 +47,21 @@ Now that you have an idea of why evaluation is important to different people, le
 
## Evaluating with existing benchmarks
 
- Now that you've gotten (re)acquainted with the required basics of how tokenization and inference work, and what the caveats are when doing evaluation, let's look at actual benchmarking! We'll first do a small tour of 2025 evaluations, then discuss what to look at in a benchmark, and why you probably can't reproduce announced scores. Lastly, we'll cover the special case of selecting good benchmarks to evaluate training, with the FineWeb team
+ Now that you've gotten (re)acquainted with the required basics of how tokenization and inference work, and what the caveats are when doing evaluation, let's look at actual benchmarking! We'll first do a small tour of 2025 evaluations, then discuss what to look at in a benchmark, and why you probably can't reproduce announced scores. Lastly, we'll cover the special case of selecting good benchmarks to evaluate training, with the FineWeb team.
+
+ <Note title="Important concepts" emoji="⚠️" variant="info">
+ In this section, you'll see two concepts mentioned quite a lot: contamination and saturation.
+
+ **Saturation** is when model performance on a benchmark passes human performance. More generally, the term is used for datasets that are no longer considered useful, as they have lost discriminative power between models.
+ <Sidenote> It's what you observe in the banner picture! </Sidenote>
+
+ *If all models have close to the highest possible score on your evaluation, it's no longer a discriminative benchmark. It's similar to evaluating high school students on pre-school problems: success tells you nothing (though failure is indicative).*
+
+ **Contamination** is when an evaluation dataset has ended up in the training data of models, in which case their performance is artificially inflated and does not reflect real-world performance on the task.
+
+ *It's a bit like evaluating a student on questions they already know in advance.*
+
+ </Note>
 
### Benchmarks to know in 2025
 
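To make the two concepts in the note above a bit more concrete, here is a minimal Python sketch of how one might screen for them: a saturation check that flags a benchmark when every model scores near the ceiling, and a crude contamination check based on word n-gram overlap between an evaluation sample and a training corpus. The helper names (`is_saturated`, `ngram_overlap`), the thresholds, and the toy data are illustrative assumptions, not code from the article or any library.

```python
# Hypothetical helpers, not code from the article: crude checks for the two
# failure modes described in the note (saturation and contamination).


def is_saturated(scores: list[float], ceiling: float = 1.0, margin: float = 0.05) -> bool:
    """Flag a benchmark as saturated when every model scores within `margin` of the
    ceiling, i.e. the benchmark no longer separates models."""
    return all(score >= ceiling - margin for score in scores)


def ngram_overlap(eval_text: str, train_ngrams: set[tuple[str, ...]], n: int = 8) -> float:
    """Fraction of the eval sample's word n-grams that also appear in the training corpus.
    High overlap is a rough contamination signal; `train_ngrams` is assumed pre-built."""
    tokens = eval_text.split()
    ngrams = {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}
    if not ngrams:
        return 0.0
    return len(ngrams & train_ngrams) / len(ngrams)


if __name__ == "__main__":
    # Saturation: four made-up models all above 95% accuracy -> no discriminative power left.
    print(is_saturated([0.97, 0.98, 0.96, 0.99]))  # True

    # Contamination: the eval question reappears verbatim in a toy training corpus.
    train_tokens = "The capital of France is Paris and it is known for the Eiffel Tower".split()
    train = {tuple(train_tokens[i : i + 8]) for i in range(len(train_tokens) - 8 + 1)}
    print(ngram_overlap("The capital of France is Paris and it is known", train))  # 1.0
```

Real contamination checks on web-scale corpora need more scalable machinery (hashed n-gram indexes, suffix arrays, or embedding similarity), but the idea is the same: if an eval item largely reappears in the training data, its score is inflated.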
app/src/content/chapters/general-knowledge/2025-evaluations-for-useful-models.mdx CHANGED
@@ -5,7 +5,11 @@ title: "2025 evaluations"
import Note from "../../../components/Note.astro";
import Sidenote from "../../../components/Sidenote.astro";
 
- You can evaluate **specific capabilities** on their own - it's usually quite interesting to get signal during training, or when comparing base/pretrained models. (However, if you select and validate your training methods with the following evaluations, reporting on them for the final model is slightly biased, as you have already oriented your training towards good results on them.)
+ You can evaluate **specific capabilities** on their own - it's usually quite interesting to get signal during training, or when comparing base/pretrained models. (However, if you select and validate your training methods with the following evaluations, reporting on them for the final model is slightly biased, as you have already oriented your training towards good results on them.)
+
+ <Note>
+ Feel free to skim this section if you're not very familiar with evaluation yet, and come back to it once you need to find a dataset for a specific capability :)
+ </Note>
 
#### Reasoning and commonsense
 
@@ -155,7 +159,9 @@ A similar approach is used to generate questions in [Arbitrage](https://arxiv.or
 
In a similar vein, you'll also find arenas where LLMs are provided with money to actively trade on financial markets (like Alpha Arena or Trading Agents) - these experiments are less likely to give meaningful results: because of their costs, they tend to be run only once per model, so you get no statistical significance.
 
- <Note title="TLDR" emoji="🎯">
+ #### Recommendations
+
+ <Note title="TLDR" emoji="🎯" variant="info">
The landscape of evaluation has evolved with the jumps in capabilities, from testing isolated skills to measuring integrated performance in more realistic scenarios.
 
As of Nov 2025, I recommend using:
@@ -170,4 +176,8 @@ The field is moving toward evaluations that test capability orchestration rather
<Sidenote>
I hope the field moves towards putting more emphasis on functional testing rather than model judges, and generally understandable datasets and tasks.
</Sidenote>
- </Note>
+ </Note>
+
+ <Note>
+ If you want to explore even more datasets, you'll find a big list of older interesting benchmarks [here](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/automated-benchmarks/some-evaluation-datasets.md) with my notes.
+ </Note>
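On the point above that financial arenas tend to be run only once per model: a quick, hypothetical bootstrap sketch shows why single runs carry no statistical weight. The agent names and per-run returns below are made up for illustration; only repeated runs let you put a confidence interval around a model's result.

```python
import random
import statistics

# Illustrative only: per-run returns (%) for two hypothetical agents are made up.


def bootstrap_ci(returns: list[float], n_boot: int = 10_000, alpha: float = 0.05) -> tuple[float, float]:
    """Percentile-bootstrap confidence interval for the mean return across repeated runs."""
    means = sorted(
        statistics.mean(random.choices(returns, k=len(returns))) for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


if __name__ == "__main__":
    random.seed(0)
    agent_a = [2.1, -1.4, 3.0, 0.2, -0.8]  # five repeated runs of the same agent
    agent_b = [1.5, 2.4, -0.3, 1.1, 0.9]
    print("A:", bootstrap_ci(agent_a))
    print("B:", bootstrap_ci(agent_b))
    # The intervals overlap heavily, so even five runs don't cleanly rank A vs B;
    # with a single run per model there is nothing to resample at all.
```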