Clémentine committed
Commit · 517d5ef
1 Parent(s): 5c8d875
comments on contamination + saturation
app/src/content/article.mdx
CHANGED
@@ -47,7 +47,21 @@ Now that you have an idea of why evaluation is important to different people, le

## Evaluating with existing benchmarks

-Now that you've gotten (re)acquainted with the required basics of how tokenization and inference work, and what the caveats are when doing evaluation, let's look at actual benchmarking! We'll first do a small tour of 2025 evaluations, then discuss what to look at in a benchmark, and why you probably can't reproduce announced scores. Lastly, we'll cover the special case of selecting good benchmarks to evaluate training, with the FineWeb team
+Now that you've gotten (re)acquainted with the required basics of how tokenization and inference work, and what the caveats are when doing evaluation, let's look at actual benchmarking! We'll first do a small tour of 2025 evaluations, then discuss what to look at in a benchmark, and why you probably can't reproduce announced scores. Lastly, we'll cover the special case of selecting good benchmarks to evaluate training, with the FineWeb team.
+
+<Note title="Important concepts" emoji="⚠️" variant="info">
+In this section, you'll see two concepts mentioned quite a lot: contamination and saturation.
+
+**Saturation** is when model performance on a benchmark surpasses human performance. More generally, the term is used for datasets that are no longer considered useful, as they have lost discriminative power between models.
+<Sidenote> It's what you observe in the banner picture! </Sidenote>
+
+*If all models have close to the highest possible score on your evaluation, it's no longer a discriminative benchmark. It's similar to evaluating high school students on pre-school problems: success tells you nothing (though failure is indicative).*
+
+**Contamination** is when an evaluation dataset ended up in the training dataset of models, in which case model performance is artificially inflated and does not reflect real-world performance on the task.
+
+*It's a bit like evaluating a student on questions they already know in advance.*
+
+</Note>

### Benchmarks to know in 2025

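To make these two definitions concrete, here is a minimal Python sketch of how one might flag saturation from a table of scores and probe contamination with a crude n-gram overlap check. It is purely illustrative: the scores and `training_corpus` below are made up, and real contamination checks (longer n-grams, fuzzy matching, deduplication at corpus scale) are considerably more involved.

```python
# Illustrative only: made-up benchmark scores and a toy "training corpus".
scores = {"model_a": 0.97, "model_b": 0.96, "model_c": 0.98}

def looks_saturated(scores: dict[str, float], ceiling: float = 1.0, margin: float = 0.05) -> bool:
    """A benchmark has lost discriminative power once every model sits within `margin` of the ceiling."""
    return all(ceiling - score <= margin for score in scores.values())

def ngram_overlap(eval_sample: str, training_corpus: list[str], n: int = 5) -> float:
    """Fraction of the sample's n-grams found verbatim in the training corpus (a crude contamination probe)."""
    def ngrams(text: str) -> set[str]:
        tokens = text.lower().split()
        return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    sample_ngrams = ngrams(eval_sample)
    corpus_ngrams = set().union(*(ngrams(doc) for doc in training_corpus))
    return len(sample_ngrams & corpus_ngrams) / len(sample_ngrams) if sample_ngrams else 0.0

print(looks_saturated(scores))  # True: all models are near the ceiling, so scores no longer separate them
print(ngram_overlap(
    "Paris is the capital of France",
    ["We know that Paris is the capital of France and its largest city."],
))  # 1.0: every 5-gram of the eval sample appears verbatim in the toy corpus
```

High verbatim overlap between evaluation samples and training documents is exactly the situation where scores stop reflecting real-world performance.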
app/src/content/chapters/general-knowledge/2025-evaluations-for-useful-models.mdx
CHANGED
@@ -5,7 +5,11 @@ title: "2025 evaluations"
import Note from "../../../components/Note.astro";
import Sidenote from "../../../components/Sidenote.astro";

-You can evaluate **specific capabilities** on their own - it's usually quite interesting to get signal when training, or when comparing base/pretrained models. (However, if you select and validate your training methods with the following evaluations, reporting on them on the final model is slightly biased, as you have already oriented your training method towards good results on them.)
+You can evaluate **specific capabilities** on their own - it's usually quite interesting to get signal when training, or when comparing base/pretrained models. (However, if you select and validate your training methods with the following evaluations, reporting on them on the final model is slightly biased, as you have already oriented your training method towards good results on them.)
+
+<Note>
+Feel free to skim this section if you're not very familiar with evaluation yet, and come back to it once you need to find a dataset for a specific capability :)
+</Note>

#### Reasoning and commonsense

@@ -155,7 +159,9 @@ A similar approach is used to generate questions in [Arbitrage](https://arxiv.or

In a similar vein, you'll also find arenas where LLMs are provided with money to actively trade on financial markets (like Alpha Arena or Trading Agents) - these experiments are less likely to give meaningful results, as, because of their cost, they tend to be run only once per model, so you get no statistical significance there.

-
+#### Recommendations
+
+<Note title="TLDR" emoji="🎯" variant="info">
The landscape of evaluation has evolved with the jumps in capabilities, from testing isolated skills to measuring integrated performance in more realistic scenarios.

As of Nov 2025, I recommend using:

@@ -170,4 +176,8 @@ The field is moving toward evaluations that test capability orchestration rather
<Sidenote>
I hope the field moves towards putting more emphasis on functional testing rather than model judges, and generally understandable datasets and tasks.
</Sidenote>
-</Note>
+</Note>
+
+<Note>
+If you want to explore even more datasets, you'll find a big list of older interesting benchmarks [here](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/automated-benchmarks/some-evaluation-datasets.md) with my notes.
+</Note>
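On the sidenote's point about functional testing rather than model judges, here is a small, purely illustrative sketch (not code from the guidebook; the predictions are invented) of a functional scorer: a deterministic rule applied to the model output, here extracting and comparing a final numeric answer.

```python
import re

def extract_final_number(answer: str) -> str | None:
    """Return the last number mentioned in a free-form answer, if any."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", answer.replace(",", ""))
    return matches[-1] if matches else None

def functional_score(prediction: str, reference: str) -> bool:
    """Deterministic, reproducible check: no judge model in the loop."""
    return extract_final_number(prediction) == extract_final_number(reference)

# Invented model outputs, graded against the reference answer "42".
print(functional_score("After simplifying, the answer is 42.", "42"))  # True
print(functional_score("I'd estimate somewhere around 40.", "42"))     # False
```

Checks like this are cheap to rerun and easy to audit, which is what makes functional testing attractive compared to judge models.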