Clémentine committed
Commit 7816b35
Parent(s): 517d5ef
added epochai's latest report
app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx
CHANGED
@@ -24,6 +24,8 @@ When aggregating datasets, pay attention to whether
 
 <Sidenote>Examples: MMLU, Big-Bench (hundreds of diverse tasks), and HELM (combines multiple existing benchmarks for holistic evaluation)</Sidenote>
 
+New research by EpochAI (2025) showcases how to [best aggregate benchmarks together under a single framework](https://epoch.ai/blog/a-rosetta-stone-for-ai-benchmarks) to make the aggregated dataset harder overall and less prone to saturation.
+
 <UsingHumanAnnotators />
 
 #### Creating a dataset synthetically
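For context on the kind of aggregation the added paragraph refers to, here is a minimal sketch (not part of the commit, and not the chapter's actual code) of pooling two existing benchmarks into a single evaluation set with a shared schema while keeping provenance so per-benchmark scores remain recoverable. The dataset names, configs, and column mapping are illustrative assumptions; a real pipeline would need a per-benchmark adapter and the difficulty normalization the Epoch AI post discusses.

```python
# Sketch: pool several benchmarks into one evaluation set with a shared schema.
# Assumes both source datasets expose "question" and "answer" columns.
from datasets import load_dataset, concatenate_datasets

def to_shared_format(example, source):
    # Reduce every benchmark to a common (prompt, reference, source) schema.
    # Note: for multiple-choice sets like MMLU, "answer" is the option index;
    # a real adapter would render the choices into the prompt as well.
    return {
        "prompt": str(example["question"]),
        "reference": str(example["answer"]),
        "source_benchmark": source,
    }

# Illustrative choice of source benchmarks: (repo_id, config, split).
SOURCES = [
    ("openai/gsm8k", "main", "test"),
    ("cais/mmlu", "all", "test"),
]

pooled = []
for repo_id, config, split in SOURCES:
    ds = load_dataset(repo_id, config, split=split)
    ds = ds.map(
        to_shared_format,
        fn_kwargs={"source": repo_id},
        remove_columns=ds.column_names,
    )
    pooled.append(ds)

# With identical features, the benchmarks can be concatenated into one dataset;
# the "source_benchmark" column keeps per-benchmark breakdowns recoverable.
eval_set = concatenate_datasets(pooled)
print(eval_set)
```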