Clémentine committed
Commit 7816b35
Parent(s): 517d5ef
added epochai's latest report
app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx
CHANGED
@@ -24,6 +24,8 @@ When aggregating datasets, pay attention to whether
 
 <Sidenote>Examples: MMLU, Big-Bench (hundreds of diverse tasks), and HELM (combines multiple existing benchmarks for holistic evaluation)</Sidenote>
 
+New research by EpochAI (2025) showcases how to [best aggregate benchmarks together under a single framework](https://epoch.ai/blog/a-rosetta-stone-for-ai-benchmarks) to make the aggregated dataset harder overall and less prone to saturation.
+
 <UsingHumanAnnotators />
 
 #### Creating a dataset synthetically
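For context on the kind of aggregation the added paragraph refers to, here is a minimal sketch (not part of the commit, and not the chapter's actual code) of pooling two existing benchmarks into a single evaluation set with a shared schema while keeping provenance so per-benchmark scores remain recoverable. The dataset names, configs, and column mapping are illustrative assumptions; a real pipeline would need a per-benchmark adapter and the difficulty normalization the Epoch AI post discusses.

```python
# Sketch: pool several benchmarks into one evaluation set with a shared schema.
# Assumes both source datasets expose "question" and "answer" columns.
from datasets import load_dataset, concatenate_datasets

def to_shared_format(example, source):
    # Reduce every benchmark to a common (prompt, reference, source) schema.
    # Note: for multiple-choice sets like MMLU, "answer" is the option index;
    # a real adapter would render the choices into the prompt as well.
    return {
        "prompt": str(example["question"]),
        "reference": str(example["answer"]),
        "source_benchmark": source,
    }

# Illustrative choice of source benchmarks: (repo_id, config, split).
SOURCES = [
    ("openai/gsm8k", "main", "test"),
    ("cais/mmlu", "all", "test"),
]

pooled = []
for repo_id, config, split in SOURCES:
    ds = load_dataset(repo_id, config, split=split)
    ds = ds.map(
        to_shared_format,
        fn_kwargs={"source": repo_id},
        remove_columns=ds.column_names,
    )
    pooled.append(ds)

# With identical features, the benchmarks can be concatenated into one dataset;
# the "source_benchmark" column keeps per-benchmark breakdowns recoverable.
eval_set = concatenate_datasets(pooled)
print(eval_set)
```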