Clémentine committed on
Commit 7816b35 · 1 Parent(s): 517d5ef

added epochai's latest report

app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx CHANGED
@@ -24,6 +24,8 @@ When aggregating datasets, pay attention to whether
 
  <Sidenote>Examples: MMLU, Big-Bench (hundreds of diverse tasks), and HELM (combines multiple existing benchmarks for holistic evaluation)</Sidenote>
 
+ New research by EpochAI (2025) showcases how to [best aggregate benchmarks together under a single framework](https://epoch.ai/blog/a-rosetta-stone-for-ai-benchmarks) to make the aggregated dataset harder overall and less prone to saturation.
+
  <UsingHumanAnnotators />
 
  #### Creating a dataset synthetically