reupdate line charts in article

app/src/content/chapters/general-knowledge/picking-your-evaluation.mdx

@@ -35,7 +35,32 @@ One of our core requirements for a task is that it can be learned from training
To measure this, we used the **Spearman rank correlation** to quantify the correlation between steps and score. Spearman rank correlation can capture monotonicity even when scores don't evolve linearly with the number of steps. We required each task to have an average correlation of at least 0.5 across all model training runs.
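As a rough sketch of this criterion (not the evaluation pipeline itself; the two runs below are made-up numbers), the per-task monotonicity can be computed with `scipy`:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-run checkpoint scores: {run_name: (steps, scores)}.
runs = {
    "run_a": ([1000, 2000, 3000, 4000], [0.31, 0.35, 0.41, 0.44]),
    "run_b": ([1000, 2000, 3000, 4000], [0.30, 0.38, 0.37, 0.45]),
}

def monotonicity(runs):
    """Average Spearman rank correlation between training steps and scores."""
    rhos = []
    for steps, scores in runs.values():
        rho, _ = spearmanr(steps, scores)
        rhos.append(rho)
    return float(np.mean(rhos))

# Keep the task only if the average correlation is at least 0.5.
print(monotonicity(runs) >= 0.5)
```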


-<HtmlEmbed
+<HtmlEmbed
+  src="d3-two-lines-chart.html"
+  config={{
+    charts: [
+      {
+        title: "✅ Good monotonicity: mlmm_hellaswag_fra_cf [fr]",
+        language: "French",
+        task: "mlmm_hellaswag_fra_cf",
+        metric: "acc_norm_token"
+      },
+      {
+        title: "❌ Bad monotonicity: mlmm_truthfulqa_ara_cf:mc1 [ar]",
+        language: "Arabic",
+        task: "mlmm_truthfulqa_ara_cf:mc1",
+        metric: "acc_norm_token"
+      }
+    ],
+    statLabel: "Monotonicity",
+    smoothing: true,
+    smoothingWindow: 5,
+    smoothingCurve: "monotoneX",
+    xAxisLabel: "Training Tokens (billions)",
+    yAxisLabel: "Score"
+  }}
+  frameless={true}
+/>

#### Low noise

@@ -51,7 +76,33 @@ For each task, we computed:

We aimed for each task to have an SNR > 20. The only exceptions to this rule are generative tasks, which typically have relatively low SNR but are still worth including, as they provide insight into how the model behaves when prompted to generate unconstrained (without answer options). In a multilingual setting this is particularly relevant, as some models trained on multiple languages can exhibit high task scores but then suddenly reply in the wrong language on generative tasks!
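The exact quantities entering the SNR are listed earlier in the section; purely as an illustration, a sketch that assumes the signal is the mean of the last few checkpoint scores and the noise is their standard deviation (an assumed definition, not necessarily the one used here) could look like:

```python
import numpy as np

def snr(scores, last_k=5):
    """Assumed SNR: mean of the last k checkpoint scores over their standard deviation."""
    tail = np.asarray(scores)[-last_k:]
    return float(tail.mean() / tail.std())

# Hypothetical checkpoint scores near the end of training.
scores = [0.52, 0.55, 0.54, 0.56, 0.55, 0.57]
print(snr(scores) > 20)  # target from the text: SNR > 20
```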

-<HtmlEmbed
+<HtmlEmbed
+  src="d3-two-lines-chart.html"
+  config={{
+    charts: [
+      {
+        title: "✅ Good SNR: xstory_cloze_tel_cf [te]",
+        language: "Telugu",
+        task: "xstory_cloze_tel_cf",
+        metric: "acc_norm_token"
+      },
+      {
+        title: "❌ Bad SNR: tydiqa_tel [te]",
+        language: "Telugu",
+        task: "tydiqa_tel",
+        metric: "prefix_match"
+      }
+    ],
+    statLabel: "SNR",
+    groupSeeds: false,
+    smoothing: true,
+    smoothingWindow: 5,
+    smoothingCurve: "monotoneX",
+    xAxisLabel: "Training Tokens (billions)",
+    yAxisLabel: "Score"
+  }}
+  frameless={true}
+/>

<Note>
Assuming model performance is normally distributed across different seeds, we want the benchmark-run performance to be at least 3 final-stds above the benchmark random baseline. This would mean that 99.85% of seed scores are above the random baseline (formally, benchmark-run performance - benchmark random baseline > 3 * final-std).
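Expressed as a check (a sketch with illustrative numbers, not pipeline code):

```python
# Illustrative numbers, not results from the article.
benchmark_run_performance = 0.46   # final score of a training run on the benchmark
benchmark_random_baseline = 0.25   # expected score of random guessing
final_std = 0.05                   # std of scores across seeds at the end of training

# The note's criterion: performance should sit more than 3 final-stds above the baseline.
keep_task = benchmark_run_performance - benchmark_random_baseline > 3 * final_std
print(keep_task)
```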
@@ -64,7 +115,32 @@ Many model capabilities are acquired later in training, thus **many tasks** (esp
We first computed the baseline random performance of the task (as the sum of 1/n_choices for all samples for multiple choice questions, and as zero for generative evaluations). Then we calculated the task's distance from the baseline as the maximum score across all models minus the baseline.
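A sketch of this computation, assuming accuracy-style scores and taking the baseline as the expected accuracy of random guessing over the task's samples (the numbers are illustrative):

```python
import numpy as np

def distance_from_baseline(n_choices, max_model_score):
    """Distance of the best model score from the task's random baseline."""
    # Expected accuracy of random guessing: 1/n_choices per sample, averaged over the task.
    # For generative evaluations the baseline is simply 0.
    baseline = float(np.mean([1.0 / n for n in n_choices]))
    return max_model_score - baseline

# Hypothetical multiple-choice task: three 4-option samples and one 2-option sample,
# with a best score of 0.61 across all models.
print(distance_from_baseline([4, 4, 4, 2], max_model_score=0.61))
```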


-<HtmlEmbed
+<HtmlEmbed
+  src="d3-two-lines-chart.html"
+  config={{
+    charts: [
+      {
+        title: "✅ Non-random: agieval_zho_cf/acc_pmi [zh]",
+        language: "Chinese",
+        task: "agieval_zho_cf:_average",
+        metric: "acc_norm_pmi"
+      },
+      {
+        title: "❌ Random perf: agieval_zho_cf/acc [zh]",
+        language: "Chinese",
+        task: "agieval_zho_cf:_average",
+        metric: "acc"
+      }
+    ],
+    statLabel: "Non-Randomness",
+    smoothing: true,
+    smoothingWindow: 5,
+    smoothingCurve: "monotoneX",
+    xAxisLabel: "Training Tokens (billions)",
+    yAxisLabel: "Score"
+  }}
+  frameless={true}
+/>

#### Model Ordering Consistency

@@ -82,7 +158,32 @@ To measure this consistency in task ordering, we computed the average **Kendall'
We had no strict minimum value requirement for this property, instead using it to establish comparisons between tasks.
</Note>
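As an illustrative sketch (assuming, purely for illustration, that the rankings compared are the model orderings the task induces at successive checkpoints), the average Kendall's Tau could be computed as:

```python
import numpy as np
from scipy.stats import kendalltau

def ordering_consistency(scores_by_step):
    """Average Kendall's Tau between model orderings at successive checkpoints."""
    taus = []
    for previous, current in zip(scores_by_step, scores_by_step[1:]):
        tau, _ = kendalltau(previous, current)
        taus.append(tau)
    return float(np.mean(taus))

# Hypothetical scores of three models at three checkpoints.
scores_by_step = [
    [0.30, 0.28, 0.35],
    [0.33, 0.31, 0.39],
    [0.36, 0.30, 0.42],
]
print(ordering_consistency(scores_by_step))
```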

-<HtmlEmbed
+<HtmlEmbed
+  src="d3-two-lines-chart.html"
+  config={{
+    charts: [
+      {
+        title: "✅ Good ordering: xcsqa_ara_cf [ar]",
+        language: "Arabic",
+        task: "xcsqa_ara_cf",
+        metric: "acc_norm_token"
+      },
+      {
+        title: "❌ Bad ordering: thai_exams_tha_cf [th]",
+        language: "Thai",
+        task: "thai_exams_tha_cf:_average",
+        metric: "acc_norm_token"
+      }
+    ],
+    statLabel: "Kendall's Tau",
+    smoothing: true,
+    smoothingWindow: 5,
+    smoothingCurve: "monotoneX",
+    xAxisLabel: "Training Tokens (billions)",
+    yAxisLabel: "Score"
+  }}
+  frameless={true}
+/>


#### Metrics