reupdate line charts in article

app/src/content/chapters/general-knowledge/picking-your-evaluation.mdx

@@ -35,7 +35,32 @@ One of our core requirements for a task is that it can be learned from training
To measure this, we used the **Spearman rank correlation** to quantify the correlation between steps and score. Spearman rank correlation can capture monotonicity even when scores don't evolve linearly with the number of steps. We required each task to have an average correlation of at least 0.5 across all model training runs.
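As a rough sketch of this criterion (not the evaluation pipeline itself; the two runs below are made-up numbers), the per-task monotonicity can be computed with `scipy`:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-run checkpoint scores: {run_name: (steps, scores)}.
runs = {
    "run_a": ([1000, 2000, 3000, 4000], [0.31, 0.35, 0.41, 0.44]),
    "run_b": ([1000, 2000, 3000, 4000], [0.30, 0.38, 0.37, 0.45]),
}

def monotonicity(runs):
    """Average Spearman rank correlation between training steps and scores."""
    rhos = []
    for steps, scores in runs.values():
        rho, _ = spearmanr(steps, scores)
        rhos.append(rho)
    return float(np.mean(rhos))

# Keep the task only if the average correlation is at least 0.5.
print(monotonicity(runs) >= 0.5)
```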


-<HtmlEmbed
+<HtmlEmbed
+  src="d3-two-lines-chart.html"
+  config={{
+    charts: [
+      {
+        title: "✅ Good monotonicity: mlmm_hellaswag_fra_cf [fr]",
+        language: "French",
+        task: "mlmm_hellaswag_fra_cf",
+        metric: "acc_norm_token"
+      },
+      {
+        title: "❌ Bad monotonicity: mlmm_truthfulqa_ara_cf:mc1 [ar]",
+        language: "Arabic",
+        task: "mlmm_truthfulqa_ara_cf:mc1",
+        metric: "acc_norm_token"
+      }
+    ],
+    statLabel: "Monotonicity",
+    smoothing: true,
+    smoothingWindow: 5,
+    smoothingCurve: "monotoneX",
+    xAxisLabel: "Training Tokens (billions)",
+    yAxisLabel: "Score"
+  }}
+  frameless={true}
+/>

#### Low noise

@@ -51,7 +76,33 @@ For each task, we computed:

We aimed for each task to have an SNR > 20. The only exceptions to this rule are generative tasks, which typically have relatively low SNR but are still worth including, as they provide insight into how the model behaves when prompted to generate unconstrained (without answer options). In a multilingual setting this is particularly relevant, as some models trained on multiple languages can exhibit high task scores but then suddenly reply in the wrong language on generative tasks!
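The exact quantities entering the SNR are listed earlier in the section; purely as an illustration, a sketch that assumes the signal is the mean of the last few checkpoint scores and the noise is their standard deviation (an assumed definition, not necessarily the one used here) could look like:

```python
import numpy as np

def snr(scores, last_k=5):
    """Assumed SNR: mean of the last k checkpoint scores over their standard deviation."""
    tail = np.asarray(scores)[-last_k:]
    return float(tail.mean() / tail.std())

# Hypothetical checkpoint scores near the end of training.
scores = [0.52, 0.55, 0.54, 0.56, 0.55, 0.57]
print(snr(scores) > 20)  # target from the text: SNR > 20
```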

-<HtmlEmbed
+<HtmlEmbed
+  src="d3-two-lines-chart.html"
+  config={{
+    charts: [
+      {
+        title: "✅ Good SNR: xstory_cloze_tel_cf [te]",
+        language: "Telugu",
+        task: "xstory_cloze_tel_cf",
+        metric: "acc_norm_token"
+      },
+      {
+        title: "❌ Bad SNR: tydiqa_tel [te]",
+        language: "Telugu",
+        task: "tydiqa_tel",
+        metric: "prefix_match"
+      }
+    ],
+    statLabel: "SNR",
+    groupSeeds: false,
+    smoothing: true,
+    smoothingWindow: 5,
+    smoothingCurve: "monotoneX",
+    xAxisLabel: "Training Tokens (billions)",
+    yAxisLabel: "Score"
+  }}
+  frameless={true}
+/>

<Note>
Assuming model performance is normally distributed across different seeds, we want the benchmark-run performance to be at least 3 final-stds above the benchmark random baseline. This would mean that 99.85% of seed scores are above the random baseline (formally, benchmark-run performance - benchmark random baseline > 3 * final-std).
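Expressed as a check (a sketch with illustrative numbers, not pipeline code):

```python
# Illustrative numbers, not results from the article.
benchmark_run_performance = 0.46   # final score of a training run on the benchmark
benchmark_random_baseline = 0.25   # expected score of random guessing
final_std = 0.05                   # std of scores across seeds at the end of training

# The note's criterion: performance should sit more than 3 final-stds above the baseline.
keep_task = benchmark_run_performance - benchmark_random_baseline > 3 * final_std
print(keep_task)
```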
@@ -64,7 +115,32 @@ Many model capabilities are acquired later in training, thus **many tasks** (esp
We first computed the baseline random performance of the task (as the sum of 1/n_choices for all samples for multiple choice questions, and as zero for generative evaluations). Then we calculated the task's distance from the baseline as the maximum score across all models minus the baseline.
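A sketch of this computation, assuming accuracy-style scores and taking the baseline as the expected accuracy of random guessing over the task's samples (the numbers are illustrative):

```python
import numpy as np

def distance_from_baseline(n_choices, max_model_score):
    """Distance of the best model score from the task's random baseline."""
    # Expected accuracy of random guessing: 1/n_choices per sample, averaged over the task.
    # For generative evaluations the baseline is simply 0.
    baseline = float(np.mean([1.0 / n for n in n_choices]))
    return max_model_score - baseline

# Hypothetical multiple-choice task: three 4-option samples and one 2-option sample,
# with a best score of 0.61 across all models.
print(distance_from_baseline([4, 4, 4, 2], max_model_score=0.61))
```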


-<HtmlEmbed
+<HtmlEmbed
+  src="d3-two-lines-chart.html"
+  config={{
+    charts: [
+      {
+        title: "✅ Non-random: agieval_zho_cf/acc_pmi [zh]",
+        language: "Chinese",
+        task: "agieval_zho_cf:_average",
+        metric: "acc_norm_pmi"
+      },
+      {
+        title: "❌ Random perf: agieval_zho_cf/acc [zh]",
+        language: "Chinese",
+        task: "agieval_zho_cf:_average",
+        metric: "acc"
+      }
+    ],
+    statLabel: "Non-Randomness",
+    smoothing: true,
+    smoothingWindow: 5,
+    smoothingCurve: "monotoneX",
+    xAxisLabel: "Training Tokens (billions)",
+    yAxisLabel: "Score"
+  }}
+  frameless={true}
+/>

#### Model Ordering Consistency

@@ -82,7 +158,32 @@ To measure this consistency in task ordering, we computed the average **Kendall'
We had no strict minimum value requirement for this property, instead using it to establish comparisons between tasks.
</Note>
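As an illustrative sketch (assuming, purely for illustration, that the rankings compared are the model orderings the task induces at successive checkpoints), the average Kendall's Tau could be computed as:

```python
import numpy as np
from scipy.stats import kendalltau

def ordering_consistency(scores_by_step):
    """Average Kendall's Tau between model orderings at successive checkpoints."""
    taus = []
    for previous, current in zip(scores_by_step, scores_by_step[1:]):
        tau, _ = kendalltau(previous, current)
        taus.append(tau)
    return float(np.mean(taus))

# Hypothetical scores of three models at three checkpoints.
scores_by_step = [
    [0.30, 0.28, 0.35],
    [0.33, 0.31, 0.39],
    [0.36, 0.30, 0.42],
]
print(ordering_consistency(scores_by_step))
```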

-<HtmlEmbed
+<HtmlEmbed
+  src="d3-two-lines-chart.html"
+  config={{
+    charts: [
+      {
+        title: "✅ Good ordering: xcsqa_ara_cf [ar]",
+        language: "Arabic",
+        task: "xcsqa_ara_cf",
+        metric: "acc_norm_token"
+      },
+      {
+        title: "❌ Bad ordering: thai_exams_tha_cf [th]",
+        language: "Thai",
+        task: "thai_exams_tha_cf:_average",
+        metric: "acc_norm_token"
+      }
+    ],
+    statLabel: "Kendall's Tau",
+    smoothing: true,
+    smoothingWindow: 5,
+    smoothingCurve: "monotoneX",
+    xAxisLabel: "Training Tokens (billions)",
+    yAxisLabel: "Score"
+  }}
+  frameless={true}
+/>


#### Metrics