tfrere HF Staff committed on
Commit
0cb5d1f
·
1 Parent(s): 1f59ee1

reupdate line charts in article

app/src/content/chapters/general-knowledge/picking-your-evaluation.mdx CHANGED
@@ -35,7 +35,32 @@ One of our core requirements for a task is that it can be learned from training
 To measure this, we used the **Spearman rank correlation** to quantify the correlation between steps and score. Spearman rank correlation can capture monotonicity even when scores don't evolve linearly with the number of steps. We required each task to have at least an average correlation of 0.5 over all model training runs.


- <HtmlEmbed src="finetasks-monotonicity.html" frameless={true} />
+ <HtmlEmbed
+   src="d3-two-lines-chart.html"
+   config={{
+     charts: [
+       {
+         title: "✅ Good monotonicity: mlmm_hellaswag_fra_cf [fr]",
+         language: "French",
+         task: "mlmm_hellaswag_fra_cf",
+         metric: "acc_norm_token"
+       },
+       {
+         title: "❌ Bad monotonicity: mlmm_truthfulqa_ara_cf:mc1 [ar]",
+         language: "Arabic",
+         task: "mlmm_truthfulqa_ara_cf:mc1",
+         metric: "acc_norm_token"
+       }
+     ],
+     statLabel: "Monotonicity",
+     smoothing: true,
+     smoothingWindow: 5,
+     smoothingCurve: "monotoneX",
+     xAxisLabel: "Training Tokens (billions)",
+     yAxisLabel: "Score"
+   }}
+   frameless={true}
+ />

 #### Low noise

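As an illustration of the monotonicity criterion described in the hunk above, here is a minimal sketch of the Spearman-based check. It assumes per-checkpoint scores are available as (step, score) pairs for each training run; the function and variable names are illustrative, not the article's actual code.

```python
# Minimal sketch of the monotonicity check: average Spearman rank correlation
# between training step and task score across runs (illustrative only).
from scipy.stats import spearmanr


def average_monotonicity(scores_per_run):
    """scores_per_run: dict mapping run name -> list of (step, score) pairs."""
    correlations = []
    for pairs in scores_per_run.values():
        steps, scores = zip(*sorted(pairs))
        rho, _ = spearmanr(steps, scores)
        correlations.append(rho)
    return sum(correlations) / len(correlations)


# A task is kept if its average correlation over all training runs is >= 0.5.
```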
@@ -51,7 +76,33 @@ For each task, we computed:

 We aimed for each task to have an SNR > 20. The only exception to this rule are generative tasks, which typically have relatively low SNR, but are still worth including as they provide insights into how the model behaves when prompted to generate unconstrained (without answer options). In a multilingual setting, this is particularly relevant as some models trained on multiple languages can exhibit high task scores but then suddenly reply in the wrong language for generative tasks!

- <HtmlEmbed src="finetasks-snr.html" frameless={true} />
+ <HtmlEmbed
+   src="d3-two-lines-chart.html"
+   config={{
+     charts: [
+       {
+         title: "✅ Good SNR: xstory_cloze_tel_cf [te]",
+         language: "Telugu",
+         task: "xstory_cloze_tel_cf",
+         metric: "acc_norm_token"
+       },
+       {
+         title: "❌ Bad SNR: tydiqa_tel [te]",
+         language: "Telugu",
+         task: "tydiqa_tel",
+         metric: "prefix_match"
+       }
+     ],
+     statLabel: "SNR",
+     groupSeeds: false,
+     smoothing: true,
+     smoothingWindow: 5,
+     smoothingCurve: "monotoneX",
+     xAxisLabel: "Training Tokens (billions)",
+     yAxisLabel: "Score"
+   }}
+   frameless={true}
+ />

 <Note>
 Assuming model performance is normally distributed across different seeds, we want the benchmark-run performance to be at least 3 final-stds above the benchmark random baseline. This would mean that 99.85% of seed scores are above the random baseline (formally, benchmark-run performance - benchmark random baseline > 3 * final-std).
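The noise criterion in this hunk can be sketched as follows. The separation check mirrors the formula quoted in the note (performance above the random baseline by more than 3 final standard deviations); the SNR expression used here, mean final score divided by the standard deviation across seed re-runs, is an assumption for illustration, since the exact definition sits in the list that the hunk header truncates.

```python
import statistics


def snr(final_scores_across_seeds):
    """Illustrative SNR: mean final score divided by the std across seed re-runs
    (assumed definition; the article lists the exact formula earlier)."""
    mean = statistics.mean(final_scores_across_seeds)
    std = statistics.stdev(final_scores_across_seeds)
    return mean / std


def clearly_above_random(benchmark_run_performance, random_baseline, final_std):
    """Check from the note: performance - random baseline > 3 * final-std."""
    return benchmark_run_performance - random_baseline > 3 * final_std


# Non-generative tasks were required to have SNR > 20.
```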
@@ -64,7 +115,32 @@ Many model capabilities are acquired later in training, thus **many tasks** (esp
 We first computed the baseline random performance of the task (as the sum of 1/n_choices for all samples for multiple choice questions, and as zero for generative evaluations). Then we calculated the task's distance from the baseline as the maximum score across all models minus the baseline.


- <HtmlEmbed src="finetasks-randomness.html" frameless={true} />
+ <HtmlEmbed
+   src="d3-two-lines-chart.html"
+   config={{
+     charts: [
+       {
+         title: "✅ Non-random: agieval_zho_cf/acc_pmi [zh]",
+         language: "Chinese",
+         task: "agieval_zho_cf:_average",
+         metric: "acc_norm_pmi"
+       },
+       {
+         title: "❌ Random perf: agieval_zho_cf/acc [zh]",
+         language: "Chinese",
+         task: "agieval_zho_cf:_average",
+         metric: "acc"
+       }
+     ],
+     statLabel: "Non-Randomness",
+     smoothing: true,
+     smoothingWindow: 5,
+     smoothingCurve: "monotoneX",
+     xAxisLabel: "Training Tokens (billions)",
+     yAxisLabel: "Score"
+   }}
+   frameless={true}
+ />

 #### Model Ordering Consistency

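A small sketch of the non-randomness computation described in the hunk above. Dividing the summed 1/n_choices by the number of samples, so that the baseline is comparable to an averaged accuracy score, is an assumption of this illustration.

```python
def random_baseline(n_choices_per_sample, generative=False):
    """Random-guessing baseline: average of 1/n_choices over samples for
    multiple-choice tasks, 0 for generative tasks."""
    if generative:
        return 0.0
    return sum(1 / n for n in n_choices_per_sample) / len(n_choices_per_sample)


def distance_from_baseline(scores_per_model, baseline):
    """Distance from randomness: best score across all models minus the baseline."""
    return max(scores_per_model) - baseline
```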
@@ -82,7 +158,32 @@ To measure this consistency in task ordering, we computed the average **Kendall'
 We had no strict minimum value requirement for this property, instead using it to establish comparisons between tasks.
 </Note>

- <HtmlEmbed src="finetasks-ordering.html" frameless={true} />
+ <HtmlEmbed
+   src="d3-two-lines-chart.html"
+   config={{
+     charts: [
+       {
+         title: "✅ Good ordering: xcsqa_ara_cf [ar]",
+         language: "Arabic",
+         task: "xcsqa_ara_cf",
+         metric: "acc_norm_token"
+       },
+       {
+         title: "❌ Bad ordering: thai_exams_tha_cf [th]",
+         language: "Thai",
+         task: "thai_exams_tha_cf:_average",
+         metric: "acc_norm_token"
+       }
+     ],
+     statLabel: "Kendall's Tau",
+     smoothing: true,
+     smoothingWindow: 5,
+     smoothingCurve: "monotoneX",
+     xAxisLabel: "Training Tokens (billions)",
+     yAxisLabel: "Score"
+   }}
+   frameless={true}
+ />


 #### Metrics
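Finally, the ordering-consistency statistic can be sketched as below. Comparing model rankings at consecutive checkpoints and averaging the resulting Kendall's Tau values is an assumed reading of the truncated description in the hunk header; the exact averaging scheme is given in the full article.

```python
# Illustrative ordering-consistency check: how stable is the ranking of models
# from one checkpoint to the next? (Assumed pairing of checkpoints.)
from scipy.stats import kendalltau


def ordering_consistency(scores_per_step):
    """scores_per_step: list of per-checkpoint score lists, one score per model,
    with models in the same order at every checkpoint."""
    taus = []
    for previous, current in zip(scores_per_step, scores_per_step[1:]):
        tau, _ = kendalltau(previous, current)
        taus.append(tau)
    return sum(taus) / len(taus)
```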
 