Clémentine committed
Commit 7ccc792 · 1 Parent(s): 6ef2a16
app/src/content/article.mdx CHANGED
@@ -91,8 +91,6 @@ Best (but rarest) metrics are functional or based on rule based verifiers (thoug
 
 ## Creating your own evaluation
 
-
-
 <DesigningAutomaticEvaluation />
 
 
app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx CHANGED
@@ -20,8 +20,6 @@ When aggregating datasets, pay attention to whether
 
 <Sidenote>Examples: MMLU, Big-Bench (hundreds of diverse tasks), and HELM (combines multiple existing benchmarks for holistic evaluation)</Sidenote>
 
- #### Creating a dataset manually
-
 <UsingHumanAnnotators />
 
 #### Creating a dataset synthetically
@@ -45,33 +43,6 @@ Once this is done, you can do an automatic validation by using a model from a di
 No matter how tempting it is to do everything automatically, you should always check your data at every step, to make sure your evaluations are high quality. Evaluation is the name of the game and you need to use extremely good data.
 </Note>
 
- #### Choosing a prompt
- The prompt is going to define:
- - how much information is given to your model about the task
- - how this information is presented to your model.
-
- A prompt for a general MCQA or QA is usually made of some of the following:
- - a task prompt (optional): introduces your task.
- - a context: provides additional context for your question.
-   - *Eg: For a summarization or information extraction task, you could provide a content source*
- - a question: the actual core of your prompt.
- - in case of a multi choice evaluation, you can add options
- - connector words (`Question`, `Context`, `Choice`, ...)
-
- When defining your prompt, you need to be aware that:
- - even small changes in semantically equivalent prompts can make the results vary by quite a lot (see Section `Different prompt` in [Troubleshooting reproducibility](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/troubleshooting/troubleshooting-reproducibility.md)), and prompt formats might advantage or disadvantage specific models
-   - How to mitigate this:
-     - A costly way is to re-run the evaluation several times with prompt variations
-     - A less costly way is to run your evaluation once using a range of prompt formats allocated to different samples of equivalent difficulty
- - you can provide examples to your model to help it follow the expected format (using few-shot examples), and adding connector words helps this overall
- - for a number of metrics, you want a very constrained generation or output.
-
- <Note title="Models can overfit prompt formats" emoji="⚠️" variant="warning">
-
- Recent research shows models can overfit specific prompt formats rather than learning the underlying task. [This paper](https://arxiv.org/abs/2407.07890) is great on the topic, showing notably how some models can be over-evaluated because they have overfitted the test set **format**.
- On the Open LLM Leaderboard 2, we've notably observed that Llama 3.2 and Qwen 2.5 are no longer following the format of the prompt provided in a few-shot setup for this reason.
- </Note>
-
 #### Managing contamination
 In general, you should assume that a dataset publicly available on the internet is or will be contaminated.
 
@@ -83,6 +54,21 @@ Solutions to mitigate this include:
 
 However, even if a dataset is contaminated, it can still be interesting and have signal during training, as we saw in the ablations section.
 
+ ### Choosing a prompt
+ The prompt is going to define how much information is given to your model about the task, and how this information is presented to the model.
+
+ A prompt for a general MCQA or QA is usually made of some of the following:
+ - a task prompt (optional): introduces your task.
+ - a context: provides additional context for your question.
+   - *Eg: For a summarization or information extraction task, you could provide a content source*
+ - a question: the actual core of your prompt.
+ - in case of a multi choice evaluation, you can add options
+ - connector words (`Question`, `Context`, `Choice`, ...)
+
+ When defining your prompt, you need to be aware that even small changes in semantically equivalent prompts can make the results vary by quite a lot, and prompt formats might advantage or disadvantage specific models (see [this section](https://huggingface.co/spaces/OpenEvals/evaluation-guidebook#different-prompt)).
+
+ ➡️ This can be mitigated by re-running the evaluation several times with prompt variations (but it can be costly), or simply running your evaluation once using a range of prompt formats allocated to different samples of equivalent difficulty.
+ ➡️ You can also provide examples to your model to help it follow the expected format (using few-shot examples), and adding connector words helps this overall.
 
 ### Choosing an inference method for your model
 You'll need to choose what kind of inference method you need.
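As an aside on the "Choosing a prompt" section added above, here is a minimal sketch of how such an MCQA prompt could be assembled from the listed pieces (task prompt, optional context, question, options, connector words, few-shot examples). This is illustrative Python only, not code from the guidebook; every function and field name is made up.

```python
# Illustrative only: assemble an MCQA prompt from a task prompt, optional context,
# a question, lettered options, and connector words ("Context", "Question", "Answer").

def format_sample(question, choices, context=None):
    lines = []
    if context:
        lines.append(f"Context: {context}")        # connector word + context
    lines.append(f"Question: {question}")          # connector word + question
    for letter, choice in zip("ABCD", choices):    # options for a multi-choice task
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")                        # cue for a constrained answer
    return "\n".join(lines)

def build_mcqa_prompt(question, choices, context=None, task_prompt=None, few_shot=()):
    parts = []
    if task_prompt:                                # optional task introduction
        parts.append(task_prompt)
    for ex in few_shot:                            # few-shot examples help the model follow the format
        parts.append(format_sample(ex["question"], ex["choices"], ex.get("context")) + " " + ex["answer"])
    parts.append(format_sample(question, choices, context))
    return "\n\n".join(parts)
```

Running the same samples through a few such format variants (for example `A.` vs `(A)` vs bare options) is one cheap way to surface the prompt-format sensitivity mentioned above.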
app/src/content/chapters/general-knowledge/model-inference-and-evaluation.mdx CHANGED
@@ -113,24 +113,28 @@ Different tokenizers behave differently with spacing and special tokens. See thi
 
 When looking at an MCQA evaluation, in general, you want to tokenize the context together with the choices, as it creates a succession of tokens which is likely/natural for the model.
 
- However, some tokenizers (like the [Llama one](https://github.com/EleutherAI/lm-evaluation-harness/pull/531#issuecomment-1595586257)) do not satisfy `enc(context + choice) = enc(context) + enc(choice)` (and add or remove spacing). This means that comparing the logprobabilities of the choices is not easy, as the context tokens can "bleed out" into them, messing up the comparison.
-
- <Sidenote>
-
- The [Llama tokenizer](https://github.com/EleutherAI/lm-evaluation-harness/pull/531#issuecomment-1595586257) doesn't satisfy `enc(context + choice) = enc(context) + enc(choice)`, making log probability comparisons tricky. Tokenize separately and concatenate, removing special tokens.
- </Sidenote>
-
- So if this is the case for your model, you might want to compute the tokens of context and choice separately and then concatenate them after removing the special start/end of sentence tokens which might have been added.
+ <Note title="Should you tokenize the context with the choices always?">
+ Some tokenizers (like the [Llama one](https://github.com/EleutherAI/lm-evaluation-harness/pull/531#issuecomment-1595586257)) do not satisfy `enc(context + choice) = enc(context) + enc(choice)` (and add or remove spacing). This means that comparing the logprobabilities of the choices only is not trivial, as the context tokens can "bleed out" into them, messing up the comparison.
+
+ To give a concrete example, say you have characters `C1`, `C2`, and `C3` as base tokens of your vocabulary, and `C1C2` also happens to be a single token learned during BPE.
+
+ Say your context is C1, and the choices C2 and C3.
+ If you tokenize the context with the choices, you compare `C1C2` (one token) with `C1+C3` (two tokens). Even if you normalize the logprobs by length, you are not comparing the same thing.
+ Comparing after tokenizing the context and choices separately means you compare `C1+C2` and `C1+C3`. But since `C1C2` is a token, the occurrence of `C1+C2` is likely rare in the data your encoder saw, so it is an unlikely succession for your model, which can mess up your logprobabilities.
+
+ If this is the case for your model, the solution is usually to go for the least bad option and compare what is comparable: compute the tokens of context and choice separately, then concatenate them after removing the special start/end of sentence tokens which might have been added.
+ </Note>
 
+
 **Paying attention to start and end of sentence tokens**
 
- Some models, like the `Gemma` ones, are extremely sensitive to the [inclusion of start of sentence tokens](https://github.com/EleutherAI/lm-evaluation-harness/pull/1465) at inference. You might need to do a couple of experiments to see if that happens for you, and add these tokens manually when evaluating.
+ Some pretrained models, like the `Gemma` ones, are extremely sensitive to the [inclusion of start of sentence tokens](https://github.com/EleutherAI/lm-evaluation-harness/pull/1465) at inference. You might need to do a couple of experiments to see if that happens for you, and add these tokens manually when evaluating.
 
 You can also encounter some issues where your model won't stop on an end of sentence token like you would expect (for example, on `\n`), because your model will not predict this token alone but included in a higher-level token (for example, `\n\n`, which can be a single token, especially for code models). In this case, you might need to add a specific check to "backtrack" on generated text to make sure you're cutting your generated sentence at the proper spot before computing metrics.
 
 **Multilinguality and tokenization**
 
- When looking at multilingual evaluations, you'll also need to see how to tokenize your text, depending on your evaluation task and metrics. As some languages do not always use spacing as a word separator (Korean, Thai, Japanese, Chinese, to cite a few), they will require language-specific tokenizers to be split properly, else it will affect their scores on metrics such as [BLEU](https://github.com/EleutherAI/lm-evaluation-harness/issues/212), F1 scores, etc.
+ When looking at multilingual evaluations, you'll also need to see how to tokenize your text, depending on your evaluation task and metrics. As some languages do not always use spacing as a word separator (Korean, Thai, Japanese, Chinese, to cite a few), they will require language-specific tokenizers to be split properly, else it will affect their scores on metrics such as [BLEU](https://github.com/EleutherAI/lm-evaluation-harness/issues/212), F1 scores, etc. The number of tokens that the model is allowed to generate for an evaluation should also be language-dependent, as not all languages are tokenized into a similar number of tokens (go back to the tokenization section to see why).
 
 **Code evaluations and end of sentence tokens**
 
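To make the tokenization caveat above concrete, here is a small Python sketch assuming the Hugging Face `transformers` tokenizer API; the model name and example strings are placeholders, and this is not code from the article.

```python
# Sketch: check whether enc(context + choice) == enc(context) + enc(choice),
# and fall back to tokenizing separately (without special tokens) when it does not hold.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder model name

context = "Question: What is the capital of France?\nAnswer:"
choices = [" Paris", " London", " Berlin"]

for choice in choices:
    joint = tok.encode(context + choice, add_special_tokens=False)
    split = tok.encode(context, add_special_tokens=False) + tok.encode(choice, add_special_tokens=False)
    if joint != split:
        print(f"enc(context + choice) != enc(context) + enc(choice) for {choice!r}")
    # The "least bad" comparable input: context and choice tokenized separately,
    # without special tokens, then concatenated. The model is then asked for the
    # logprobs of the last len(choice_ids) tokens for each choice.
    choice_ids = tok.encode(choice, add_special_tokens=False)
    input_ids = tok.encode(context, add_special_tokens=False) + choice_ids
```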
app/src/content/embeds/d3-evaluation-decision-tree.html ADDED
@@ -0,0 +1,333 @@
+ <div class="d3-evaluation-tree"></div>
+ <style>
+ .d3-evaluation-tree {
+   position: relative;
+   width: 100%;
+   min-height: 500px;
+   overflow: visible;
+ }
+ .d3-evaluation-tree svg {
+   display: block;
+   width: 100%;
+   height: auto;
+ }
+ .d3-evaluation-tree .node-rect {
+   stroke-width: 2;
+   rx: 8;
+   ry: 8;
+   cursor: pointer;
+   transition: all 0.2s ease;
+ }
+ .d3-evaluation-tree .decision-node {
+   stroke: var(--border-color);
+ }
+ .d3-evaluation-tree .result-node {
+   stroke: var(--border-color);
+ }
+ .d3-evaluation-tree .warning-node {
+   stroke: var(--border-color);
+ }
+ .d3-evaluation-tree .node-text {
+   fill: var(--text-color);
+   font-size: 12px;
+   font-weight: 500;
+   pointer-events: none;
+   user-select: none;
+ }
+ .d3-evaluation-tree .link {
+   fill: none;
+   stroke: var(--border-color);
+   stroke-width: 1.5;
+   opacity: 0.5;
+ }
+ .d3-evaluation-tree .link-label {
+   fill: var(--muted-color);
+   font-size: 10px;
+   font-weight: 500;
+ }
+ .d3-evaluation-tree .node-rect:hover {
+   filter: brightness(1.05);
+   stroke-width: 3;
+ }
+ .d3-evaluation-tree .d3-tooltip {
+   position: absolute;
+   top: 0;
+   left: 0;
+   transform: translate(-9999px, -9999px);
+   pointer-events: none;
+   padding: 8px 10px;
+   border-radius: 8px;
+   font-size: 12px;
+   line-height: 1.35;
+   border: 1px solid var(--border-color);
+   background: var(--surface-bg);
+   color: var(--text-color);
+   box-shadow: 0 4px 24px rgba(0,0,0,.18);
+   opacity: 0;
+   transition: opacity .12s ease;
+   max-width: 250px;
+ }
+ </style>
+ <script>
+ (() => {
+   const ensureD3 = (cb) => {
+     if (window.d3 && typeof window.d3.select === 'function') return cb();
+     let s = document.getElementById('d3-cdn-script');
+     if (!s) {
+       s = document.createElement('script');
+       s.id = 'd3-cdn-script';
+       s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
+       document.head.appendChild(s);
+     }
+     const onReady = () => {
+       if (window.d3 && typeof window.d3.select === 'function') cb();
+     };
+     s.addEventListener('load', onReady, { once: true });
+     if (window.d3) onReady();
+   };
+
+   const bootstrap = () => {
+     const scriptEl = document.currentScript;
+     let container = scriptEl ? scriptEl.previousElementSibling : null;
+     if (!(container && container.classList && container.classList.contains('d3-evaluation-tree'))) {
+       const candidates = Array.from(document.querySelectorAll('.d3-evaluation-tree'))
+         .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
+       container = candidates[candidates.length - 1] || null;
+     }
+     if (!container) return;
+     if (container.dataset) {
+       if (container.dataset.mounted === 'true') return;
+       container.dataset.mounted = 'true';
+     }
+
+     // Tooltip setup
+     container.style.position = container.style.position || 'relative';
+     let tip = container.querySelector('.d3-tooltip');
+     let tipInner;
+     if (!tip) {
+       tip = document.createElement('div');
+       tip.className = 'd3-tooltip';
+       tipInner = document.createElement('div');
+       tipInner.className = 'd3-tooltip__inner';
+       tipInner.style.textAlign = 'left';
+       tip.appendChild(tipInner);
+       container.appendChild(tip);
+     } else {
+       tipInner = tip.querySelector('.d3-tooltip__inner') || tip;
+     }
+
+     // Get colors from ColorPalettes with fallback
+     const getColors = () => {
+       if (window.ColorPalettes && window.ColorPalettes.getColors) {
+         return {
+           decision: window.ColorPalettes.getColors('sequential', 3)[0],
+           result: window.ColorPalettes.getColors('sequential', 3)[2],
+           warning: window.ColorPalettes.getColors('diverging', 3)[1]
+         };
+       }
+       // Fallback colors
+       return {
+         decision: '#60A5FA',
+         result: '#34D399',
+         warning: '#FBBF24'
+       };
+     };
+
+     // Define the decision tree structure
+     const treeData = {
+       name: "What are you\nevaluating?",
+       type: "decision",
+       tooltip: "Starting point: Identify your evaluation task",
+       children: [
+         {
+           name: "Have gold\nstandard?",
+           edgeLabel: "Start",
+           type: "decision",
+           tooltip: "Do you have a clear, correct reference answer?",
+           children: [
+             {
+               name: "Objective &\nverifiable?",
+               edgeLabel: "Yes",
+               type: "decision",
+               tooltip: "Is the answer factual and unambiguous?",
+               children: [
+                 {
+                   name: "Format\nconstrained?",
+                   edgeLabel: "Yes",
+                   type: "decision",
+                   tooltip: "Can you verify output structure programmatically?",
+                   children: [
+                     {
+                       name: "Functional\nTesting",
+                       edgeLabel: "Yes",
+                       type: "result",
+                       tooltip: "Use IFEval-style functional tests or unit tests"
+                     },
+                     {
+                       name: "Automated\nMetrics",
+                       edgeLabel: "No",
+                       type: "result",
+                       tooltip: "Use exact match, F1, BLEU, etc."
+                     }
+                   ]
+                 }
+               ]
+             },
+             {
+               name: "Human Eval\nor Judges",
+               edgeLabel: "Subjective",
+               type: "warning",
+               tooltip: "Multiple valid answers exist; need human judgment or model judges"
+             }
+           ]
+         },
+         {
+           name: "Budget &\nscale?",
+           edgeLabel: "No gold",
+           type: "decision",
+           tooltip: "No reference answer available",
+           children: [
+             {
+               name: "Expert Human\nAnnotators",
+               edgeLabel: "High",
+               type: "result",
+               tooltip: "Best for critical use cases (medical, legal)"
+             },
+             {
+               name: "Model Judges\n(validate!)",
+               edgeLabel: "Medium",
+               type: "warning",
+               tooltip: "Validate judge quality against human baseline"
+             },
+             {
+               name: "Arena or\nVibe-checks",
+               edgeLabel: "Low",
+               type: "warning",
+               tooltip: "Crowdsourced or exploratory evaluation"
+             }
+           ]
+         }
+       ]
+     };
+
+     // SVG setup
+     const svg = d3.select(container).append('svg');
+     const g = svg.append('g').attr('transform', 'translate(40, 30)');
+
+     let width = container.clientWidth || 900;
+     const nodeWidth = 140;
+     const nodeHeight = 50;
+
+     function render() {
+       const colors = getColors();
+       width = container.clientWidth || 900;
+
+       const treeLayout = d3.tree()
+         .size([width - 80, 500])
+         .separation((a, b) => (a.parent === b.parent ? 1.3 : 1.6));
+
+       const root = d3.hierarchy(treeData);
+       const treeNodes = treeLayout(root);
+
+       const maxDepth = root.height;
+       const height = (maxDepth + 1) * 120 + 60;
+
+       svg.attr('viewBox', `0 0 ${width} ${height}`)
+         .attr('preserveAspectRatio', 'xMidYMin meet');
+
+       // Clear previous
+       g.selectAll('*').remove();
+
+       // Links
+       g.selectAll('.link')
+         .data(treeNodes.links())
+         .join('path')
+         .attr('class', 'link')
+         .attr('d', d3.linkVertical()
+           .x(d => d.x)
+           .y(d => d.y)
+         );
+
+       // Link labels
+       g.selectAll('.link-label')
+         .data(treeNodes.links().filter(d => d.target.data.edgeLabel))
+         .join('text')
+         .attr('class', 'link-label')
+         .attr('x', d => d.target.x)
+         .attr('y', d => (d.source.y + d.target.y) / 2 - 5)
+         .attr('text-anchor', 'middle')
+         .text(d => d.target.data.edgeLabel);
+
+       // Node groups
+       const nodes = g.selectAll('.node')
+         .data(treeNodes.descendants())
+         .join('g')
+         .attr('class', 'node')
+         .attr('transform', d => `translate(${d.x},${d.y})`)
+         .on('mouseenter', function(event, d) {
+           if (d.data.tooltip) {
+             const [mx, my] = d3.pointer(event, container);
+             tip.style.opacity = '1';
+             tip.style.transform = `translate(${mx + 10}px, ${my - 10}px)`;
+             tipInner.textContent = d.data.tooltip;
+           }
+         })
+         .on('mouseleave', function() {
+           tip.style.opacity = '0';
+           tip.style.transform = 'translate(-9999px, -9999px)';
+         });
+
+       // Rectangles
+       nodes.append('rect')
+         .attr('class', d => {
+           if (d.data.type === 'result') return 'node-rect result-node';
+           if (d.data.type === 'warning') return 'node-rect warning-node';
+           return 'node-rect decision-node';
+         })
+         .attr('x', -nodeWidth / 2)
+         .attr('y', -nodeHeight / 2)
+         .attr('width', nodeWidth)
+         .attr('height', nodeHeight)
+         .attr('fill', d => {
+           if (d.data.type === 'result') return colors.result;
+           if (d.data.type === 'warning') return colors.warning;
+           return colors.decision;
+         });
+
+       // Text (multiline support)
+       nodes.each(function(d) {
+         const nodeG = d3.select(this);
+         const lines = d.data.name.split('\n');
+         const lineHeight = 14;
+         const startY = -(lines.length - 1) * lineHeight / 2;
+
+         lines.forEach((line, i) => {
+           nodeG.append('text')
+             .attr('class', 'node-text')
+             .attr('text-anchor', 'middle')
+             .attr('y', startY + i * lineHeight)
+             .attr('dy', '0.35em')
+             .text(line);
+         });
+       });
+     }
+
+     // Initial render
+     render();
+
+     // Responsive resize
+     if (window.ResizeObserver) {
+       const ro = new ResizeObserver(() => render());
+       ro.observe(container);
+     } else {
+       window.addEventListener('resize', render);
+     }
+   };
+
+   if (document.readyState === 'loading') {
+     document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
+   } else {
+     ensureD3(bootstrap);
+   }
+ })();
+ </script>