Clémentine commited on
Commit
1ce7bab
·
1 Parent(s): 0cb5d1f
app/src/content/article.mdx CHANGED
@@ -78,9 +78,7 @@ TODO: ADD A VIEWER
78
 
79
  #### Task and metrics
80
 
81
- You want to check what metrics are used: are they automatic, functional, or using a model judge? The answer will change the cost of running evaluations for you, as well as the reproducibility and bias type.
82
-
83
- Best (but rarest) metrics are functional or based on rule based verifiers (though beware of pass/fail for coding models and code evaluations, as recent LLMs have become very good at overwriting globals to 'cheat' on such tests, especially in languages like Python where you can mess up variable scope).
84
 
85
  ### So, you can't reproduce reported model scores?
86
 
 
78
 
79
  #### Task and metrics
80
 
81
+ You want to check what metrics are used: are they automatic, functional, or using a model judge? The answer will change the cost of running evaluations for you, as well as the reproducibility and bias type. Best (but rarest) metrics are functional or based on rule based verifiers <Sidenote> When doing code evals, beware of too easy pass/fail unit tests! Recent LLMs have become very good at overwriting globals to 'cheat', especially in languages like Python where you can mess up variable scope.</Sidenote>
 
 
82
 
83
  ### So, you can't reproduce reported model scores?
84
 
app/src/content/assets/image/mmlu_prompt.png ADDED

Git LFS Details

  • SHA256: a5563b9b68413b080ed2d221f359884bc1c1e1cbc3b95aaa7f21a4c508b9fd14
  • Pointer size: 130 Bytes
  • Size of remote file: 67.6 kB
app/src/content/chapters/troubleshooting/troubleshooting-reproducibility.mdx CHANGED
@@ -4,6 +4,8 @@ title: "Troubleshooting reproducibility"
4
 
5
  import Note from "../../../components/Note.astro";
6
  import Sidenote from "../../../components/Sidenote.astro";
 
 
7
 
8
  Let's say you have read a recent tech report about a cool new model, and you want to reproduce their results on your machine... but you're not managing to?
9
  Let's explore why.
@@ -51,37 +53,20 @@ We've observed that the following were easy things to mess up, even when using t
51
 
52
  The format you are using for the prompt can and will change scores wildly.
53
 
54
- For example, for multichoice question answers, some common formats include very simple variations when presenting the choices, such as:
55
- ```
56
- Question: <text of the question>
57
- Choices:
58
- ```
59
- ```markdown
60
- | A. <Choice A> | (A) <Choice A> | <Choice A> |
61
- | B. <Choice B> | (B) <Choice B> | <Choice B> |
62
- | C. <Choice C> | (C) <Choice C> | <Choice C> |
63
- | D. <Choice D> | (D) <Choice D> | <Choice D> |
64
- ```
65
- ```
66
- Answer:
67
- ```
68
- and predicting either `A`/`B`/`C`/`D` or `<Choice A/B/C/D>`.
69
-
70
- These prompts are **semantically equivalent**, as they contain the exact same content - but they can still result in a difference of *several points for the same model*.
71
-
72
 
73
  <Note title="Prompt format sensitivity" emoji="📝" variant="danger">
74
- We did some experiments on this [here](https://x.com/clefourrier/status/1777319187913875893/photo/1) (you'll see up to a 7 points difference for the same model) and a [paper observed similar results](https://arxiv.org/abs/2310.11324).
75
 
76
- Semantically identical prompts can cause 7+ point score differences!
77
 
78
- Even tiny formatting variations (like `A.` vs `(A)` vs just listing choices) significantly impact scores. Models increasingly overfit to specific benchmark prompt formats during training, losing adaptation ability.
79
 
80
- **Real example**: Llama 3.1 models predicted correct MATH-Hard answers but scored poorly because they overfit to GSM8K's prompt format and couldn't adapt to different few-shot templates.
81
- This is something we observed on the Open LLM Leaderboard 2 for the Llama3.1 models. They were predicting the correct answers to our MATH-Hard evaluations, but were getting low scores, being unable to fit to the template provided in few-shot because they overfit the GSM8K prompt and answer format (another math eval).
82
- </Note>
83
 
84
  This [great paper](https://arxiv.org/abs/2407.07890)⭐ also highlights a side effect of this: a number of models are now trained to overfit benchmark prompts and answer formats, to the cost of adaptation to other prompts at evaluation time.
 
85
 
86
  Some tasks are also prefixed with a task prompt (eg: `The following questions are about <topic>`) - its presence or absence will also affect the scores.
87
 
 
4
 
5
  import Note from "../../../components/Note.astro";
6
  import Sidenote from "../../../components/Sidenote.astro";
7
+ import { Image } from "astro:assets";
8
+ import mmluPromptImage from "../../assets/image/mmlu_prompt.png";
9
 
10
  Let's say you have read a recent tech report about a cool new model, and you want to reproduce their results on your machine... but you're not managing to?
11
  Let's explore why.
 
53
 
54
  The format you are using for the prompt can and will change scores wildly.
55
 
56
+ For example, for multichoice question answers, common formats include very simple variations (e.g. using `A` vs `A.` vs `A)` to introduce choices), which, while **semantically equivalent** (as they contain the exact same content) can still result in difference of *several points for the same model*.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
57
 
58
  <Note title="Prompt format sensitivity" emoji="📝" variant="danger">
 
59
 
60
+ <Image src={mmluPromptImage} alt="Heatmap showing MMLU evaluation scores across different models (Mistral-7B, Qwen1.5-7B, gemma-7b, phi-2, DeciLM-7B) with different prompt formats. Scores vary by up to 7 points for the same model depending on format." />
61
 
62
+ We did some experiments on this (you'll see up to a 7 points difference for the same model on the semantically equivalent prompts, the 5 rightmost columns).
63
 
64
+ A [paper observed similar results](https://arxiv.org/abs/2310.11324): models increasingly overfit to specific benchmark prompt formats during training, losing adaptation ability.
65
+
66
+ **Other example**: Llama 3.1 models predicted correct MATH-Hard answers but scored poorly on the Open LLM Leaderboard, because they overfit to GSM8K's prompt format and couldn't adapt to the new one for this eval, despite it being provided in few shot examples.
67
 
68
  This [great paper](https://arxiv.org/abs/2407.07890)⭐ also highlights a side effect of this: a number of models are now trained to overfit benchmark prompts and answer formats, to the cost of adaptation to other prompts at evaluation time.
69
+ </Note>
70
 
71
  Some tasks are also prefixed with a task prompt (eg: `The following questions are about <topic>`) - its presence or absence will also affect the scores.
72