Clémentine
commited on
Commit
·
1ce7bab
1
Parent(s):
0cb5d1f
up
Browse files
app/src/content/article.mdx
CHANGED
|
@@ -78,9 +78,7 @@ TODO: ADD A VIEWER
|
|
| 78 |
|
| 79 |
#### Task and metrics
|
| 80 |
|
| 81 |
-
You want to check what metrics are used: are they automatic, functional, or using a model judge? The answer will change the cost of running evaluations for you, as well as the reproducibility and bias type.
|
| 82 |
-
|
| 83 |
-
Best (but rarest) metrics are functional or based on rule based verifiers (though beware of pass/fail for coding models and code evaluations, as recent LLMs have become very good at overwriting globals to 'cheat' on such tests, especially in languages like Python where you can mess up variable scope).
|
| 84 |
|
| 85 |
### So, you can't reproduce reported model scores?
|
| 86 |
|
|
|
|
| 78 |
|
| 79 |
#### Task and metrics
|
| 80 |
|
| 81 |
+
You want to check what metrics are used: are they automatic, functional, or using a model judge? The answer will change the cost of running evaluations for you, as well as the reproducibility and bias type. Best (but rarest) metrics are functional or based on rule based verifiers <Sidenote> When doing code evals, beware of too easy pass/fail unit tests! Recent LLMs have become very good at overwriting globals to 'cheat', especially in languages like Python where you can mess up variable scope.</Sidenote>
|
|
|
|
|
|
|
| 82 |
|
| 83 |
### So, you can't reproduce reported model scores?
|
| 84 |
|
app/src/content/assets/image/mmlu_prompt.png
ADDED
|
Git LFS Details
|
app/src/content/chapters/troubleshooting/troubleshooting-reproducibility.mdx
CHANGED
|
@@ -4,6 +4,8 @@ title: "Troubleshooting reproducibility"
|
|
| 4 |
|
| 5 |
import Note from "../../../components/Note.astro";
|
| 6 |
import Sidenote from "../../../components/Sidenote.astro";
|
|
|
|
|
|
|
| 7 |
|
| 8 |
Let's say you have read a recent tech report about a cool new model, and you want to reproduce their results on your machine... but you're not managing to?
|
| 9 |
Let's explore why.
|
|
@@ -51,37 +53,20 @@ We've observed that the following were easy things to mess up, even when using t
|
|
| 51 |
|
| 52 |
The format you are using for the prompt can and will change scores wildly.
|
| 53 |
|
| 54 |
-
For example, for multichoice question answers,
|
| 55 |
-
```
|
| 56 |
-
Question: <text of the question>
|
| 57 |
-
Choices:
|
| 58 |
-
```
|
| 59 |
-
```markdown
|
| 60 |
-
| A. <Choice A> | (A) <Choice A> | <Choice A> |
|
| 61 |
-
| B. <Choice B> | (B) <Choice B> | <Choice B> |
|
| 62 |
-
| C. <Choice C> | (C) <Choice C> | <Choice C> |
|
| 63 |
-
| D. <Choice D> | (D) <Choice D> | <Choice D> |
|
| 64 |
-
```
|
| 65 |
-
```
|
| 66 |
-
Answer:
|
| 67 |
-
```
|
| 68 |
-
and predicting either `A`/`B`/`C`/`D` or `<Choice A/B/C/D>`.
|
| 69 |
-
|
| 70 |
-
These prompts are **semantically equivalent**, as they contain the exact same content - but they can still result in a difference of *several points for the same model*.
|
| 71 |
-
|
| 72 |
|
| 73 |
<Note title="Prompt format sensitivity" emoji="📝" variant="danger">
|
| 74 |
-
We did some experiments on this [here](https://x.com/clefourrier/status/1777319187913875893/photo/1) (you'll see up to a 7 points difference for the same model) and a [paper observed similar results](https://arxiv.org/abs/2310.11324).
|
| 75 |
|
| 76 |
-
|
| 77 |
|
| 78 |
-
|
| 79 |
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
|
| 83 |
|
| 84 |
This [great paper](https://arxiv.org/abs/2407.07890)⭐ also highlights a side effect of this: a number of models are now trained to overfit benchmark prompts and answer formats, to the cost of adaptation to other prompts at evaluation time.
|
|
|
|
| 85 |
|
| 86 |
Some tasks are also prefixed with a task prompt (eg: `The following questions are about <topic>`) - its presence or absence will also affect the scores.
|
| 87 |
|
|
|
|
| 4 |
|
| 5 |
import Note from "../../../components/Note.astro";
|
| 6 |
import Sidenote from "../../../components/Sidenote.astro";
|
| 7 |
+
import { Image } from "astro:assets";
|
| 8 |
+
import mmluPromptImage from "../../assets/image/mmlu_prompt.png";
|
| 9 |
|
| 10 |
Let's say you have read a recent tech report about a cool new model, and you want to reproduce their results on your machine... but you're not managing to?
|
| 11 |
Let's explore why.
|
|
|
|
| 53 |
|
| 54 |
The format you are using for the prompt can and will change scores wildly.
|
| 55 |
|
| 56 |
+
For example, for multichoice question answers, common formats include very simple variations (e.g. using `A` vs `A.` vs `A)` to introduce choices), which, while **semantically equivalent** (as they contain the exact same content) can still result in difference of *several points for the same model*.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 57 |
|
| 58 |
<Note title="Prompt format sensitivity" emoji="📝" variant="danger">
|
|
|
|
| 59 |
|
| 60 |
+
<Image src={mmluPromptImage} alt="Heatmap showing MMLU evaluation scores across different models (Mistral-7B, Qwen1.5-7B, gemma-7b, phi-2, DeciLM-7B) with different prompt formats. Scores vary by up to 7 points for the same model depending on format." />
|
| 61 |
|
| 62 |
+
We did some experiments on this (you'll see up to a 7 points difference for the same model on the semantically equivalent prompts, the 5 rightmost columns).
|
| 63 |
|
| 64 |
+
A [paper observed similar results](https://arxiv.org/abs/2310.11324): models increasingly overfit to specific benchmark prompt formats during training, losing adaptation ability.
|
| 65 |
+
|
| 66 |
+
**Other example**: Llama 3.1 models predicted correct MATH-Hard answers but scored poorly on the Open LLM Leaderboard, because they overfit to GSM8K's prompt format and couldn't adapt to the new one for this eval, despite it being provided in few shot examples.
|
| 67 |
|
| 68 |
This [great paper](https://arxiv.org/abs/2407.07890)⭐ also highlights a side effect of this: a number of models are now trained to overfit benchmark prompts and answer formats, to the cost of adaptation to other prompts at evaluation time.
|
| 69 |
+
</Note>
|
| 70 |
|
| 71 |
Some tasks are also prefixed with a task prompt (eg: `The following questions are about <topic>`) - its presence or absence will also affect the scores.
|
| 72 |
|