evaluation-guidebook

Running

App Files Files Community

Clémentine commited on 10 days ago

Commit

1ce7bab

1 Parent(s): 0cb5d1f

up

Browse files

Files changed (3) hide show

app/src/content/article.mdx +1 -3
app/src/content/assets/image/mmlu_prompt.png +3 -0
app/src/content/chapters/troubleshooting/troubleshooting-reproducibility.mdx +9 -24

app/src/content/article.mdx CHANGED Viewed

@@ -78,9 +78,7 @@ TODO: ADD A VIEWER
 #### Task and metrics
-You want to check what metrics are used: are they automatic, functional, or using a model judge? The answer will change the cost of running evaluations for you, as well as the reproducibility and bias type.
-Best (but rarest) metrics are functional or based on rule based verifiers (though beware of pass/fail for coding models and code evaluations, as recent LLMs have become very good at overwriting globals to 'cheat' on such tests, especially in languages like Python where you can mess up variable scope).
 ### So, you can't reproduce reported model scores?

 #### Task and metrics
+You want to check what metrics are used: are they automatic, functional, or using a model judge? The answer will change the cost of running evaluations for you, as well as the reproducibility and bias type. Best (but rarest) metrics are functional or based on rule based verifiers <Sidenote> When doing code evals, beware of too easy pass/fail unit tests! Recent LLMs have become very good at overwriting globals to 'cheat', especially in languages like Python where you can mess up variable scope.</Sidenote>
 ### So, you can't reproduce reported model scores?

app/src/content/assets/image/mmlu_prompt.png ADDED Viewed

Git LFS Details

SHA256: a5563b9b68413b080ed2d221f359884bc1c1e1cbc3b95aaa7f21a4c508b9fd14
Pointer size: 130 Bytes
Size of remote file: 67.6 kB

app/src/content/chapters/troubleshooting/troubleshooting-reproducibility.mdx CHANGED Viewed

@@ -4,6 +4,8 @@ title: "Troubleshooting reproducibility"
 import Note from "../../../components/Note.astro";
 import Sidenote from "../../../components/Sidenote.astro";
 Let's say you have read a recent tech report about a cool new model, and you want to reproduce their results on your machine... but you're not managing to?
 Let's explore why.
@@ -51,37 +53,20 @@ We've observed that the following were easy things to mess up, even when using t
 The format you are using for the prompt can and will change scores wildly.
-For example, for multichoice question answers, some common formats include very simple variations when presenting the choices, such as:
-```
-Question: <text of the question>
-Choices:
-```
-```markdown
-| A. <Choice A> | (A) <Choice A> | <Choice A> |
-| B. <Choice B> | (B) <Choice B> | <Choice B> |
-| C. <Choice C> | (C) <Choice C> | <Choice C> |
-| D. <Choice D> | (D) <Choice D> | <Choice D> |
-```
-```
-Answer:
-```
-and predicting either `A`/`B`/`C`/`D` or `<Choice A/B/C/D>`.
-These prompts are **semantically equivalent**, as they contain the exact same content - but they can still result in a difference of *several points for the same model*.
 <Note title="Prompt format sensitivity" emoji="📝" variant="danger">
-We did some experiments on this [here](https://x.com/clefourrier/status/1777319187913875893/photo/1) (you'll see up to a 7 points difference for the same model) and a [paper observed similar results](https://arxiv.org/abs/2310.11324).
-Semantically identical prompts can cause 7+ point score differences!
-Even tiny formatting variations (like `A.` vs `(A)` vs just listing choices) significantly impact scores. Models increasingly overfit to specific benchmark prompt formats during training, losing adaptation ability.
-**Real example**: Llama 3.1 models predicted correct MATH-Hard answers but scored poorly because they overfit to GSM8K's prompt format and couldn't adapt to different few-shot templates.
-This is something we observed on the Open LLM Leaderboard 2 for the Llama3.1 models. They were predicting the correct answers to our MATH-Hard evaluations, but were getting low scores, being unable to fit to the template provided in few-shot because they overfit the GSM8K prompt and answer format (another math eval).
-</Note>
 This [great paper](https://arxiv.org/abs/2407.07890)⭐ also highlights a side effect of this: a number of models are now trained to overfit benchmark prompts and answer formats, to the cost of adaptation to other prompts at evaluation time.
 Some tasks are also prefixed with a task prompt (eg: `The following questions are about <topic>`) - its presence or absence will also affect the scores.

 import Note from "../../../components/Note.astro";
 import Sidenote from "../../../components/Sidenote.astro";
+import { Image } from "astro:assets";
+import mmluPromptImage from "../../assets/image/mmlu_prompt.png";
 Let's say you have read a recent tech report about a cool new model, and you want to reproduce their results on your machine... but you're not managing to?
 Let's explore why.
 The format you are using for the prompt can and will change scores wildly.
+For example, for multichoice question answers, common formats include very simple variations (e.g. using `A` vs `A.` vs `A)` to introduce choices), which, while **semantically equivalent** (as they contain the exact same content) can still result in difference of *several points for the same model*.
 <Note title="Prompt format sensitivity" emoji="📝" variant="danger">
+<Image src={mmluPromptImage} alt="Heatmap showing MMLU evaluation scores across different models (Mistral-7B, Qwen1.5-7B, gemma-7b, phi-2, DeciLM-7B) with different prompt formats. Scores vary by up to 7 points for the same model depending on format." />
+We did some experiments on this (you'll see up to a 7 points difference for the same model on the semantically equivalent prompts, the 5 rightmost columns).
+A [paper observed similar results](https://arxiv.org/abs/2310.11324): models increasingly overfit to specific benchmark prompt formats during training, losing adaptation ability.
+**Other example**: Llama 3.1 models predicted correct MATH-Hard answers but scored poorly on the Open LLM Leaderboard, because they overfit to GSM8K's prompt format and couldn't adapt to the new one for this eval, despite it being provided in few shot examples.
 This [great paper](https://arxiv.org/abs/2407.07890)⭐ also highlights a side effect of this: a number of models are now trained to overfit benchmark prompts and answer formats, to the cost of adaptation to other prompts at evaluation time.
+</Note>
 Some tasks are also prefixed with a task prompt (eg: `The following questions are about <topic>`) - its presence or absence will also affect the scores.