Clémentine committed on
Commit
b1f0eb8
·
1 Parent(s): ff1d2e7
app/src/content/article.mdx CHANGED
@@ -16,8 +16,6 @@ tags:
16
  tableOfContentsAutoCollapse: true
17
  ---
18
 
19
- import ModelInferenceAndEvaluation from "./chapters/general-knowledge/model-inference-and-evaluation.mdx";
20
- import Tokenization from "./chapters/general-knowledge/tokenization.mdx";
21
  import AutomatedBenchmarksBasics from "./chapters/automated-benchmarks/basics.mdx";
22
  import DesigningAutomaticEvaluation from "./chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx";
23
  import AutomatedBenchmarksTips from "./chapters/automated-benchmarks/tips-and-tricks.mdx";
@@ -33,11 +31,11 @@ import ModelAsJudgeTips from "./chapters/model-as-a-judge/tips-and-tricks.mdx";
33
  import TroubleshootingInference from "./chapters/troubleshooting/troubleshooting-inference.mdx";
34
  import TroubleshootingMathParsing from "./chapters/troubleshooting/troubleshooting-math-parsing.mdx";
35
  import TroubleshootingReproducibility from "./chapters/troubleshooting/troubleshooting-reproducibility.mdx";
 
 
36
 
37
- <ModelInferenceAndEvaluation />
38
-
39
- <Tokenization />
40
 
 
41
  <AutomatedBenchmarksBasics />
42
 
43
  <DesigningAutomaticEvaluation />
@@ -45,12 +43,15 @@ import TroubleshootingReproducibility from "./chapters/troubleshooting/troublesh
45
 
46
  <AutomatedBenchmarksTips />
47
 
 
48
  <HumanEvaluationBasics />
49
 
50
  <UsingHumanAnnotators />
51
 
52
  <HumanEvaluationTips />
53
 
 
 
54
  <ModelAsJudgeBasics />
55
 
56
  <GettingJudgeLLM />
@@ -63,10 +64,15 @@ import TroubleshootingReproducibility from "./chapters/troubleshooting/troublesh
63
 
64
  <ModelAsJudgeTips />
65
 
 
66
  <TroubleshootingInference />
67
 
68
  <TroubleshootingMathParsing />
69
 
70
  <TroubleshootingReproducibility />
71
 
 
 
 
 
72
 
 
16
  tableOfContentsAutoCollapse: true
17
  ---
18
 
 
 
19
  import AutomatedBenchmarksBasics from "./chapters/automated-benchmarks/basics.mdx";
20
  import DesigningAutomaticEvaluation from "./chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx";
21
  import AutomatedBenchmarksTips from "./chapters/automated-benchmarks/tips-and-tricks.mdx";
 
31
  import TroubleshootingInference from "./chapters/troubleshooting/troubleshooting-inference.mdx";
32
  import TroubleshootingMathParsing from "./chapters/troubleshooting/troubleshooting-math-parsing.mdx";
33
  import TroubleshootingReproducibility from "./chapters/troubleshooting/troubleshooting-reproducibility.mdx";
34
+ import ModelInferenceAndEvaluation from "./chapters/general-knowledge/model-inference-and-evaluation.mdx";
35
+ import Tokenization from "./chapters/general-knowledge/tokenization.mdx";
36
 
 
 
 
37
 
38
+ ## Automated Benchmarks
39
  <AutomatedBenchmarksBasics />
40
 
41
  <DesigningAutomaticEvaluation />
 
43
 
44
  <AutomatedBenchmarksTips />
45
 
46
+ ## Human Evaluations
47
  <HumanEvaluationBasics />
48
 
49
  <UsingHumanAnnotators />
50
 
51
  <HumanEvaluationTips />
52
 
53
+ ## Model Judges
54
+
55
  <ModelAsJudgeBasics />
56
 
57
  <GettingJudgeLLM />
 
64
 
65
  <ModelAsJudgeTips />
66
 
67
+ ## Troubleshooting Tips
68
  <TroubleshootingInference />
69
 
70
  <TroubleshootingMathParsing />
71
 
72
  <TroubleshootingReproducibility />
73
 
74
+ ## Appendix
75
+ <ModelInferenceAndEvaluation />
76
+
77
+ <Tokenization />
78
 
app/src/content/chapters/automated-benchmarks/basics.mdx CHANGED
@@ -2,11 +2,6 @@
2
  title: "Automated Benchmarks: Basics"
3
  ---
4
 
5
- # Basics
6
-
7
- *Note: Some of this overlaps with [my general blog on evals](https://huggingface.co/blog/clefourrier/llm-evaluation)*
8
- ## What are automated benchmarks?
9
-
10
 Automated benchmarks usually work the following way: you'd like to know how well your model performs on something. This something can be a well-defined concrete **task**, such as `How well can my model classify spam from non-spam emails?`, or a more abstract and general **capability**, such as `How good is my model at math?`.
11
 
12
  From this, you construct an evaluation, using:
@@ -25,7 +20,7 @@ This is more interesting to do on data that the model has never been exposed to
25
 
26
  Note: *A model which can only predict well on its training data (and has not latently learnt more high-level general patterns) is said to be **overfitting**. Similarly to a student who learned test questions by heart without understanding the topic, evaluating LLMs on data that was already present in their training set is scoring them on capabilities they do not possess.*
27
 
28
- ## Pros and cons of using automated benchmarks
29
  Automated benchmarks have the following advantages:
30
 - **Consistency and reproducibility**: You can run the same automated benchmark 10 times on the same model and you'll get the same results (barring variations in hardware or inherent model randomness). This means that you can easily create fair rankings of models for a given task.
31
 - **Scale at limited cost**: They are one of the cheapest ways to evaluate models at the moment.
 
2
  title: "Automated Benchmarks: Basics"
3
  ---
4
 
 
 
 
 
 
5
 Automated benchmarks usually work the following way: you'd like to know how well your model performs on something. This something can be a well-defined concrete **task**, such as `How well can my model classify spam from non-spam emails?`, or a more abstract and general **capability**, such as `How good is my model at math?`.
6
 
7
  From this, you construct an evaluation, using:
 
20
 
21
  Note: *A model which can only predict well on its training data (and has not latently learnt more high-level general patterns) is said to be **overfitting**. Similarly to a student who learned test questions by heart without understanding the topic, evaluating LLMs on data that was already present in their training set is scoring them on capabilities they do not possess.*
22
 
23
+ ### Pros and cons of using automated benchmarks
24
  Automated benchmarks have the following advantages:
25
 - **Consistency and reproducibility**: You can run the same automated benchmark 10 times on the same model and you'll get the same results (barring variations in hardware or inherent model randomness). This means that you can easily create fair rankings of models for a given task.
26
 - **Scale at limited cost**: They are one of the cheapest ways to evaluate models at the moment.
app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx CHANGED
@@ -2,14 +2,17 @@
2
  title: "Designing your automatic evaluation"
3
  ---
4
 
5
- # Designing your automatic evaluation
6
 
7
- ## Choosing a dataset
8
- For your evaluation, you can either select an existing dataset (see [Some evaluation datasets](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/automated-benchmarks/some-evaluation-datasets.md) for examples) or design your own. Through this process, it's very important to keep in mind that **your evaluation result will only be as good as your evaluation dataset**.
9
 
10
- ### Selecting an existing dataset
11
- You must imperatively look at its components.
12
- #### Creation process
 
 
 
 
 
13
  - **Who created the actual samples?**
14
  Imo, expert created dataset > paid annotator dataset ~ crowdsourced dataset > MTurked dataset.
15
  You also want to look for a data card, where you'll find annotator demographics - this can be important to understand the dataset language diversity.
@@ -23,33 +26,28 @@ This is especially important for datasets with the help of underpaid annotators
23
  - **Were the annotators provided with clear data creation guidelines?**
24
  In other words, is your dataset consistent?
25
 
26
- #### Samples
27
  Take 50 random samples and manually inspect them:
28
  - *For quality*:
29
  - are the prompts clear and unambiguous?
30
  - are the answers correct? (*Eg: TriviaQA contains several gold answers (aliases field) per question, sometimes conflicting.*)
31
  - is information missing? (*Eg: MMLU misses reference schematics in a number of questions.*)
32
  - *For relevance to your task*:
33
- - are these questions the kind of questions you want to evaluate an LLM on?
 
34
  - are these examples relevant to your use case?
35
 
 
36
  You also want to know how many samples are present there (to make sure results are statistically significant - 100 samples is usually a minimum for automatic benchmarks).
37
- ### Designing your own
 
38
 You can go several ways when designing your own dataset.
39
- #### Aggregating existing data
40
- You can aggregate existing data from different sources, evaluating a relevant capability for your task. A number of evaluation datasets are for example constructed from aggregating human evaluation datasets (such as MATH, LSAT, etc). In this case, follow the steps above.
41
- #### Using human annotators
42
- There's a whole section on using human annotators in `Human evaluation`, see [Using human annotators](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/human-evaluation/using-human-annotators.md).
43
- #### Using synthetic data
44
- - **Using LLMs**
45
- On this, you can check the very cool [Cosmopedia](https://huggingface.co/blog/cosmopedia) blog by cool HF colleagues! It's mostly studying how to create a synthetic training dataset, but similar techniques can be used for evaluation.
46
- Make sure to manually check/filter/inspect your dataset afterwards (following the above steps).
47
-
48
- - **Using rule-based techniques**
49
- If your task allows, this is a very good way to get a virtually infinite supply of samples and avoid contamination!
50
- For some examples, you can look at [NPHardEval](https://arxiv.org/abs/2312.14890), [DyVal](https://arxiv.org/abs/2309.17167), [MuSR](https://arxiv.org/abs/2310.16049), [BabiQA](https://arxiv.org/abs/1502.05698), etc.
51
-
52
- ## Choosing an inference method
53
 You'll need to choose which kind of inference method to use.
54
 
55
 Using log-probabilities (MCQA, multiple-choice question answering) is very good for multiple-choice questions (usually to test model knowledge, or the ability to disambiguate).
@@ -69,7 +67,7 @@ Using generations (QA, question answering) is very good for any task where you w
69
  - Can be harder to score (see the `metrics` section below)
70
  - Usually slightly more expensive than log likelihood evaluations, especially if they include sampling
71
 
72
- ## Choosing a prompt
73
  The prompt is going to define:
74
  - how much information is given to your model about the task
75
  - how this information is presented to your model.
@@ -94,7 +92,7 @@ When defining your prompt, you need to be aware that:
94
  - for a number of metrics, you want a very constrained generation or output.
95
  *You can learn more about this in the `Constraining model outputs` section of the [Model inference and evaluation](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/general-knowledge/model-inference-and-evaluation.md) page.*
96
 
97
- ## Choosing a metric
98
  If you are looking at **log-probabilities**, your metrics are going to be easy: you'll want to look at accuracy (how often the most likely choice is the best choice). It's important to normalize it by length (either character, token, or pmi). You could also look at perplexity, recall, or f1 score.
99
 
100
  For **generative** evaluations, your range of metrics is going to be wider.
@@ -108,7 +106,7 @@ You'll need to
108
 
109
  More generally, when picking your metric, you need to keep in mind what your task is really about. For some domains (ex: medical, chatbots with public interaction), you don't want to measure the average performance, but need a way to evaluate the **worst performance** you'll get (on medical quality of output, on toxicity, etc). (*To go further, take a look at this [blog](https://ehudreiter.com/2024/07/10/challenges-in-evaluating-llms/)*)
110
 
111
- ## Smart new tasks: what about functional testing?
112
 In the field of code, you want to evaluate generated programs not only on their semantics, but on their actual function. A good way to do so is therefore to check whether code generated to follow a prompt correctly passes a suite of unit tests designed to fit the task.
113
 
114
  This functionality approach is extremely promising, as it
 
2
  title: "Designing your automatic evaluation"
3
  ---
4
 
5
+ ### Designing your automatic evaluation
6
 
 
 
7
 
8
+ #### Selecting or creating a dataset
9
+ For your evaluation, you can either select an existing dataset or design your own. Through this process, it's very important to keep in mind that **your evaluation result will only be as good as your evaluation dataset**.
10
+
11
+ ##### Inspecting an existing dataset
12
+
13
+ You want to study the following:
14
+
15
+ 1. Creation process
16
  - **Who created the actual samples?**
17
  Imo, expert created dataset > paid annotator dataset ~ crowdsourced dataset > MTurked dataset.
18
  You also want to look for a data card, where you'll find annotator demographics - this can be important to understand the dataset language diversity.
 
26
  - **Were the annotators provided with clear data creation guidelines?**
27
  In other words, is your dataset consistent?
28
 
29
+ 2. Samples
30
  Take 50 random samples and manually inspect them:
31
  - *For quality*:
32
  - are the prompts clear and unambiguous?
33
  - are the answers correct? (*Eg: TriviaQA contains several gold answers (aliases field) per question, sometimes conflicting.*)
34
  - is information missing? (*Eg: MMLU misses reference schematics in a number of questions.*)
35
  - *For relevance to your task*:
36
+ - are these questions the kind of questions you want to evaluate an LLM on?
38
  - are these examples relevant to your use case?
39
 
40
+ 3. Quantity
41
  You also want to know how many samples are present there (to make sure results are statistically significant - 100 samples is usually a minimum for automatic benchmarks).
42
+
43
+ ##### Designing your own
44
 You can go several ways when designing your own dataset.
45
+ - **Aggregating existing data**: You can aggregate existing data from different sources, evaluating a relevant capability for your task. A number of evaluation datasets are for example constructed from aggregating human evaluation datasets (such as MATH, LSAT, etc). In this case, follow the steps above.
46
+ - **Using human annotators**: There's a whole section on using human annotators in `Human evaluation`, see [Using human annotators](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/human-evaluation/using-human-annotators.md).
47
+ - **Using synthetic data from models**: On this, you can check the very cool [Cosmopedia](https://huggingface.co/blog/cosmopedia) blog by HF colleagues! It mostly studies how to create a synthetic training dataset, but similar techniques can be used for evaluation. Make sure to manually check/filter/inspect your dataset afterwards (following the above steps).
48
+ - **Using rule-based techniques**: If your task allows, this is a very good way to get a virtually infinite supply of samples and avoid contamination! For some examples, you can look at [NPHardEval](https://arxiv.org/abs/2312.14890), [DyVal](https://arxiv.org/abs/2309.17167), [MuSR](https://arxiv.org/abs/2310.16049), [BabiQA](https://arxiv.org/abs/1502.05698), etc.
49
+
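
As a small illustration of the rule-based route above, here is a minimal sketch: samples are generated programmatically from a template with a known gold answer. The template and field names are purely illustrative, not taken from any of the benchmarks cited above.

```python
import random

def make_arithmetic_sample(rng: random.Random) -> dict:
    """Generate one synthetic addition word problem with a known gold answer."""
    a, b = rng.randint(2, 99), rng.randint(2, 99)
    question = f"A shelf holds {a} books and another shelf holds {b}. How many books in total?"
    return {"question": question, "gold": str(a + b)}

rng = random.Random(42)  # fix the seed so the generated benchmark is reproducible
dataset = [make_arithmetic_sample(rng) for _ in range(200)]
print(dataset[0])
```

Since the samples are generated on the fly, you can re-draw a fresh, uncontaminated test set whenever you suspect the previous one has leaked.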
50
+ #### Choosing an inference method for your model
 
 
 
 
 
 
 
 
51
 You'll need to choose which kind of inference method to use.
52
 
53
 Using log-probabilities (MCQA, multiple-choice question answering) is very good for multiple-choice questions (usually to test model knowledge, or the ability to disambiguate).
 
67
  - Can be harder to score (see the `metrics` section below)
68
  - Usually slightly more expensive than log likelihood evaluations, especially if they include sampling
69
 
70
+ #### Choosing a prompt
71
  The prompt is going to define:
72
  - how much information is given to your model about the task
73
  - how this information is presented to your model.
 
92
  - for a number of metrics, you want a very constrained generation or output.
93
  *You can learn more about this in the `Constraining model outputs` section of the [Model inference and evaluation](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/general-knowledge/model-inference-and-evaluation.md) page.*
94
 
95
+ #### Choosing a metric
96
  If you are looking at **log-probabilities**, your metrics are going to be easy: you'll want to look at accuracy (how often the most likely choice is the best choice). It's important to normalize it by length (either character, token, or pmi). You could also look at perplexity, recall, or f1 score.
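
As a small illustration, here is what character-length normalization can look like once you already have the summed log-probability of each choice (a sketch with illustrative names, not the exact implementation of any evaluation harness):

```python
def pick_choice(choices: list[str], logprobs: list[float]) -> int:
    """Return the index of the choice with the highest log-probability per character."""
    scores = [lp / max(len(choice), 1) for choice, lp in zip(choices, logprobs)]
    return max(range(len(scores)), key=scores.__getitem__)

def accuracy(predictions: list[int], golds: list[int]) -> float:
    """Fraction of samples where the picked choice is the gold one."""
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)
```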
97
 
98
  For **generative** evaluations, your range of metrics is going to be wider.
 
106
 
107
  More generally, when picking your metric, you need to keep in mind what your task is really about. For some domains (ex: medical, chatbots with public interaction), you don't want to measure the average performance, but need a way to evaluate the **worst performance** you'll get (on medical quality of output, on toxicity, etc). (*To go further, take a look at this [blog](https://ehudreiter.com/2024/07/10/challenges-in-evaluating-llms/)*)
108
 
109
+ #### Smart new tasks: what about functional testing?
110
 In the field of code, you want to evaluate generated programs not only on their semantics, but on their actual function. A good way to do so is therefore to check whether code generated to follow a prompt correctly passes a suite of unit tests designed to fit the task.
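
A toy sketch of the idea, assuming the tests are plain `assert` statements (in practice you would run this in a sandbox and never `exec` untrusted generations directly):

```python
def passes_tests(candidate_code: str, tests: list[str]) -> bool:
    """Return True if the generated code defines something that passes every unit test."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)   # define the generated function(s)
        for test in tests:
            exec(test, namespace)         # each test raises AssertionError on failure
        return True
    except Exception:
        return False

generated = "def add(a, b):\n    return a + b"
print(passes_tests(generated, ["assert add(2, 2) == 4", "assert add(-1, 1) == 0"]))  # True
```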
111
 
112
  This functionality approach is extremely promising, as it
app/src/content/chapters/automated-benchmarks/some-evaluation-datasets.mdx CHANGED
@@ -2,7 +2,7 @@
2
  title: "Some evaluation datasets"
3
  ---
4
 
5
- # Some evaluation datasets
6
 
7
 If the task you are interested in is already well studied, chances are that a dataset exists for it.
8
 
@@ -14,7 +14,7 @@ However, careful:
14
  (*This will also be updated with post LLM evals at some point*)
15
  - They are likely contaminated, as they have been publicly on the web for a number of years. However, it doesn't mean they won't hold signal for your task!
16
 
17
- ## Math specific datasets
18
 
19
  | Evaluation name | Task type | Publication date | Data size | Task data | Task/Paper content | Source | Dataset | Comments |
20
  |----- |------ |- |-- |------------|------------- |--------|-------- |---------- |
@@ -66,7 +66,7 @@ However, careful:
66
 | TemplateGSM | LLM-generated data | 2024 | 7M | GPT4-generated math word problems inspired in shape by GSM8K | Paper uses GPT4 generated meta-template to generate problems by changing parameters. Uses a verifier to ensure usability | [Paper](https://templatemath.github.io/TemplateMath_Part_I.pdf) | [HuggingFace](https://huggingface.co/datasets/math-ai/TemplateGSM) | - Since everything is LLM generated, I would expect stronger proofs of quality |
67
  | TheoremQA | Online sources adaptations | 2023 | 800 | QAs about university level theorems | Protocol: Uses GPT4 to enumerate subfields of relevant domains, then plausible theorems lists, then uses domain experts to actually look for said theorems, then look for QA on the web concerning them | [Paper](https://arxiv.org/abs/2305.12524) | [HuggingFace](https://huggingface.co/datasets/TIGER-Lab/TheoremQA) | |
68
 
69
- ## Pre-LLM datasets
70
 
71
  | Evaluation name | Task type | Task data | Task content | Source | Dataset | Comments |
72
  |--- |--- |--- |--- |--- |--- |--- |
@@ -146,7 +146,7 @@ However, careful:
146
  | XSUM | Summarization | 226K news articles (BBC, 2010 to 2017) matched with their single sentence summary (comes from the article). Task: Summarize. (Domains: News, Politics, Sports, Weather, Business, Technology, Science, Health, Family, Education, Entertainment and Arts) | | [Paper](https://aclanthology.org/D18-1206/) | [Github](https://github.com/EdinburghNLP/XSum) | |
147
  | XSum | Generation, Summarization | 226K news summary/article pairs from the BBC (2010 - 2017) extracted from the WayBack machine | | [Paper](https://aclanthology.org/D18-1206/) | [Hugging Face](https://huggingface.co/datasets/xsum)| Could be interesting to manually check if the model recent knowledge creates discrepancies in the summaries of old news. |
148
 
149
- ## Dataset ideas to manually reproduce
150
 
151
  | Evaluation name | Task type | Task content | Source | Dataset | Comments | |
152
  | ------------------------------ | ---------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | --- |
 
2
  title: "Some evaluation datasets"
3
  ---
4
 
5
+ ### Some evaluation datasets
6
 
7
 If the task you are interested in is already well studied, chances are that a dataset exists for it.
8
 
 
14
  (*This will also be updated with post LLM evals at some point*)
15
  - They are likely contaminated, as they have been publicly on the web for a number of years. However, it doesn't mean they won't hold signal for your task!
16
 
17
+ ### Math-specific datasets
18
 
19
  | Evaluation name | Task type | Publication date | Data size | Task data | Task/Paper content | Source | Dataset | Comments |
20
  |----- |------ |- |-- |------------|------------- |--------|-------- |---------- |
 
66
 | TemplateGSM | LLM-generated data | 2024 | 7M | GPT4-generated math word problems inspired in shape by GSM8K | Paper uses GPT4 generated meta-template to generate problems by changing parameters. Uses a verifier to ensure usability | [Paper](https://templatemath.github.io/TemplateMath_Part_I.pdf) | [HuggingFace](https://huggingface.co/datasets/math-ai/TemplateGSM) | - Since everything is LLM generated, I would expect stronger proofs of quality |
67
  | TheoremQA | Online sources adaptations | 2023 | 800 | QAs about university level theorems | Protocol: Uses GPT4 to enumerate subfields of relevant domains, then plausible theorems lists, then uses domain experts to actually look for said theorems, then look for QA on the web concerning them | [Paper](https://arxiv.org/abs/2305.12524) | [HuggingFace](https://huggingface.co/datasets/TIGER-Lab/TheoremQA) | |
68
 
69
+ ### Pre-LLM datasets
70
 
71
  | Evaluation name | Task type | Task data | Task content | Source | Dataset | Comments |
72
  |--- |--- |--- |--- |--- |--- |--- |
 
146
  | XSUM | Summarization | 226K news articles (BBC, 2010 to 2017) matched with their single sentence summary (comes from the article). Task: Summarize. (Domains: News, Politics, Sports, Weather, Business, Technology, Science, Health, Family, Education, Entertainment and Arts) | | [Paper](https://aclanthology.org/D18-1206/) | [Github](https://github.com/EdinburghNLP/XSum) | |
147
  | XSum | Generation, Summarization | 226K news summary/article pairs from the BBC (2010 - 2017) extracted from the WayBack machine | | [Paper](https://aclanthology.org/D18-1206/) | [Hugging Face](https://huggingface.co/datasets/xsum)| Could be interesting to manually check if the model recent knowledge creates discrepancies in the summaries of old news. |
148
 
149
+ ### Dataset ideas to manually reproduce
150
 
151
  | Evaluation name | Task type | Task content | Source | Dataset | Comments | |
152
  | ------------------------------ | ---------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | --- |
app/src/content/chapters/automated-benchmarks/tips-and-tricks.mdx CHANGED
@@ -2,9 +2,25 @@
2
  title: "Automated Benchmarks: Tips and tricks"
3
  ---
4
 
5
- # Tips and tricks
6
 
7
- ## Managing contamination
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
8
  In general, you should assume that a dataset publicly available on the internet is or will be contaminated.
9
 
10
  Solutions to mitigate this include:
@@ -15,9 +31,7 @@ Solutions to mitigate this include:
15
 
16
 However, a contaminated dataset can still be interesting and provide signal during training.
17
 
18
- ## Practical issues you might encounter
19
-
20
- ### Fine-tuned models, system prompts and chat templates
21
  A number of instruction tuned models are going to perform terribly if you do not make sure to:
22
  - add their system prompt at the very beginning of inference
23
  - prompt them using a chat template (usually adding `Assistant` and `User` prefixes to the dialogue turns - learn more about this in [this cool guide](https://huggingface.co/docs/transformers/main/en/chat_templating))
@@ -26,7 +40,7 @@ It's also very important to not assume that different tokenizers will behave the
26
 
27
  ![Spacing, tokenization and template](https://pbs.twimg.com/media/GPANfpiasAA9b6F?format=png&name=medium)
28
 
29
- ### Tokenization
30
 
31
  1. **Tokenizing the context and choices together or separately**
32
 
@@ -50,14 +64,14 @@ When looking at multilingual evaluations, you'll also need to see how to tokeniz
50
 
51
  Code models usually have been trained with `\n\t` as a single token. This means that when generating text, they will often generate `\n\t` in one step. A task which defines `\n` as an end of sentence token (= to stop the generation) will let the model continue generating after a `\n\t`, if predicted as one token, since it's not the same as `\n`. But you would actually still want the model to stop. In these cases, you either need to update your end of sentence tokens, or define a mechanism to backtrack on the character representation of the latest tokens to stop (and cut) the generation a posteriori.
52
 
53
- ### Easy speed up for MCQA evaluations
54
  You can speed up your MCQA predictions by a lot if you make sure your model needs to predict only one token for the task.
55
 
56
  This way, instead of running your `number_of_choices` predictions (`context + choice 1`, `context + choice 2`, etc), you can simply run inference on `context` and compute the probability distribution on the full vocabulary (which will include all your one token choices) to get your logprobabilities of interest, and do this step in one pass.
57
 
58
  (That's how we do it in `lighteval`).
59
 
60
- ## Unexpectedly bad results on generative evaluations
61
 
62
  The first thing to do is always to inspect your model generations in detail. Some frequent things to look for when troubleshooting are:
63
  - too strict model output parsing (before computing the metric) which leads to the answer being lost
 
2
  title: "Automated Benchmarks: Tips and tricks"
3
  ---
4
 
 
5
 
6
+ ### Pros and cons of using automated benchmarks
7
+ Automated benchmarks have the following advantages:
8
+ - **Consistency and reproducibility**: You can run the same automated benchmark 10 times on the same model and you'll get the same results (barring variations in hardware or inherent model randomness). This means that you can easily create fair rankings of models for a given task.
9
+ - **Scale at limited cost**: They are one of the cheapest ways to evaluate models at the moment.
10
+ - **Understandability**: Most automated metrics are very understandable.
11
+ *Eg: an exact match will tell you if the generated text matches perfectly with the reference, and an accuracy score will tell you in how many cases the selected choice was the correct one (this will be a bit less the case for metrics such as `BLEU` or `ROUGE` for example).*
12
+ - **Dataset quality**: A number of automated benchmarks use expert-generated datasets or pre-existing high-quality data (like MMLU or MATH). However, this does not mean these datasets are perfect: for MMLU, several errors have been identified in samples afterwards, from parsing issues to actually nonsensical questions, leading to the creation of several follow-up datasets, like MMLU-Pro and MMLU-Redux.
13
+
14
+ However, they also present the following limitations:
15
+ - **Reduced use on more complex tasks**: Automated benchmarks work well for tasks where performance is easy to define and assess (for example, classification). More complex capabilities, on the other hand, are harder to decompose into well-defined and precise tasks.
16
+ *Eg: what does "good at math" mean? Is it being good at arithmetic? - at logic? - able to reason on new mathematical concepts?*
17
+ This led to the use of more **generalist** evaluations, which no longer decompose capabilities into sub-tasks, but instead assume that general performance will be a **good proxy** for what we aim to measure.
18
+ - **Contamination**: Once a dataset is published publicly in plain text, it will end up in model training datasets. This means that you have no guarantee when scoring a model that it has not parsed the evaluation data before.
19
+
20
+
21
+ ### Tips and tricks
22
+
23
+ #### Managing contamination
24
  In general, you should assume that a dataset publicly available on the internet is or will be contaminated.
25
 
26
  Solutions to mitigate this include:
 
31
 
32
 However, a contaminated dataset can still be interesting and provide signal during training.
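
As a toy illustration of a contamination check (real decontamination pipelines are more elaborate, but n-gram overlap is the common core of most of them):

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shares_ngram(eval_sample: str, training_doc: str, n: int = 8) -> bool:
    """Flag an evaluation sample that shares any n-gram with a training document."""
    return bool(ngrams(eval_sample, n) & ngrams(training_doc, n))
```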
33
 
34
+ #### Managing fine-tuned models, system prompts and chat templates
 
 
35
  A number of instruction tuned models are going to perform terribly if you do not make sure to:
36
  - add their system prompt at the very beginning of inference
37
  - prompt them using a chat template (usually adding `Assistant` and `User` prefixes to the dialogue turns - learn more about this in [this cool guide](https://huggingface.co/docs/transformers/main/en/chat_templating))
 
40
 
41
  ![Spacing, tokenization and template](https://pbs.twimg.com/media/GPANfpiasAA9b6F?format=png&name=medium)
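
For instance, with `transformers`, `apply_chat_template` renders the turns with the model's own special tokens and role prefixes (the model below is just an example; whether it expects a system prompt depends on its chat template):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")  # example model

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Question: What is the capital of France?\nAnswer:"},
]
# tokenize=False returns the formatted string, so you can inspect exactly what the model will see
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```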
42
 
43
+ #### Beware of tokenization
44
 
45
  1. **Tokenizing the context and choices together or separately**
46
 
 
64
 
65
  Code models usually have been trained with `\n\t` as a single token. This means that when generating text, they will often generate `\n\t` in one step. A task which defines `\n` as an end of sentence token (= to stop the generation) will let the model continue generating after a `\n\t`, if predicted as one token, since it's not the same as `\n`. But you would actually still want the model to stop. In these cases, you either need to update your end of sentence tokens, or define a mechanism to backtrack on the character representation of the latest tokens to stop (and cut) the generation a posteriori.
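
The backtracking option can stay simple: decode the generation, then cut it at the first stop string at the character level, regardless of how it was tokenized. A minimal sketch:

```python
def truncate_at_stop(generated_text: str, stop_sequences: list[str]) -> str:
    """Cut a generation at the earliest stop string, even if it was merged into a larger token."""
    cut = len(generated_text)
    for stop in stop_sequences:
        idx = generated_text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generated_text[:cut]

print(truncate_at_stop("def f():\n\treturn 1\n", ["\n"]))  # -> "def f():"
```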
66
 
67
+ #### Tip: an easy speed up for MCQA evaluations
68
  You can speed up your MCQA predictions by a lot if you make sure your model needs to predict only one token for the task.
69
 
70
  This way, instead of running your `number_of_choices` predictions (`context + choice 1`, `context + choice 2`, etc), you can simply run inference on `context` and compute the probability distribution on the full vocabulary (which will include all your one token choices) to get your logprobabilities of interest, and do this step in one pass.
71
 
72
  (That's how we do it in `lighteval`).
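
A rough sketch of the trick with `transformers` (not `lighteval`'s actual implementation; the model and the one-token choices ` A`/` B`/` C` are just an example):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

context = "Question: 2 + 2 = ?\n A. 3\n B. 4\n C. 5\nAnswer:"
choices = [" A", " B", " C"]  # each must encode to exactly one token for this to work

inputs = tokenizer(context, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]        # one pass: distribution over the whole vocabulary
log_probs = torch.log_softmax(logits, dim=-1)

choice_ids = [tokenizer.encode(choice)[0] for choice in choices]
scores = [log_probs[choice_id].item() for choice_id in choice_ids]
print(choices[scores.index(max(scores))])
```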
73
 
74
+ #### What to do if you get unexpectedly bad results on generative evaluations
75
 
76
  The first thing to do is always to inspect your model generations in detail. Some frequent things to look for when troubleshooting are:
77
  - too strict model output parsing (before computing the metric) which leads to the answer being lost
app/src/content/chapters/general-knowledge/model-inference-and-evaluation.mdx CHANGED
@@ -7,9 +7,9 @@ import llmLogprob from '../../assets/image/llm_logprob.png';
7
  import llmGen from '../../assets/image/llm_gen.png';
8
  import Image from '../../../components/Image.astro';
9
 
10
- # Model inference and evaluation
11
 
12
- ## Introduction
13
 Current large language models work in a simple way: given some text as input, they have learned to predict a plausible follow-up.
14
 
15
  This is done in two steps.
@@ -22,7 +22,7 @@ The input text (called a *prompt* at inference) is first split into *tokens*, sm
22
 
23
  From this input text, the LLM generates a probability distribution of the most likely next tokens over all the vocabulary. To get a continued generation, we can take the most probable token (give or take some added randomness to get more interesting outputs) as the next one, then repeat the operation, using the new token as the end of the prompt, etc.
24
 
25
- ## What do you want to predict?
26
  LLM evaluations mostly fall into 2 categories:
27
 - Given a prompt and one (or several) answers, what is the probability of said answer(s) for my model?
28
  - Given a prompt, what text does my model generate?
@@ -57,7 +57,7 @@ We can then compare this generation with references and score the distance betwe
57
  - ⭐ [Blog on several ways to evaluate MMLU](https://huggingface.co/blog/open-llm-leaderboard-mmlu) , by my team at Hugging Face. I recommend reading it if you want to delve deeper into the differences between multi choice log-likelihood evaluations and generative ones, including what it can mean with respect to score changes
58
  - The above illustrations come from the blog and have been made by Thom Wolf
59
  - ⭐ [A beautiful mathematical formalization of the above inference methods](https://arxiv.org/abs/2405.14782v2), from EleutherAI. Go to the Appendix directly.
60
- ## Constraining model outputs
61
  In a number of cases, we want the model output to follow a specific format, for example to compare them to a reference.
62
  ### Using a prompt
63
  The easiest way to do this is to add a task prompt which contains very specific instructions as to how the model should answer (`Provide numerical answers in digits.`,`Use no abbreviation.`, etc).
 
7
  import llmGen from '../../assets/image/llm_gen.png';
8
  import Image from '../../../components/Image.astro';
9
 
10
+ ### Model inference and evaluation
11
 
12
+ ### Introduction
13
 Current large language models work in a simple way: given some text as input, they have learned to predict a plausible follow-up.
14
 
15
  This is done in two steps.
 
22
 
23
  From this input text, the LLM generates a probability distribution of the most likely next tokens over all the vocabulary. To get a continued generation, we can take the most probable token (give or take some added randomness to get more interesting outputs) as the next one, then repeat the operation, using the new token as the end of the prompt, etc.
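
To make this concrete, here is a minimal greedy decoding loop with `transformers` (gpt2 is only used as a small illustrative model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # small illustrative model
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
for _ in range(10):                                  # generate 10 tokens, one at a time
    with torch.no_grad():
        logits = model(ids).logits[0, -1]            # distribution over the vocabulary for the next token
    next_id = torch.argmax(logits)                   # greedy: take the most probable token, no sampling
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
print(tokenizer.decode(ids[0]))
```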
24
 
25
+ ### What do you want to predict?
26
  LLM evaluations mostly fall into 2 categories:
27
 - Given a prompt and one (or several) answers, what is the probability of said answer(s) for my model?
28
  - Given a prompt, what text does my model generate?
 
57
  - ⭐ [Blog on several ways to evaluate MMLU](https://huggingface.co/blog/open-llm-leaderboard-mmlu) , by my team at Hugging Face. I recommend reading it if you want to delve deeper into the differences between multi choice log-likelihood evaluations and generative ones, including what it can mean with respect to score changes
58
  - The above illustrations come from the blog and have been made by Thom Wolf
59
  - ⭐ [A beautiful mathematical formalization of the above inference methods](https://arxiv.org/abs/2405.14782v2), from EleutherAI. Go to the Appendix directly.
60
+ ### Constraining model outputs
61
  In a number of cases, we want the model output to follow a specific format, for example to compare them to a reference.
62
  ### Using a prompt
63
  The easiest way to do this is to add a task prompt which contains very specific instructions as to how the model should answer (`Provide numerical answers in digits.`,`Use no abbreviation.`, etc).
app/src/content/chapters/general-knowledge/tokenization.mdx CHANGED
@@ -2,9 +2,9 @@
2
  title: "Tokenization"
3
  ---
4
 
5
- # Tokenization
6
 
7
- ## Why and how do we tokenize text?
8
  Since large language models are actually big mathematical functions, they eat numbers, not text.
9
 
10
  Say you want to transform a sentence to numbers. You first need to decide how to cut your sentence into small pieces, then map every small piece to a number; this is *tokenization*.
@@ -19,18 +19,18 @@ Some people therefore had the idea to cut words into sub-words, and assign index
19
  This was initially done using morpho-syntactic rules ("morpho-syntax" is like the grammar of word creation). Now most people use byte pair encoding (BPE), a smart statistical method to create the sub-words automatically depending on their frequency in a reference text.
20
 
21
  So as a summary: tokenization is a way to map small units of texts (which can be one or several characters, up to the word level) to numbers (similar to an index). When you want to process text, your input text (called a *prompt* at inference) is split into these *tokens* by a tokenizer. The whole range of tokens a model or tokenizer can parse is called its *vocabulary*.
22
- #### Going further: Understanding tokenization
23
  I advise reading one of the first 2 links in depth.
24
  - ⭐ [Explanation of different tokenization methods in the 🤗 NLP Course](https://huggingface.co/learn/nlp-course/en/chapter2/4)
25
  - ⭐ [Conceptual guide about tokenization in the 🤗 doc](https://huggingface.co/docs/transformers/en/tokenizer_summary)
26
 - [Course by Jurafsky on tokenization (and other things)](https://web.stanford.edu/~jurafsky/slp3/2.pdf) - more academic in its approach, skip to 2.5 and 2.6 (the rest is interesting too but too broad)
27
 
28
- #### Going further: Byte Pair Encoding
29
  - ⭐ [Explanation of BPE in the 🤗 NLP Course](https://huggingface.co/learn/nlp-course/en/chapter6/5)
30
  - [Paper introducing BPE to NLP](https://aclanthology.org/P16-1162/)
31
 
32
 
33
- ## Some of the many problems of tokenizations
34
  ### Choosing the correct vocabulary size
35
  The size of the vocabulary indicates how many individual tokens (for example, sub-words) the model will have to learn.
36
 
@@ -47,7 +47,7 @@ Let's go back to our above example, where we tokenized words derived from `simil
47
 While the first method splits `similarly` into tokens which each carry some semantic meaning, this is not the case with the second method: with too small a vocabulary, we lose some semantic representation. The difference in representation length also means that it's many times more costly to generate our word with a smaller vocabulary (9 tokens instead of 2, so about 4.5 times more costly!).
48
 
49
  At the moment, most people seem to use heuristics for vocabulary size, which seems correlated to number of languages covered and model size, so it's likely that using a number of tokens close to the reference models of a similar size could work for you.
50
- #### Going further: Rare tokens effect
51
  - [SolidGoldMagikarp post on Less Wrong](https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation)
52
  - Very interesting read on how some people identified very rare tokens in Open AI's vocabulary - this is quite cool because it's done without access to the model's internals (we don't know what the training data contains for example)
53
  - [Fishing for Magikarp, paper by Cohere](https://arxiv.org/abs/2405.05417)
@@ -63,7 +63,7 @@ However, if you want to allow your tokenizer to correctly split text in other la
63
 
64
 This effect leads to an unfairness in multilingual tokenization: some (less frequent, or *lower-resourced*) languages require orders of magnitude more tokens than English to generate a sentence of equivalent length.
65
 
66
- #### Going further: Language and tokenization
67
  - ⭐ [A beautiful breakdown and demo by Yennie Jun on tokenization issues across languages](https://www.artfish.ai/p/all-languages-are-not-created-tokenized)
68
  - The breakdown in itself is very clear, and it's worth playing around with the [demo space](https://huggingface.co/spaces/yenniejun/tokenizers-languages)
69
  - ⭐ [A demo by Aleksandar Petrov on unfairness of tokenization](https://aleksandarpetrov.github.io/tokenization-fairness/)
@@ -71,6 +71,6 @@ This effect leads to an unfairness in multilingual tokenization: some (less freq
71
 
72
  ### What about numbers?
73
  When building your tokenizer, you need to decide what to do about numbers. Do you only index 0 to 9, and assume all other numbers will be compositions of digits, or do you want to store numbers up to, say, one billion, individually? Current well known models display a range of approaches to this, but it's unclear what works better to allow mathematical reasoning. Maybe new approaches to tokenization, such as hierarchical tokenization, might be needed for this.
74
- #### Going further: Number tokenization
75
  - ⭐ [A nice visual demo by Yennie Jun of how tokenizers of Anthropic, Meta, OpenAI, and Mistral models split numbers](https://www.artfish.ai/p/how-would-you-tokenize-or-break-down)
76
  - [Small history by Beren Millidge of the evolution of number tokenization through the years](https://www.beren.io/2024-05-11-Integer-tokenization-is-now-much-less-insane/)
 
2
  title: "Tokenization"
3
  ---
4
 
5
+ ### Tokenization
6
 
7
+ ### Why and how do we tokenize text?
8
  Since large language models are actually big mathematical functions, they eat numbers, not text.
9
 
10
  Say you want to transform a sentence to numbers. You first need to decide how to cut your sentence into small pieces, then map every small piece to a number; this is *tokenization*.
 
19
  This was initially done using morpho-syntactic rules ("morpho-syntax" is like the grammar of word creation). Now most people use byte pair encoding (BPE), a smart statistical method to create the sub-words automatically depending on their frequency in a reference text.
20
 
21
  So as a summary: tokenization is a way to map small units of texts (which can be one or several characters, up to the word level) to numbers (similar to an index). When you want to process text, your input text (called a *prompt* at inference) is split into these *tokens* by a tokenizer. The whole range of tokens a model or tokenizer can parse is called its *vocabulary*.
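
You can see all of this directly with any tokenizer from the Hub (gpt2's is used here only as an example; the exact ids and pieces depend on the vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example tokenizer

text = "Tokenization maps text to numbers."
ids = tokenizer.encode(text)                    # text -> token ids
tokens = tokenizer.convert_ids_to_tokens(ids)   # the sub-word pieces behind those ids

print(ids)
print(tokens)
print(tokenizer.decode(ids))                    # and back to the original string
print(len(tokenizer))                           # the size of the vocabulary
```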
22
+ ### Going further: Understanding tokenization
23
  I advise reading one of the first 2 links in depth.
24
  - ⭐ [Explanation of different tokenization methods in the 🤗 NLP Course](https://huggingface.co/learn/nlp-course/en/chapter2/4)
25
  - ⭐ [Conceptual guide about tokenization in the 🤗 doc](https://huggingface.co/docs/transformers/en/tokenizer_summary)
26
 - [Course by Jurafsky on tokenization (and other things)](https://web.stanford.edu/~jurafsky/slp3/2.pdf) - more academic in its approach, skip to 2.5 and 2.6 (the rest is interesting too but too broad)
27
 
28
+ ### Going further: Byte Pair Encoding
29
  - ⭐ [Explanation of BPE in the 🤗 NLP Course](https://huggingface.co/learn/nlp-course/en/chapter6/5)
30
  - [Paper introducing BPE to NLP](https://aclanthology.org/P16-1162/)
31
 
32
 
33
+ ### Some of the many problems of tokenization
34
  ### Choosing the correct vocabulary size
35
  The size of the vocabulary indicates how many individual tokens (for example, sub-words) the model will have to learn.
36
 
 
47
 While the first method splits `similarly` into tokens which each carry some semantic meaning, this is not the case with the second method: with too small a vocabulary, we lose some semantic representation. The difference in representation length also means that it's many times more costly to generate our word with a smaller vocabulary (9 tokens instead of 2, so about 4.5 times more costly!).
48
 
49
  At the moment, most people seem to use heuristics for vocabulary size, which seems correlated to number of languages covered and model size, so it's likely that using a number of tokens close to the reference models of a similar size could work for you.
50
+ ### Going further: Rare tokens effect
51
  - [SolidGoldMagikarp post on Less Wrong](https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation)
52
  - Very interesting read on how some people identified very rare tokens in Open AI's vocabulary - this is quite cool because it's done without access to the model's internals (we don't know what the training data contains for example)
53
  - [Fishing for Magikarp, paper by Cohere](https://arxiv.org/abs/2405.05417)
 
63
 
64
 This effect leads to an unfairness in multilingual tokenization: some (less frequent, or *lower-resourced*) languages require orders of magnitude more tokens than English to generate a sentence of equivalent length.
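
You can measure this yourself by counting tokens for roughly parallel sentences (the tokenizer and sentences below are just an example; exact counts depend on the tokenizer):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # an English-centric tokenizer

sentences = {
    "English": "I would like a cup of tea, please.",
    "French": "Je voudrais une tasse de thé, s'il vous plaît.",
    "Thai": "ขอชาหนึ่งถ้วยหน่อยค่ะ",
}
for language, sentence in sentences.items():
    # Lower-resourced scripts typically need many more tokens for a sentence of similar meaning
    print(language, len(tokenizer.encode(sentence)))
```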
65
 
66
+ ### Going further: Language and tokenization
67
  - ⭐ [A beautiful breakdown and demo by Yennie Jun on tokenization issues across languages](https://www.artfish.ai/p/all-languages-are-not-created-tokenized)
68
  - The breakdown in itself is very clear, and it's worth playing around with the [demo space](https://huggingface.co/spaces/yenniejun/tokenizers-languages)
69
  - ⭐ [A demo by Aleksandar Petrov on unfairness of tokenization](https://aleksandarpetrov.github.io/tokenization-fairness/)
 
71
 
72
  ### What about numbers?
73
  When building your tokenizer, you need to decide what to do about numbers. Do you only index 0 to 9, and assume all other numbers will be compositions of digits, or do you want to store numbers up to, say, one billion, individually? Current well known models display a range of approaches to this, but it's unclear what works better to allow mathematical reasoning. Maybe new approaches to tokenization, such as hierarchical tokenization, might be needed for this.
74
+ ### Going further: Number tokenization
75
  - ⭐ [A nice visual demo by Yennie Jun of how tokenizers of Anthropic, Meta, OpenAI, and Mistral models split numbers](https://www.artfish.ai/p/how-would-you-tokenize-or-break-down)
76
  - [Small history by Beren Millidge of the evolution of number tokenization through the years](https://www.beren.io/2024-05-11-Integer-tokenization-is-now-much-less-insane/)
app/src/content/chapters/human-evaluation/basics.mdx CHANGED
@@ -2,24 +2,24 @@
2
  title: "Human Evaluation: Basics"
3
  ---
4
 
5
- # Basics
6
-
7
- ## What is human evaluation?
8
- Human evaluation is simply asking humans to evaluate models.
9
- In this document, we'll look at post-hoc evaluation: your model has been trained, you have a given task in mind, and humans are providing scores.
10
 
11
  ### Systematic evaluation
12
  There are 3 main ways to do this in a systematic manner.
13
 
14
- If **you don't have a dataset**, but want to explore a set of capabilities, you provide humans with a task and scoring guidelines (eg: `try to make both these model output toxic language; a model gets 0 if it was toxic, 1 if it was not`), and access to one (or several) model(s) that they can interact with, then ask to provide their scores and reasoning.
 
 
 
 
15
 
16
- If **you already have a dataset** (eg: `a set of prompts that you want to make sure your model will not answer`), you prompt your model with them, and provide the prompt, output and scoring guidelines to humans (`the model gets 0 if it answers with private information, 1 otherwise`).
17
 
18
 Lastly, if **you already have a dataset and scores**, you can ask humans to review your evaluation method by doing [error annotation](https://ehudreiter.com/2022/06/01/error-annotations-to-evaluate/) (*it can also be used as a scoring system in the above category*). It's a very important step when testing a new evaluation system, but it technically falls under evaluating an evaluation, so it's slightly out of scope here.
19
 
20
  Notes:
21
- - *For evaluation of already deployed production models, you can also ask users for feedback, and do A/B testing then.*
22
- - *[AI audits](https://arxiv.org/abs/2401.14462) (external systematic evaluation of models) are usually human based, but out of scope for this document.
23
 
24
  ### Casual evaluation
25
  Two other approaches exist to do human-based evaluation, in a more casual way.
@@ -28,7 +28,7 @@ Two other approaches exist to do human-based evaluation, in a more casual way.
28
 
29
  **Arenas** are crowdsourced human evaluation to rank models.
30
  A well known example of this is the [LMSYS chatbot arena](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard), where community users are asked to chat with models until they find one is better than the other. Votes are then aggregated in an Elo ranking (a ranking of matches) to select which model is "the best".
31
- ## Pros and cons of human evaluation
32
 
33
  Human evaluation is very interesting for the following reasons:
34
  - **Flexibility**: If you define clearly enough what you are evaluating, you can get scores for about anything!
 
2
  title: "Human Evaluation: Basics"
3
  ---
4
 
5
+ Human evaluation is simply asking humans to evaluate models. In this document, we'll look at post-hoc evaluation: your model has been trained, you have a given task in mind, and humans are providing scores.
 
 
 
 
6
 
7
  ### Systematic evaluation
8
  There are 3 main ways to do this in a systematic manner.
9
 
10
+ If **you don't have a dataset**, but want to explore a set of capabilities, you provide humans with
11
+ - a task and scoring guidelines (e.g. *Try to make both these models output toxic language; a model gets 0 if it was toxic, 1 if it was not.*)
12
+ - access to one (or several) model(s) that they can interact with,
13
+
14
+ then ask them to provide their scores and reasoning.
15
 
16
+ If **you already have a dataset** (eg: a set of *prompts that you want your model to never answer*, for example for safety purposes), you prompt your model with them, and provide the prompt, output and scoring guidelines to humans.
17
 
18
 Lastly, if **you already have a dataset and scores**, you can ask humans to review your evaluation method by doing [error annotation](https://ehudreiter.com/2022/06/01/error-annotations-to-evaluate/) (*it can also be used as a scoring system in the above category*). It's a very important step when testing a new evaluation system, but it technically falls under evaluating an evaluation, so it's slightly out of scope here.
19
 
20
  Notes:
21
+ - For evaluation of already deployed production models, you can also ask users for feedback, and do A/B testing then.
22
+ - [AI audits](https://arxiv.org/abs/2401.14462) (external systematic evaluation of models) are usually human based, but out of scope for this document.
23
 
24
  ### Casual evaluation
25
  Two other approaches exist to do human-based evaluation, in a more casual way.
 
28
 
29
  **Arenas** are crowdsourced human evaluation to rank models.
30
  A well known example of this is the [LMSYS chatbot arena](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard), where community users are asked to chat with models until they find one is better than the other. Votes are then aggregated in an Elo ranking (a ranking of matches) to select which model is "the best".
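
To make the Elo part concrete, here is a minimal sketch of a single rating update after one vote (the arena's actual aggregation is more involved, but the intuition is the same):

```python
def update_elo(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0) -> tuple[float, float]:
    """One Elo update after a single A-vs-B vote."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

print(update_elo(1000, 1000, a_wins=True))  # -> (1016.0, 984.0)
```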
31
+ ### Pros and cons of human evaluation
32
 
33
  Human evaluation is very interesting for the following reasons:
34
  - **Flexibility**: If you define clearly enough what you are evaluating, you can get scores for about anything!
app/src/content/chapters/human-evaluation/tips-and-tricks.mdx CHANGED
@@ -2,10 +2,10 @@
2
  title: "Human Evaluation: Tips and tricks"
3
  ---
4
 
5
- # Tips and tricks
6
 Here are a few practical tips you might want to consider when using human annotators to build an evaluation dataset. If you haven't done so yet, we recommend first reading the page on "Using human annotators" and then coming back to this page.
7
 
8
- ## Designing the task
9
 
10
  - **Simple is better**: Annotation tasks can get unnecessarily complex, so keep it as simple as possible. Keeping the cognitive load of the annotators to a minimum will help you ensure that they stay focused and make annotations of a higher quality.
11
 
@@ -15,13 +15,13 @@ Here are a few practical tips you might want consider when using human annotator
15
 
16
  - **Test the setup**: Once you have your task designed and some guidelines in place, make sure you test it yourself on a few samples before involving the whole team, and iterate as needed.
17
 
18
- ## During the annotation
19
 
20
  - **Annotators should work independently**: It's better if annotators don't help each other or see each other's work during the task, as they can propagate their own biases and cause annotation drift. Alignment should always happen through comprehensive guidelines. You may want to train any new team members first on a separate dataset and/or use inter-annotator agreement metrics to make sure the team is aligned.
21
 
22
  - **Consistency is key**: If you make important changes to your guidelines (e.g., changed a definition or instruction, or have added/removed labels), consider if you need to iterate over the annotated data. At least, you should track the changes in your dataset through a metadata value like `guidelines-v1`.
23
 
24
- ## Hybrid human-machine annotation
25
 
26
 Sometimes teams face constraints on time and resources but don't want to sacrifice the pros of human evaluation. In these cases, you may use the help of models to make the task more efficient.
27
 
@@ -31,6 +31,6 @@ Sometimes teams face contraints on time and resources but don't want to sacrific
31
 
32
 - **Identify edge cases**: For an even faster task, use a jury of models and then have your human supervisor(s) step in where models disagree or there's a tie to break. Again, be aware of the biases discussed in the "Pros and cons of human evaluation".
33
 
34
- ## End to end tutorial
35
 
36
 To build your own custom evaluation setup following these tips, you can follow this [practical tutorial](https://github.com/argilla-io/argilla-cookbook/tree/main/domain-eval) from Argilla. It guides you through building a custom evaluation task for your domain, using synthetic data and manual evaluation with [Argilla](https://github.com/argilla-io/argilla/) and [distilabel](https://github.com/argilla-io/distilabel). The guide starts from domain documents and results in a custom evaluation task that you can use to evaluate your model with [lighteval](https://github.com/huggingface/lighteval).
 
2
  title: "Human Evaluation: Tips and tricks"
3
  ---
4
 
5
+ ### Tips and tricks
6
 Here are a few practical tips you might want to consider when using human annotators to build an evaluation dataset. If you haven't done so yet, we recommend first reading the page on "Using human annotators" and then coming back to this page.
7
 
8
+ ### Designing the task
9
 
10
  - **Simple is better**: Annotation tasks can get unnecessarily complex, so keep it as simple as possible. Keeping the cognitive load of the annotators to a minimum will help you ensure that they stay focused and make annotations of a higher quality.
11
 
 
15
 
16
  - **Test the setup**: Once you have your task designed and some guidelines in place, make sure you test it yourself on a few samples before involving the whole team, and iterate as needed.
17
 
18
+ ### During the annotation
19
 
20
  - **Annotators should work independently**: It's better if annotators don't help each other or see each other's work during the task, as they can propagate their own biases and cause annotation drift. Alignment should always happen through comprehensive guidelines. You may want to train any new team members first on a separate dataset and/or use inter-annotator agreement metrics to make sure the team is aligned.
21
 
22
  - **Consistency is key**: If you make important changes to your guidelines (e.g., changed a definition or instruction, or have added/removed labels), consider if you need to iterate over the annotated data. At least, you should track the changes in your dataset through a metadata value like `guidelines-v1`.
23
 
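To make the inter-annotator agreement point above concrete, here is a minimal sketch using scikit-learn's Cohen's kappa; the labels and annotations below are made up for illustration.

```python
# Minimal inter-annotator agreement check with Cohen's kappa (scikit-learn).
# The label set and annotations are made up for illustration.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["helpful", "harmful", "helpful", "neutral", "helpful"]
annotator_b = ["helpful", "neutral", "helpful", "neutral", "harmful"]

# Values above ~0.8 are usually read as strong agreement
print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")
```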
24
+ ### Hybrid human-machine annotation
25
 
26
  Sometimes teams face constraints on time and resources but don't want to sacrifice the benefits of human evaluation. In these cases, you can use models to make the task more efficient.
27
 
 
31
 
32
  - **Identify edge cases**: For an even faster task, use a jury of models and then have your human supervisor(s) step in where models disagree or there's a tie to break. Again, be aware of the biases discussed in the "Pros and cons of human evaluation".
33
 
34
+ ### End to end tutorial
35
 
36
  To build your own custom evaluation setup following these tips, you can follow this [practical tutorial](https://github.com/argilla-io/argilla-cookbook/tree/main/domain-eval) from Argilla. It guides you through building a custom evaluation task for your domain, using synthetic data and manual evaluation with [Argilla](https://github.com/argilla-io/argilla/) and [distilabel](https://github.com/argilla-io/distilabel). The guide starts from domain documents and results in a custom evaluation task that you can use to evaluate your model with [lighteval](https://github.com/huggingface/lighteval).
app/src/content/chapters/human-evaluation/using-human-annotators.mdx CHANGED
@@ -5,7 +5,7 @@ title: "Using human annotators"
5
  import bestAnnotationPractices from '../../assets/image/best_annotation_practices.png';
6
  import Image from '../../../components/Image.astro';
7
 
8
- # Using human annotators
9
 
10
  I suggest reading Section 3 of this [review](https://aclanthology.org/2024.cl-3.1/) of good practices in data annotation quality. If you want production level quality and have the means to implement all of these methods, go ahead!
11
 
 
5
  import bestAnnotationPractices from '../../assets/image/best_annotation_practices.png';
6
  import Image from '../../../components/Image.astro';
7
 
8
+ ### Using human annotators
9
 
10
  I suggest reading Section 3 of this [review](https://aclanthology.org/2024.cl-3.1/) of good practices in data annotation quality. If you want production level quality and have the means to implement all of these methods, go ahead!
11
 
app/src/content/chapters/model-as-a-judge/basics.mdx CHANGED
@@ -2,9 +2,9 @@
2
  title: "Model as a Judge: Basics"
3
  ---
4
 
5
- # Basics
6
 
7
- ## What is a judge model evaluation?
8
  Judge models are simply **neural networks used to evaluate the output of other neural networks**. In most cases, they evaluate text generations.
9
 
10
  Judge models range from small specialized classifiers (think "spam filter", but for toxicity for example) to LLMs, either large and generalist or small and specialized. In the latter case, when using an LLM as a judge, you give it a prompt to explain how to score models (ex: `Score the fluency from 0 to 5, 0 being completely un-understandable, ...`).
@@ -21,7 +21,7 @@ They are used on 3 main tasks:
21
 
22
  *Note: In this document, I'll focus on the LLMs + prompt approach for now, but you should definitely check out how classifier judges work, as I think it can be fairly robust and well adapted to a number of use cases, and the recently introduced and promising reward model as judge approach (introduced in [this tech report](https://research.nvidia.com/publication/2024-06_nemotron-4-340b), and on which we have a small page [here](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/model-as-a-judge/what-about-reward-models.md))*
23
 
24
- ## Pros and cons of using judge-LLMs
25
  Judge LLMs have been used for the following reasons:
26
  - **Objectivity** when compared to humans: They automate empirical judgments in an objective and reproducible manner
27
  - **Scale and reproducibility**: They are more scalable than human annotators, which allows you to reproduce scoring on large amounts of data.
@@ -33,6 +33,6 @@ There are also downside to all of these:
33
  - They are indeed scalable, but contribute to creating massive amounts of data which themselves need to be examined to ensure their quality (for example, you can improve the quality of LLM-judges by asking them to generate a thinking trace, or reasoning around their data, which makes even more new artificial data to analyse)
34
  - They are indeed cheap to instantiate, but paying actual expert human annotators is likely to give you qualitatively better results for your specific use cases.
35
 
36
- ## How to start?
37
  - If you want to give it a go, I suggest first reading this [very good guide](https://huggingface.co/learn/cookbook/en/llm_judge) (⭐) by Aymeric Roucher on how to set up your first LLM as judge!
38
  You can also try the [distilabel](https://distilabel.argilla.io/latest/) library, which allows you to generate synthetic data and update it using LLMs. They have a nice [tutorial](https://distilabel.argilla.io/latest/sections/pipeline_samples/papers/ultrafeedback/) applying the methodology of the [Ultrafeedback paper](https://arxiv.org/abs/2310.01377) as well as a [tutorial on benchmarking](https://distilabel.argilla.io/latest/sections/pipeline_samples/examples/benchmarking_with_distilabel/) implementing the Arena Hard benchmark.
 
2
  title: "Model as a Judge: Basics"
3
  ---
4
 
5
+ ### Basics
6
 
7
+ ### What is a judge model evaluation?
8
  Judge models are simply **neural networks used to evaluate the output of other neural networks**. In most cases, they evaluate text generations.
9
 
10
  Judge models range from small specialized classifiers (think "spam filter", but for toxicity for example) to LLMs, either large and generalist or small and specialized. In the latter case, when using an LLM as a judge, you give it a prompt to explain how to score models (ex: `Score the fluency from 0 to 5, 0 being completely un-understandable, ...`).
 
21
 
22
  *Note: In this document, I'll focus on the LLMs + prompt approach for now, but you should definitely check out how classifier judges work, as I think it can be fairly robust and well adapted to a number of use cases, and the recently introduced and promising reward model as judge approach (introduced in [this tech report](https://research.nvidia.com/publication/2024-06_nemotron-4-340b), and on which we have a small page [here](https://github.com/huggingface/evaluation-guidebook/blob/main/contents/model-as-a-judge/what-about-reward-models.md))*
23
 
24
+ ### Pros and cons of using judge-LLMs
25
  Judge LLMs have been used for the following reasons:
26
  - **Objectivity** when compared to humans: They automate empirical judgments in an objective and reproducible manner
27
  - **Scale and reproducibility**: They are more scalable than human annotators, which allows you to reproduce scoring on large amounts of data.
 
33
  - They are indeed scalable, but contribute to creating massive amounts of data which themselves need to be examined to ensure their quality (for example, you can improve the quality of LLM-judges by asking them to generate a thinking trace, or reasoning around their data, which makes even more new artificial data to analyse)
34
  - They are indeed cheap to instantiate, but paying actual expert human annotators is likely to give you qualitatively better results for your specific use cases.
35
 
36
+ ### How to start?
37
  - If you want to give it a go, I suggest first reading this [very good guide](https://huggingface.co/learn/cookbook/en/llm_judge) (⭐) by Aymeric Roucher on how to set up your first LLM as judge!
38
  You can also try the [distilabel](https://distilabel.argilla.io/latest/) library, which allows you to generate synthetic data and update it using LLMs. They have a nice [tutorial](https://distilabel.argilla.io/latest/sections/pipeline_samples/papers/ultrafeedback/) applying the methodology of the [Ultrafeedback paper](https://arxiv.org/abs/2310.01377) as well as a [tutorial on benchmarking](https://distilabel.argilla.io/latest/sections/pipeline_samples/examples/benchmarking_with_distilabel/) implementing the Arena Hard benchmark.
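If you prefer to see the LLM + prompt approach in a few lines of code, here is a minimal sketch; the model id, the 0-to-5 fluency scale, and the naive score parsing are all illustrative choices, not a fixed recipe.

```python
# Minimal LLM-as-judge sketch. The model id, scoring scale and score parsing
# below are illustrative assumptions, not a recommended setup.
import re
from huggingface_hub import InferenceClient

client = InferenceClient(model="meta-llama/Llama-3.1-70B-Instruct")  # any capable chat model

answer_to_grade = "The Eiffel tower are located in Paris since 1889."
prompt = (
    "Score the fluency of the following answer from 0 to 5, "
    "0 being completely un-understandable and 5 being perfectly fluent.\n"
    f"Answer: {answer_to_grade}\n"
    "Reply with the score only."
)

response = client.chat_completion(
    messages=[{"role": "user", "content": prompt}], max_tokens=10
)
score = int(re.search(r"\d", response.choices[0].message.content).group())  # naive parsing
print(score)
```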
app/src/content/chapters/model-as-a-judge/designing-your-evaluation-prompt.mdx CHANGED
@@ -2,9 +2,9 @@
2
  title: "Designing your evaluation prompt"
3
  ---
4
 
5
- # Designing your evaluation prompt
6
 
7
- ## General prompt design tips
8
  Some general guidelines I've come across online when designing the prompt itself are:
9
  - Provide a clear description of the task at hand:
10
  - `Your task is to do X`.
@@ -24,7 +24,7 @@ Other tidbits:
24
  - If you really want a score, use an integer scale and make sure you provide a detailed explanation of what [each score represents](https://x.com/seungonekim/status/1749289437165769177), or use an additive prompt (`provide 1 point for this characteristic of the answer, 1 additional point if ...` etc)
25
  - Using one prompt per capability to score tends to give better and more robust results
26
 
27
- ## Improving judgment accuracy
28
  You can also improve accuracy using the following, possibly more costly, techniques:
29
  - **Few shot examples**: like in many other tasks, if you provide examples it can help its reasoning. However, this adds to your context length.
30
  - **Reference**: you can also enhance your prompt with a reference if present, which increases accuracy
 
2
  title: "Designing your evaluation prompt"
3
  ---
4
 
5
+ ### Designing your evaluation prompt
6
 
7
+ ### General prompt design tips
8
  Some general guidelines I've come across online when designing the prompt itself are:
9
  - Provide a clear description of the task at hand:
10
  - `Your task is to do X`.
 
24
  - If you really want a score, use an integer scale and make sure you provide a detailed explanation of what [each score represents](https://x.com/seungonekim/status/1749289437165769177), or use an additive prompt (`provide 1 point for this characteristic of the answer, 1 additional point if ...` etc)
25
  - Using one prompt per capability to score tends to give better and more robust results
26
 
27
+ ### Improving judgment accuracy
28
  You can also improve accuracy using the following, possibly more costly, techniques:
29
  - **Few shot examples**: like in many other tasks, if you provide examples it can help its reasoning. However, this adds to your context length.
30
  - **Reference**: you can also enhance your prompt with a reference if present, which increases accuracy
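As an illustration of the integer-scale and additive-prompt advice above, here is a sketch of a grading prompt; the criteria and the output format are placeholders to adapt to your own task.

```python
# Sketch of an additive grading prompt. The criteria and the "Score:" output
# format are placeholders to adapt to your task.
ADDITIVE_JUDGE_PROMPT = """Your task is to evaluate an answer to a user question.

Add 1 point if the answer is relevant to the question.
Add 1 additional point if the answer is factually correct.
Add 1 additional point if the answer is clear and concise.

Question: {question}
Answer: {answer}

First write a short justification, then give the total score (0 to 3) on the last line
in the format "Score: <integer>".
"""

print(ADDITIVE_JUDGE_PROMPT.format(question="What is the capital of France?", answer="Paris."))
```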
app/src/content/chapters/model-as-a-judge/evaluating-your-evaluator.mdx CHANGED
@@ -2,7 +2,7 @@
2
  title: "Evaluating your evaluator"
3
  ---
4
 
5
- # Evaluating your evaluator
6
 
7
  Before using a judge-LLM in production or at scale, you want to first evaluate its quality for your task, to make sure its scores are actually relevant and useful for you.
8
 
@@ -10,12 +10,12 @@ Note: *This will be easier to do if it predicts binary outputs, because you'll b
10
 
11
  So, once you have selected your model judge and its prompt, you'll need to do the following.
12
 
13
- ## 1. Pick your baseline
14
  You'll need to compare your evaluator's judgments to a baseline: it can be human annotations, the output of another judge model that you know performs well on your task, a ground truth, the same judge with a different prompt, etc.
15
 
16
  You don't necessarily need a lot of examples (50 can be enough), but you need them to be extremely representative of your task, discriminative (notably representative of edge cases), and of as high a quality as you can manage.
17
 
18
- ## 2. Pick your metric
19
  Your metric will be used to compare your judge's evaluations with your reference.
20
 
21
  In general, this comparison is considerably easier to do if your model is predicting binary classes or doing pairwise comparison, as you'll be able to compute accuracy (for pairwise comparison), or precision and recall (for binary classes), which are all very easy to interpret metrics.
@@ -24,7 +24,7 @@ Comparing the correlation of scores with human or model scoring will be harder t
24
 
25
  In general, if you're a bit lost about what to pick when (in terms of models, metrics, ...), you can also look at [this interesting graph](https://eugeneyan.com/assets/llm-eval-tree.jpg) from [the same above blog](https://eugeneyan.com/writing/llm-evaluators/) ⭐.
26
 
27
- ## 3. Evaluate your evaluator
28
  For this step, you simply need to use your model and its prompt to evaluate your test samples! Then, once you get the evaluations, use your above metric and reference to compute a score for your evaluations.
29
 
30
  You need to decide what your threshold for acceptance is. Depending on how hard your task is, you can aim for 80% to 95% accuracy if you're doing pairwise comparison. Regarding correlations (if you're using scores), people in the literature seem happy with a 0.8 Pearson correlation with a reference. However, I've seen some papers declare that 0.3 indicates a good correlation with human annotators (^^") so ymmv.
 
2
  title: "Evaluating your evaluator"
3
  ---
4
 
5
+ ### Evaluating your evaluator
6
 
7
  Before using a judge-LLM in production or at scale, you want to first evaluate its quality for your task, to make sure its scores are actually relevant and useful for you.
8
 
 
10
 
11
  So, once you have selected your model judge and its prompt, you'll need to do the following.
12
 
13
+ ### 1. Pick your baseline
14
  You'll need to compare your evaluator's judgments to a baseline: it can be human annotations, the output of another judge model that you know performs well on your task, a ground truth, the same judge with a different prompt, etc.
15
 
16
  You don't necessarily need a lot of examples (50 can be enough), but you need them to be extremely representative of your task, discriminative (notably representative of edge cases), and of as high a quality as you can manage.
17
 
18
+ ### 2. Pick your metric
19
  Your metric will be used to compare your judge's evaluations with your reference.
20
 
21
  In general, this comparison is considerably easier to do if your model is predicting binary classes or doing pairwise comparison, as you'll be able to compute accuracy (for pairwise comparison), or precision and recall (for binary classes), which are all very easy to interpret metrics.
 
24
 
25
  In general, if you're a bit lost about what to pick when (in terms of models, metrics, ...), you can also look at [this interesting graph](https://eugeneyan.com/assets/llm-eval-tree.jpg) from [the same above blog](https://eugeneyan.com/writing/llm-evaluators/) ⭐.
26
 
27
+ ### 3. Evaluate your evaluator
28
  For this step, you simply need to use your model and its prompt to evaluate your test samples! Then, once you get the evaluations, use your above metric and reference to compute a score for your evaluations.
29
 
30
  You need to decide what your threshold for acceptance is. Depending on how hard your task is, you can aim for 80% to 95% accuracy if you're doing pairwise comparison. Regarding correlations (if you're using scores), people in the literature seem happy with a 0.8 Pearson correlation with a reference. However, I've seen some papers declare that 0.3 indicates a good correlation with human annotators (^^") so ymmv.
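Here is a minimal sketch of this comparison step for both settings, with made-up data standing in for your ~50 annotated test samples (pairwise accuracy only needs the standard library, the correlation uses scipy).

```python
# Compare judge outputs against your reference baseline (data below is made up).
from scipy.stats import pearsonr

# Pairwise comparison setting: accuracy against the reference preferences
reference_pref = ["A", "B", "A", "A", "B", "A"]
judge_pref     = ["A", "B", "B", "A", "B", "A"]
accuracy = sum(r == j for r, j in zip(reference_pref, judge_pref)) / len(reference_pref)

# Scoring setting: Pearson correlation with the reference scores
reference_scores = [5, 2, 4, 1, 3, 4]
judge_scores     = [4, 2, 5, 1, 3, 3]
correlation, _ = pearsonr(reference_scores, judge_scores)

print(f"Pairwise accuracy: {accuracy:.2f}, Pearson r: {correlation:.2f}")
```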
app/src/content/chapters/model-as-a-judge/getting-a-judge-llm.mdx CHANGED
@@ -2,11 +2,11 @@
2
  title: "Getting a Judge-LLM"
3
  ---
4
 
5
- # Getting a Judge-LLM
6
 
7
  When using an existing LLM, you can go for [generalist, high-capability models](https://arxiv.org/abs/2306.05685v4), use [small specialist models](https://arxiv.org/abs/2405.01535) trained specifically to discriminate from preference data, or train your own.
8
 
9
- ## Using a generalist LLM
10
 
11
  With the introduction of more capable LLMs (such as ChatGPT), some researchers started exploring using big models as judges. The best current big model judges tend to be closed source models (like Claude or gpt-o models), though the gap with open source is closing very fast thanks to high quality models such as [Qwen 2.5](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e), [Command R+](https://huggingface.co/CohereForAI/c4ai-command-r-plus-08-2024) or [Llama 3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct).
12
 
@@ -19,7 +19,7 @@ However, they also allow anyone to have access to a high quality model without n
19
 
20
  You'll find a good cost analysis of model providers [here](https://huggingface.co/spaces/ArtificialAnalysis/LLM-Performance-Leaderboard) if you need help picking one.
21
 
22
- ## Using a tiny specialized LLM judge model
23
 
24
  You can also choose to use tiny specialized LLM judges. At only a couple billion parameters, they can run locally on most recent consumer hardware, and are either trained from scratch or fine-tuned using instruction data. You often need to follow their specific prompt formats.
25
 
@@ -28,7 +28,7 @@ Some existing models:
28
  - Prometheus ([weights](https://huggingface.co/prometheus-eval/prometheus-13b-v1.0), [paper](https://arxiv.org/abs/2310.08491)), 13B parameters, a model trained from scratch on synthetic preference dataset. A 7B parameter [v2](https://huggingface.co/prometheus-eval/prometheus-7b-v2.0) also exists, a Mistral-7B-Instruct-v0.2 fine-tune on a bigger synthetic preference dataset, with added weight merging
29
  - JudgeLM ([paper](https://arxiv.org/abs/2310.17631)), 7B to 33B parameters, models trained from scratch on synthetic preference datasets generated with a variety of models.
30
 
31
- ## Training your own
32
  You can also make the choice to train or fine-tune your own LLM-as-judge.
33
 
34
  You first need to gather preference data for your task of interest, which can come
 
2
  title: "Getting a Judge-LLM"
3
  ---
4
 
5
+ ### Getting a Judge-LLM
6
 
7
  When using an existing LLM, you can go for [generalist, high-capability models](https://arxiv.org/abs/2306.05685v4), use [small specialist models](https://arxiv.org/abs/2405.01535) trained specifically to discriminate from preference data, or train your own.
8
 
9
+ ### Using a generalist LLM
10
 
11
  With the introduction of more capable LLMs (such as ChatGPT), some researchers started exploring using big models as judges. The best current big model judges tend to be closed source models (like Claude or gpt-o models), though the gap with open source is closing very fast thanks to high quality models such as [Qwen 2.5](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e), [Command R+](https://huggingface.co/CohereForAI/c4ai-command-r-plus-08-2024) or [Llama 3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct).
12
 
 
19
 
20
  You'll find a good cost analysis of model providers [here](https://huggingface.co/spaces/ArtificialAnalysis/LLM-Performance-Leaderboard) if you need help picking one.
21
 
22
+ ### Using a tiny specialized LLM judge model
23
 
24
  You can also choose to use tiny specialized LLM judges. At only a couple billion parameters, they can run locally on most recent consumer hardware, and are either trained from scratch or fine-tuned using instruction data. You often need to follow their specific prompt formats.
25
 
 
28
  - Prometheus ([weights](https://huggingface.co/prometheus-eval/prometheus-13b-v1.0), [paper](https://arxiv.org/abs/2310.08491)), 13B parameters, a model trained from scratch on synthetic preference dataset. A 7B parameter [v2](https://huggingface.co/prometheus-eval/prometheus-7b-v2.0) also exists, a Mistral-7B-Instruct-v0.2 fine-tune on a bigger synthetic preference dataset, with added weight merging
29
  - JudgeLM ([paper](https://arxiv.org/abs/2310.17631)), 7B to 33B parameters, models trained from scratch on synthetic preference datasets generated with a variety of models.
30
 
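As a sketch, a small judge like the ones above can be run locally with `transformers`; the exact prompt template is model-specific, so the placeholder below has to be replaced by the template documented on the judge's model card.

```python
# Running a small specialized judge locally (requires accelerate for device_map).
# The prompt placeholder must follow the judge's documented prompt format.
from transformers import pipeline

judge = pipeline(
    "text-generation",
    model="prometheus-eval/prometheus-7b-v2.0",
    device_map="auto",
)
judge_prompt = "..."  # build this from the model's documented prompt template
print(judge(judge_prompt, max_new_tokens=256, return_full_text=False)[0]["generated_text"])
```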
31
+ ### Training your own
32
  You can also make the choice to train or fine-tune your own LLM-as-judge.
33
 
34
  You first need to gather preference data for your task of interest, which can come
app/src/content/chapters/model-as-a-judge/tips-and-tricks.mdx CHANGED
@@ -2,9 +2,9 @@
2
  title: "Model as a Judge: Tips and tricks"
3
  ---
4
 
5
- # Tips and tricks
6
 
7
- ## Mitigating well known biases of LLM as judges:
8
 
9
  - **Lack of internal consistency**: a judge might give you different judgments if you prompt it several times (if the temperature is not 0)
10
  - You can mitigate this by doing self-consistency prompting of your judge, prompting it multiple times and keeping the majority output
@@ -25,7 +25,7 @@ title: "Model as a Judge: Tips and tricks"
25
  - **Format bias**: they tend to fail to evaluate accurately if the prompt format [is too far away](https://arxiv.org/abs/2310.17631) from what they've been trained with. For example, a model trained to do pairwise comparison with an added reference answer will fail if said answer is not provided, and failures will also occur the other way around.
26
  - You can mitigate this by paying attention to the training prompt format (if the model was instruction tuned) and ensuring you follow it.
27
 
28
- ## Picking correct tasks for an LLM judge
29
 
30
  LLM evaluators:
31
  - are **bad at identifying hallucinations** in general, particularly what are called partial hallucinations (which look close to the ground truth but are actually slightly different) (see [this](https://arxiv.org/abs/2305.11747) and [this](https://arxiv.org/abs/2303.08896))
 
2
  title: "Model as a Judge: Tips and tricks"
3
  ---
4
 
5
+ ### Tips and tricks
6
 
7
+ ### Mitigating well-known biases of LLMs as judges
8
 
9
  - **Lack of internal consistency**: a judge might give you different judgments if you prompt it several times (if the temperature is not 0)
10
  - You can mitigate this by doing self-consistency prompting of your judge, prompting it multiple times and keeping the majority output
 
25
  - **Format bias**: they tend to fail to evaluate accurately if the prompt format [is too far away](https://arxiv.org/abs/2310.17631) from what they've been trained with. For example, a model trained to do pairwise comparison with an added reference answer will fail if said answer is not provided, and failures will also occur the other way around.
26
  - You can mitigate this by paying attention to the training prompt format (if the model was instruction tuned) and ensuring you follow it.
27
 
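A minimal sketch of the self-consistency mitigation from the first bullet above; `ask_judge` is a placeholder for whatever call returns a single verdict from your judge.

```python
# Self-consistency: query the judge several times and keep the majority verdict.
# ask_judge is a placeholder for your own single-judgment call.
from collections import Counter

def self_consistent_verdict(prompt: str, ask_judge, n_samples: int = 5) -> str:
    verdicts = [ask_judge(prompt) for _ in range(n_samples)]  # sampling must be on (temperature > 0)
    return Counter(verdicts).most_common(1)[0][0]
```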
28
+ ### Picking correct tasks for an LLM judge
29
 
30
  LLM evaluators:
31
  - are **bad at identifying hallucinations** in general, particularly what are called partial hallucinations (which look close to the ground truth but are actually slightly different) (see [this](https://arxiv.org/abs/2305.11747) and [this](https://arxiv.org/abs/2303.08896))
app/src/content/chapters/model-as-a-judge/what-about-reward-models.mdx CHANGED
@@ -2,9 +2,9 @@
2
  title: "What about Reward Models?"
3
  ---
4
 
5
- # What about Reward Models?
6
 
7
- ## What is a Reward Model?
8
 
9
  Reward models learn to predict a score from human annotations for given prompt/completion pairs. The end goal is for them to make predictions aligned with human preferences.
10
  Once trained, these models can then be used to improve other models, by acting as a reward function which is a proxy for human judgment.
@@ -28,7 +28,7 @@ Some reward models such as [SteerLM](https://arxiv.org/abs/2311.09528) output ab
28
  More recently, models have been proposed that output both absolute and relative scores, such as [HelpSteer2-Preference](https://arxiv.org/abs/2410.01257) and [ArmoRM](https://arxiv.org/abs/2406.12845).
29
 
30
 
31
- ## How do I use a Reward Model for Evaluation?
32
 
33
  Given a dataset of prompts, we can generate completions from a language model and ask a reward model to score them.
34
 
@@ -40,7 +40,7 @@ Instead, we can use
40
  - win rates: take a reference set of completions and calculate the percentage of completions from the model that are ranked higher than the reference completions. It is slightly more granular.
41
  - win probabilities: the mean probability of the completions being better than the reference completions, which can give a more fine-grained and smoothly changing signal.
42
 
43
- ## Pros and Cons of Reward Models
44
 
45
  Reward models are typically:
46
  - **Very fast**: Getting a score is as simple as running a forward pass of a relatively small model once (since we only get a score, and not long text, contrary to judge-LLMs)
@@ -52,7 +52,7 @@ On the other hand they:
52
  - **Require specific fine-tuning**: This can be a relatively costly step, and although they inherit many capabilities from a base model, they may still perform poorly on tasks that are out of the training distribution.
53
  - **Lose efficiency when used both in reinforcement learning and evaluation** (or when using direct alignment algorithms on datasets that are similar to the training data of the reward model), as the language model may overfit to the reward model's preferences.
54
 
55
- ## Tips and Tricks for using Reward Models for Evaluation
56
 
57
  - A good place to find high performing models is the [RewardBench Leaderboard](https://huggingface.co/spaces/allenai/reward-bench).
58
  - You can look at how reward models have been used in the [Nemotron](https://arxiv.org/abs/2406.11704) paper.
 
2
  title: "What about Reward Models?"
3
  ---
4
 
5
+ ### What about Reward Models?
6
 
7
+ ### What is a Reward Model?
8
 
9
  Reward models learn to predict a score from human annotations for given prompt/completion pairs. The end goal is for them to make predictions aligned with human preferences.
10
  Once trained, these models can then be used to improve other models, by acting as a reward function which is a proxy for human judgment.
 
28
  More recently, models have been proposed that output both absolute and relative scores, such as [HelpSteer2-Preference](https://arxiv.org/abs/2410.01257) and [ArmoRM](https://arxiv.org/abs/2406.12845).
29
 
30
 
31
+ ### How do I use a Reward Model for Evaluation?
32
 
33
  Given a dataset of prompts, we can generate completions from a language model and ask a reward model to score them.
34
 
 
40
  - win rates: take a reference set of completions and calculate the percentage of completions from the model that are ranked higher than the reference completions. It is slightly more granular.
41
  - win probabilities: the mean probability of the completions being better than the reference completions, which can give a more fine-grained and smoothly changing signal.
42
 
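Here is a minimal sketch of both aggregations, assuming a Bradley-Terry style reward model where the probability of one completion beating another is the sigmoid of their score difference; the scores below are made up.

```python
# Win rate and win probability from reward-model scores (made-up values).
import math

model_scores     = [1.2, -0.3, 0.8, 2.1]  # reward scores of your model's completions
reference_scores = [0.9, 0.1, 1.5, 1.0]   # reward scores of the reference completions (same prompts)

win_rate = sum(m > r for m, r in zip(model_scores, reference_scores)) / len(model_scores)

# Bradley-Terry style win probability: sigmoid of the score difference
win_prob = sum(1 / (1 + math.exp(r - m)) for m, r in zip(model_scores, reference_scores)) / len(model_scores)

print(f"win rate: {win_rate:.2f}, win probability: {win_prob:.2f}")
```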
43
+ ### Pros and Cons of Reward Models
44
 
45
  Reward models are typically:
46
  - **Very fast**: Getting a score is as simple as running a forward pass of a relatively small model once (since we only get a score, and not long text, contrary to judge-LLMs)
 
52
  - **Require specific fine-tuning**: This can be a relatively costly step, and although they inherit many capabilities from a base model, they may still perform poorly on tasks that are out of the training distribution.
53
  - **Lose efficiency when used both in reinforcement learning and evaluation** (or when using direct alignment algorithms on datasets that are similar to the training data of the reward model), as the language model may overfit to the reward model's preferences.
54
 
55
+ ### Tips and Tricks for using Reward Models for Evaluation
56
 
57
  - A good place to find high performing models is the [RewardBench Leaderboard](https://huggingface.co/spaces/allenai/reward-bench).
58
  - You can look at how reward models have been used in the [Nemotron](https://arxiv.org/abs/2406.11704) paper.
app/src/content/chapters/troubleshooting/troubleshooting-inference.mdx CHANGED
@@ -2,9 +2,9 @@
2
  title: "Troubleshooting inference"
3
  ---
4
 
5
- # Troubleshooting inference
6
 
7
- ## My model is very slow!
8
  ### Changing the batch size
9
  If you want absolute reproducibility (given specific hardware and a specific evaluation prompt), you're probably using a batch size of one. However, moving to higher batch sizes will likely make your evaluation faster (provided it fits within the memory requirements of your hardware).
10
 
@@ -19,7 +19,7 @@ Not all inference libraries run at the same speed, and some code is more optimiz
19
  ### Changing the precision
20
  If your model is very slow, you can reduce its size by reducing the precision of the computations. A model stored in float32 does very precise computations (using 32 bits per number stored!) that are also very memory and compute heavy - moving to `bfloat16` or `float16` (half the precision) should make the model twice as fast, with a loss of precision that should barely matter. If you want further bumps in speed, you can quantize it even more, to 8 or 4 bits (using `gptq` or `bitsandbytes` for example), as n-bit matrix computations should be faster and your model will take even less space in memory (however, some quantization libraries might be a bit slow, so test things out for your use cases!).
21
 
22
- ## My model is very big!
23
  ### Estimating memory requirements
24
  You can estimate the minimal theoretical memory required to load a given model (and therefore hardware) with the **following formula**:
25
 
@@ -32,10 +32,10 @@ And that's it!
32
  I would actually recommend using `<memory (in GB)> = <number of parameters (in G)> * (<precision factor> * 110%)`, to be on the safer side, as inference will require a bit more memory than just loading the model (you'll also need to load the batches).
33
 
34
  ### What should you do if your model does not fit on a GPU?
35
- #### Quantization
36
  The first obvious thing is to play on the `<precision factor>` above: going from float32 to 4 bits reduces memory requirements by a factor of 8!
37
  However, using too low a precision can give worse results, so for some models (especially medium-range ones), you might want to stay in float16 or 8-bit. (Quantization seems to affect very big models' performance less, possibly because of information redundancy.)
38
- #### Model parallelism
39
  Model parallelism includes a range of techniques which cut your model into smaller pieces, so that each piece can be loaded and run on a different GPU. This requires less memory since you never load the full model at once, but can be slower.
40
 
41
  The 2 main types of model parallelism are
@@ -44,7 +44,7 @@ The 2 main types of model parallelism are
44
 
45
  The best document on the different kinds of parallelism (including data parallelism, for speedups) is [here](https://huggingface.co/docs/transformers/v4.15.0/en/parallelism).
46
 
47
- #### CPU offloading
48
  CPU offloading moves some of the computations and model parts to the CPU, in order to reduce GPU memory usage. It's **considerably slower** than any other method here, mostly because you need to move data from one device to another all the time.
49
 
50
  An example of this is [ZeRO-Offload](https://arxiv.org/abs/2101.06840) by Deepspeed, which distributes parameters between CPU and GPU (on top of using other optimizations described in the ZeRO-2 paper). Gradients, optimizer states and fp32 parameter updates are handled on the CPU during optimization, whereas the fp16 parameters and the forward/backward passes stay on the GPU, leveraging CPU memory while keeping the heavy computation on the GPU and minimizing communication between the two.
 
2
  title: "Troubleshooting inference"
3
  ---
4
 
5
+ ### Troubleshooting inference
6
 
7
+ ### My model is very slow!
8
  ### Changing the batch size
9
  If you want absolute reproducibility (given specific hardware and a specific evaluation prompt), you're probably using a batch size of one. However, moving to higher batch sizes will likely make your evaluation faster (provided it fits within the memory requirements of your hardware).
10
 
 
19
  ### Changing the precision
20
  If your model is very slow, you can reduce its size by reducing the precision of the computations. A model stored in float32 does very precise computations (using 32 bits per number stored!) that are also very memory and compute heavy - moving to `bfloat16` or `float16` (half the precision) should make the model twice as fast, with a loss of precision that should barely matter. If you want further bumps in speed, you can quantize it even more, to 8 or 4 bits (using `gptq` or `bitsandbytes` for example), as n-bit matrix computations should be faster and your model will take even less space in memory (however, some quantization libraries might be a bit slow, so test things out for your use cases!).
21
 
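For instance, with `transformers` you can load the weights directly in half precision (the model id below is just an example):

```python
# Loading a model in bfloat16 instead of float32 (halves memory, speeds up inference).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",   # example model id
    torch_dtype=torch.bfloat16,
)
```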
22
+ ### My model is very big!
23
  ### Estimating memory requirements
24
  You can estimate the minimal theoretical memory required to load a given model (and therefore hardware) with the **following formula**:
25
 
 
32
  I would actually recommend using `<memory (in GB)> = <number of parameters (in G)> * (<precision factor> * 110%)`, to be on the safer side, as inference will require a bit more memory than just loading the model (you'll also need to load the batches).
33
 
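As a sketch, here is the rule of thumb above in code, using the usual bytes-per-parameter values (4 for float32, 2 for (b)float16, 1 for 8-bit, 0.5 for 4-bit):

```python
# Rule-of-thumb memory estimate: parameters (in billions) * bytes per parameter * 110%.
def estimated_memory_gb(n_params_billions: float, precision_factor: float) -> float:
    return n_params_billions * precision_factor * 1.10

print(estimated_memory_gb(8, 2))     # ~17.6 GB for an 8B model in bfloat16
print(estimated_memory_gb(70, 0.5))  # ~38.5 GB for a 70B model quantized to 4 bits
```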
34
  ### What should you do if your model does not fit on a GPU?
35
+ ### Quantization
36
  The first obvious thing is to play on the `<precision factor>` above: going from float32 to 4 bits reduces memory requirements by a factor of 8!
37
  However, using too low a precision can give worse results, so for some models (especially medium-range ones), you might want to stay in float16 or 8-bit. (Quantization seems to affect very big models' performance less, possibly because of information redundancy.)
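A sketch of 4-bit loading with `transformers` and `bitsandbytes` (assuming a supported GPU; the model id is an example):

```python
# 4-bit quantized loading; requires the bitsandbytes package and a supported GPU.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",   # example model id
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
```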
38
+ ### Model parallelism
39
  Model parallelism includes a range of techniques which cut your model into smaller pieces, so that each piece can be loaded and run on a different GPU. This requires less memory since you never load the full model at once, but can be slower.
40
 
41
  The 2 main types of model parallelism are
 
44
 
45
  The best document on the different kinds of parallelism (including data parallelism, for speedups) is [here](https://huggingface.co/docs/transformers/v4.15.0/en/parallelism).
46
 
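A simple way to get a naive form of model parallelism with `transformers` and `accelerate` is `device_map="auto"`, which places successive layers on the available GPUs (sequential placement rather than true tensor or pipeline parallelism); the model id below is an example:

```python
# Naive layer splitting across all visible GPUs via accelerate's device_map.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-72B-Instruct",  # example of a model too big for a single GPU
    device_map="auto",            # requires accelerate; use max_memory to control the split
    torch_dtype="auto",
)
```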
47
+ ### CPU offloading
48
  CPU offloading moves some of the computations and model parts to the CPU, in order to reduce GPU memory usage. It's **considerably slower** than any other method here, mostly because you need to move data from one device to another all the time.
49
 
50
  An example of this is [ZeRO-Offload](https://arxiv.org/abs/2101.06840) by Deepspeed, which distributes parameters between CPU and GPU (on top of using other optimizations described in the ZeRO-2 paper). Gradients, optimizer states and fp32 parameter updates are handled on the CPU during optimization, whereas the fp16 parameters and the forward/backward passes stay on the GPU, leveraging CPU memory while keeping the heavy computation on the GPU and minimizing communication between the two.
app/src/content/chapters/troubleshooting/troubleshooting-math-parsing.mdx CHANGED
@@ -6,7 +6,7 @@ import sympyDoc from '../../assets/image/sympy_doc.png';
6
  import lmEvalDiff from '../../assets/image/lm_eval_diff.png';
7
  import Image from '../../../components/Image.astro';
8
 
9
- # Using LaTeX to evaluate MATH capabilities
10
 
11
  Parsing $\LaTeX$ is hard. This is an issue when evaluating a model that is expected to output $\LaTeX$, as is the case for the [MATH benchmark](https://huggingface.co/datasets/lighteval/MATH).
12
 
 
6
  import lmEvalDiff from '../../assets/image/lm_eval_diff.png';
7
  import Image from '../../../components/Image.astro';
8
 
9
+ ### Using LaTeX to evaluate MATH capabilities
10
 
11
  Parsing $\LaTeX$ is hard. This is an issue when evaluating a model that is expected to output $\LaTeX$, as is the case for the [MATH benchmark](https://huggingface.co/datasets/lighteval/MATH).
12
 
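A small sketch of why plain string matching is not enough: sympy can parse and compare the two forms below, but only with its LaTeX parser installed (which needs the antlr4 Python runtime), and parsing still fails on many real model outputs.

```python
# Comparing a prediction to a gold answer via sympy instead of string matching.
# parse_latex requires the antlr4-python3-runtime package.
from sympy import simplify
from sympy.parsing.latex import parse_latex

gold = parse_latex(r"\frac{1}{2}")
pred = parse_latex(r"0.5")

print(r"\frac{1}{2}" == "0.5")     # False: naive string comparison fails
print(simplify(gold - pred) == 0)  # True: the two answers are mathematically equal
```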
app/src/content/chapters/troubleshooting/troubleshooting-reproducibility.mdx CHANGED
@@ -2,12 +2,12 @@
2
  title: "Troubleshooting reproducibility"
3
  ---
4
 
5
- # Troubleshooting reproducibility
6
 
7
  Let's say you have read a recent tech report about a cool new model, and you want to reproduce their results on your machine... but you're not managing to?
8
  Let's explore why.
9
 
10
- ## Different code base
11
  To reproduce evaluation scores to the decimal point, you first need to make sure you're using exactly the same code base as the paper you want to reproduce.
12
 
13
  Usually, this means either using the default evaluation code as provided by the authors, or a standard implementation in a reference library like EleutherAI's `lm_eval` or Hugging Face's `lighteval`. However, if the evaluation source code is not provided, then I'm sorry for you, but it's unlikely that you'll be able to reproduce the results precisely.
@@ -32,7 +32,7 @@ We've observed that the following were easy things to mess up, even when using t
32
  (The `lm_eval` v2 now includes the normalization name in most metric names.)
33
  -> This is one of the easiest things to mess up, especially for tasks which require a lot of normalization/answer post processing, like math evaluations (where you want to extract the answer from a generated explanation).
34
 
35
- ## Different prompt
36
  3 main things can come into play for prompt variation.
37
  ### Prompt itself
38
  The format you are using for the prompt can and will change scores wildly.
@@ -76,13 +76,13 @@ However, you also need to use the **exact same samples** as the model you are co
76
 
77
  This is also a place where paying attention to the random seeds is important.
78
 
79
- ## Different generation parameters
80
  For generative evaluations, parameters to pay attention to are:
81
  - making sure you are using the **same end of sentence token**
82
  - making sure you are allowing your model to **generate the same number of tokens** for the evaluation
83
  - making sure, if using sampling, that you are using the **same seed/temperature parameters**
84
 
85
- ## Different model loading
86
  Some sources of differences that we have observed are:
87
  - using **different hardware**.
88
  PyTorch does not guarantee the reproducibility of non-deterministic operations across hardware
 
2
  title: "Troubleshooting reproducibility"
3
  ---
4
 
5
+ ### Troubleshooting reproducibility
6
 
7
  Let's say you have read a recent tech report about a cool new model, and you want to reproduce their results on your machine... but you're not managing to?
8
  Let's explore why.
9
 
10
+ ### Different code base
11
  To reproduce evaluation scores to the decimal point, you first need to make sure you're using exactly the same code base as the paper you want to reproduce.
12
 
13
  Usually, this means either using the default evaluation code as provided by the authors, or a standard implementation in a reference library like EleutherAI's `lm_eval` or Hugging Face's `lighteval`. However, if the evaluation source code is not provided, then I'm sorry for you, but it's unlikely that you'll be able to reproduce the results precisely.
 
32
  (The `lm_eval` v2 now includes the normalization name in most metric names.)
33
  -> This is one of the easiest things to mess up, especially for tasks which require a lot of normalization/answer post processing, like math evaluations (where you want to extract the answer from a generated explanation).
34
 
35
+ ### Different prompt
36
  3 main things can come into play for prompt variation.
37
  ### Prompt itself
38
  The format you are using for the prompt can and will change scores wildly.
 
76
 
77
  This is also a place where paying attention to the random seeds is important.
78
 
79
+ ### Different generation parameters
80
  For generative evaluations, parameters to pay attention to are:
81
  - making sure you are using the **same end of sentence token**
82
  - making sure you are allowing your model to **generate the same number of tokens** for the evaluation
83
  - making sure, if using sampling, that you are using the **same seed/temperature parameters**
84
 
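A sketch of pinning these parameters with `transformers` (the model id and values are illustrative; match them to the setup you're trying to reproduce):

```python
# Pinning the generation parameters that matter for reproducibility.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(1234)  # same seed

model_id = "Qwen/Qwen2.5-7B-Instruct"  # example model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer("Question: What is 2 + 2?\nAnswer:", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,                   # same number of generated tokens
    do_sample=True,
    temperature=0.7,                      # same sampling parameters
    eos_token_id=tokenizer.eos_token_id,  # same end of sentence token
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```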
85
+ ### Different model loading
86
  Some sources of differences that we have observed are:
87
  - using **different hardware**.
88
  PyTorch does not guarantee the reproducibility of non-deterministic operations across hardware