Update app/src/content/chapters/intro.mdx

app/src/content/chapters/intro.mdx

## What is model evaluation about?

As you navigate the world of LLMs — whether you're training or fine-tuning your own models, selecting one for your application, or trying to understand the state of the field — there is one question you have likely stumbled upon:

<Quote>
How can one know if a model is *good*?
</Quote>

The answer is (surprisingly given the blog topic) evaluation! It's everywhere: leaderboards ranking models, benchmarks claiming to measure *reasoning*, *knowledge*, *coding abilities* or *math performance*, papers announcing new state-of-the-art results...

But what is it, really? And what can it really tell you?

This guide is here to help you understand evaluation: what it can and cannot do, when to trust different approaches (and what their limitations and biases are!), how to select benchmarks when evaluating a model (and which ones are relevant in 2025), and how to design your own evaluation if you want to.

Before we dive into the details, let's quickly look at why people do evaluation, as who you are and what you are working on will determine which evaluations you need to use.

### The model builder perspective: Am I building a strong model?

If you are a researcher or engineer creating a new model, your goal is likely to build a strong model that performs well on a set of tasks. For a base model (trained from scratch), you want the model to do well on a broad range of general tasks, measuring a variety of different capabilities. If you are post-training a base model for a specific use case, you probably care more about the performance on that specific task. The way you measure performance, in either case, is through evaluations.

As you experiment with different architectures, data mixtures, and training recipes, you want to make sure that your changes (a different data mixture, architecture, set of hyperparameters, etc.) have not "broken" the expected performance for a model of these properties, and have possibly even improved it. The way you test the impact of different design choices is through **ablations**: an ablation is an experiment where you typically train a model under a specific setup, evaluate it on your chosen set of tasks, and compare the results to a baseline model.
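
To make this concrete, here is a minimal sketch of what an ablation loop can look like. The helpers `train_model` and `evaluate_on` are hypothetical placeholders for your own training and evaluation code, and the configuration keys and task list are purely illustrative.

```python
# Minimal ablation sketch: `train_model` and `evaluate_on` are hypothetical
# placeholders standing in for your own training and evaluation code.
EVAL_SUITE = ["hellaswag", "mmlu", "gsm8k"]  # example cheap-ish, high-signal tasks

def train_model(config: dict):
    """Placeholder: train a small proxy model with the given config."""
    raise NotImplementedError

def evaluate_on(model, tasks: list[str]) -> dict[str, float]:
    """Placeholder: return one score per task for the given model."""
    raise NotImplementedError

def run_ablation(baseline_cfg: dict, variant_cfg: dict) -> dict[str, float]:
    """Train baseline and variant under the same budget, return per-task score deltas."""
    baseline_scores = evaluate_on(train_model(baseline_cfg), EVAL_SUITE)
    variant_scores = evaluate_on(train_model(variant_cfg), EVAL_SUITE)
    return {task: variant_scores[task] - baseline_scores[task] for task in EVAL_SUITE}

# Ablate a single change (here, the data mixture), keeping everything else fixed.
baseline = {"data_mix": "web", "lr": 3e-4, "n_layers": 12}
variant = {**baseline, "data_mix": "web_plus_code"}
# deltas = run_ablation(baseline, variant)
```

The important part is the discipline around it: only one variable changes between the baseline and the variant, both are trained under the same budget, and both are scored on the exact same evaluation suite.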

Therefore, the choice of evaluation tasks is critical for ablations, as they determine what you will be optimizing for as you create your model.

For base models, one would typically resort to selecting standard benchmark tasks used by other model builders (think the classic list of benchmarks that are always reported when a new model is released). For a specific use case, you can either use existing evaluation tasks if they are available -- and you will likely want to take a good look at them if they are not "standard" -- or design your own (discussed below). As you will likely run a lot of ablations, you want the evaluation tasks to provide strong enough signal (and not just meaningless noisy results), and you want them to run cheaply and quickly, so that you can iterate fast.

Through ablations, we are also able to predict the performance of bigger models based on the performance of smaller ones, using scaling laws.
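
As a rough illustration of the kind of functional form involved (the exact parametrization varies between papers), the Chinchilla paper ([Hoffmann et al., 2022](https://arxiv.org/abs/2203.15556)) models the final loss of a model with N parameters trained on D tokens as:

$$
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
$$

The constants E, A, B, α and β are fitted on the results of many small training runs; once fitted, the formula lets you extrapolate the expected loss of larger runs (note that this predicts the loss, not downstream task scores).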

Besides ablations for experiments, you will likely also want to run evaluations on intermediate checkpoints as your model is training, to ensure it is properly learning and improving at the different tasks, and does not start regressing due to spikes or other issues. Finally, you want to evaluate the final checkpoint so that you can announce that your model is SOTA when you release it.

<Note title="Small caveat">
Despite often grandiose claims, for any complex capability, we cannot at the moment just say "this model is the best at this", but should instead say **"this model is the best on these samples for this specific task that we hope are a good proxy for this capability, without any guarantee"**.
</Note>

(You can still claim you are SOTA, just keep the caveat in mind.)

### The model user perspective: Which model is the best on \<task\>?

You want to use a model someone else trained for your specific use case, without performing additional training, or maybe you will perform additional training and are looking for the best existing model to use as a base.

For common topics like math, code, or knowledge, there are likely several leaderboards comparing and ranking models using different datasets, and you usually just have to test the top contenders to find the best model for you (if they are not working for you, it's unlikely the next best models will work).

You might want to run the evaluations and comparisons yourself (by reusing existing benchmarks) to get more detail on a model's successes and failures, which we will cover below.
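
If you go down that route, a sketch of what this can look like with the EleutherAI [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) Python API is shown below -- the model names and task list are just examples, and you should check the library's documentation for the exact arguments it currently supports (other harnesses, such as lighteval, work similarly).

```python
import lm_eval  # EleutherAI lm-evaluation-harness

# Compare a few top contenders on the same tasks, with the same few-shot setup.
for model_name in ["mistralai/Mistral-7B-v0.1", "meta-llama/Llama-2-7b-hf"]:
    results = lm_eval.simple_evaluate(
        model="hf",                            # Hugging Face transformers backend
        model_args=f"pretrained={model_name}",
        tasks=["hellaswag", "arc_challenge", "gsm8k"],
        num_fewshot=5,
        batch_size=8,
    )
    print(model_name, results["results"])      # per-task metrics for this model
```

Running every candidate through the same tasks, prompts and few-shot setup is what makes the comparison meaningful: scores copied from different papers or leaderboards are often not computed under identical settings.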

<Sidenote>
In [their paper](https://arxiv.org/pdf/2404.02112) about lessons learned on benchmarking and dataset design from the ImageNet era, the authors argue that, since scores are susceptible to instability, the only robust way to evaluate models is through rankings, and more specifically by finding broad groups of evaluations which provide consistent and stable rankings. I believe looking for ranking stability is indeed an extremely interesting approach to model benchmarking, as we have shown that LLM *scores* on automated benchmarks are extremely susceptible to [minute changes in prompting](https://huggingface.co/blog/evaluation-structured-outputs), and that human evaluations are not more consistent, whereas *rankings* are actually more stable when using robust evaluation methods.
</Sidenote>

Similarly to model builders hill-climbing on a specific capability, for less common topics you might need to think about designing your own evaluations, which is detailed in our last section.

<Note title="What about measuring AGI?">
We are sorely missing good definitions and frameworks for what intelligence is in machine learning models, and how to evaluate it (though some people have tried, for example [Chollet](https://arxiv.org/abs/1911.01547) in 2019 and [Hendrycks et al](https://www.agidefinition.ai/paper.pdf) this year). Difficulty in defining intelligence is not a problem specific to machine learning! In human and animal studies, it is also quite hard to define, and metrics which try to provide precise scores (IQ and EQ for example) are hotly debated and controversial, with reason.