Clémentine
commited on
Commit
·
a1e35e5
1
Parent(s):
3d13ae7
editing text
Browse files- app/src/content/article.mdx +19 -8
- app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx +32 -0
- app/src/content/chapters/general-knowledge/model-inference-and-evaluation.mdx +57 -103
- app/src/content/chapters/general-knowledge/picking-your-evaluation.mdx +3 -3
- app/src/content/chapters/intro.mdx +25 -32
- app/src/content/chapters/troubleshooting/troubleshooting-inference.mdx +0 -81
- app/src/content/chapters/troubleshooting/troubleshooting-reproducibility.mdx +0 -2
- app/src/content/embeds/d3-tokenization-timeline.html +266 -0
- app/src/content/embeds/d3-tokenization.html +168 -0
app/src/content/article.mdx
CHANGED
|
@@ -24,7 +24,6 @@ import Intro from "./chapters/intro.mdx";
|
|
| 24 |
import DesigningAutomaticEvaluation from "./chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx";
|
| 25 |
import PickingYourEval from "./chapters/general-knowledge/picking-your-evaluation.mdx";
|
| 26 |
import EvalsIn2025 from "./chapters/general-knowledge/2025-evaluations-for-useful-models.mdx"
|
| 27 |
-
import TroubleshootingInference from "./chapters/troubleshooting/troubleshooting-inference.mdx";
|
| 28 |
import TroubleshootingReproducibility from "./chapters/troubleshooting/troubleshooting-reproducibility.mdx";
|
| 29 |
import ModelInferenceAndEvaluation from "./chapters/general-knowledge/model-inference-and-evaluation.mdx";
|
| 30 |
|
|
@@ -32,19 +31,19 @@ import ModelInferenceAndEvaluation from "./chapters/general-knowledge/model-infe
|
|
| 32 |
|
| 33 |
## LLM basics to understand evaluation
|
| 34 |
|
| 35 |
-
Now that you have
|
|
|
|
| 36 |
|
| 37 |
<ModelInferenceAndEvaluation />
|
| 38 |
|
| 39 |
## Evaluating with existing benchmarks
|
| 40 |
|
|
|
|
|
|
|
| 41 |
### Benchmarks to know in 2025
|
| 42 |
|
| 43 |
<EvalsIn2025 />
|
| 44 |
|
| 45 |
-
### Selecting good benchmarks automatically for model training
|
| 46 |
-
|
| 47 |
-
<PickingYourEval />
|
| 48 |
|
| 49 |
### Understanding what's in there
|
| 50 |
|
|
@@ -66,11 +65,16 @@ In other words, is your dataset consistent?
|
|
| 66 |
#### Samples inspection
|
| 67 |
Take 50 random samples and manually inspect them; and I mean do it yourself, not "prompt an LLM to find unusual stuff in the data for you".
|
| 68 |
|
| 69 |
-
First, you want to check the content quality.
|
| 70 |
|
| 71 |
Then, you want to check for relevance to your task. Are these questions the kind of questions you want to evaluate an LLM on? Are these examples relevant to your use case?
|
| 72 |
|
| 73 |
-
You might also want to check the samples consistency (especially if you're planning on using few shots or computing aggregated statistics): do all samples have the same number of choices if it's a multiple choice evaluation? Is the spacing consistent before and after the prompt? If your evaluation comes with an additional environment, ideally you want to use it to understand
|
| 74 |
|
| 75 |
Lastly, you also want to quickly check how many samples are present there (to make sure results are statistically significant - 100 samples is usually a minimum for automatic benchmarks).
|
| 76 |
|
|
@@ -84,16 +88,23 @@ You want to check what metrics are used: are they automatic, functional, or usin
|
|
| 84 |
|
| 85 |
<TroubleshootingReproducibility />
|
| 86 |
|
|
|
|
|
|
|
|
|
|
| 87 |
|
| 88 |
|
| 89 |
|
| 90 |
## Creating your own evaluation
|
| 91 |
|
| 92 |
<DesigningAutomaticEvaluation />
|
| 93 |
|
|
|
|
| 94 |
|
| 95 |
|
| 96 |
-
<TroubleshootingInference />
|
| 97 |
|
| 98 |
|
| 99 |
|
|
|
|
| 24 |
import DesigningAutomaticEvaluation from "./chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx";
|
| 25 |
import PickingYourEval from "./chapters/general-knowledge/picking-your-evaluation.mdx";
|
| 26 |
import EvalsIn2025 from "./chapters/general-knowledge/2025-evaluations-for-useful-models.mdx"
|
|
|
|
| 27 |
import TroubleshootingReproducibility from "./chapters/troubleshooting/troubleshooting-reproducibility.mdx";
|
| 28 |
import ModelInferenceAndEvaluation from "./chapters/general-knowledge/model-inference-and-evaluation.mdx";
|
| 29 |
|
|
|
|
| 31 |
|
| 32 |
## LLM basics to understand evaluation
|
| 33 |
|
| 34 |
+
Now that you have a better (but broad) idea of why evaluation is important and how it's done, let's look at how we prompt models to get some answers out in order to evaluate them. You can skim this section if you have already done evaluation and mostly look for the notes and sidenotes.
|
| 35 |
+
|
| 36 |
|
| 37 |
<ModelInferenceAndEvaluation />
|
| 38 |
|
| 39 |
## Evaluating with existing benchmarks
|
| 40 |
|
| 41 |
+
Now that you've gotten (re)acquainted with the required basics of how tokenization and inference work, and what the caveats are when doing evaluation, let's look at actual benchmarking! We'll first do a small tour of 2025 evaluations, then discuss what to look at in a benchmark, and why you probably can't reproduce announced scores. Lastly, we'll cover the special case of selecting good benchmarks to evaluate training, with the FineWeb team.
|
| 42 |
+
|
| 43 |
### Benchmarks to know in 2025
|
| 44 |
|
| 45 |
<EvalsIn2025 />
|
| 46 |
|
| 47 |
|
| 48 |
### Understanding what's in there
|
| 49 |
|
|
|
|
| 65 |
#### Samples inspection
|
| 66 |
Take 50 random samples and manually inspect them; and I mean do it yourself, not "prompt an LLM to find unusual stuff in the data for you".
|
| 67 |
|
| 68 |
+
First, you want to check the content quality.
|
| 69 |
+
- Are the prompts clear and unambiguous?
|
| 70 |
+
- Are the answers correct? (*Eg: TriviaQA contains several gold answers (aliases field) per question, sometimes conflicting.*)
|
| 71 |
+
- Is information missing? (*Eg: MMLU references absent schematics in a number of questions.*)
|
| 72 |
+
|
| 73 |
+
It's important to keep in mind that a dataset being a standard does not make it a good one - and such flawed standards persist precisely because most people skip this step.
|
| 74 |
|
| 75 |
Then, you want to check for relevance to your task. Are these questions the kind of questions you want to evaluate an LLM on? Are these examples relevant to your use case?
|
| 76 |
|
| 77 |
+
You might also want to check the samples' consistency (especially if you're planning on using few shots or computing aggregated statistics): do all samples have the same number of choices if it's a multiple choice evaluation? Is the spacing consistent before and after the prompt? If your evaluation comes with an additional environment, ideally you want to use it to understand what gets called.
|
| 78 |
|
| 79 |
Lastly, you also want to quickly check how many samples are present there (to make sure results are statistically significant - 100 samples is usually a minimum for automatic benchmarks).
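If you want a quick way to pull those random samples and run these checks, here is a minimal sketch using the 🤗 `datasets` library (the dataset name and column names are placeholders for whatever benchmark you are inspecting):

```python
from datasets import load_dataset

# Hypothetical dataset and column names: swap in the benchmark you are inspecting
ds = load_dataset("your_org/your_benchmark", split="test")

# 50 random samples to read yourself
for row in ds.shuffle(seed=42).select(range(50)):
    print(row["question"], row["choices"], row["answer"], sep="\n", end="\n---\n")

# Quick consistency and size checks
print(len(ds), "samples, choice counts observed:", {len(r["choices"]) for r in ds})
```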
|
| 80 |
|
|
|
|
| 88 |
|
| 89 |
<TroubleshootingReproducibility />
|
| 90 |
|
| 91 |
+
### Selecting good benchmarks automatically for model training
|
| 92 |
+
|
| 93 |
+
<PickingYourEval />
|
| 94 |
|
| 95 |
|
| 96 |
|
| 97 |
## Creating your own evaluation
|
| 98 |
|
| 99 |
+
At this stage, you likely have a good idea of why people do evaluation, which benchmarks exist and are relevant for different model stages (training, inference of base and tuned models), but what if nothing exists for your specific use case?
|
| 100 |
+
|
| 101 |
+
This is precisely when you could want to create your own evaluation.
|
| 102 |
+
|
| 103 |
<DesigningAutomaticEvaluation />
|
| 104 |
|
| 105 |
+
## Conclusion
|
| 106 |
|
| 107 |
|
|
|
|
| 108 |
|
| 109 |
|
| 110 |
|
app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx
CHANGED
|
@@ -496,6 +496,38 @@ On the other hand they:
|
|
| 496 |
- Tracking of win rates or probabilities over training, e.g. as in [this](https://arxiv.org/abs/2410.11677v1) paper, can allow you to detect model degradation and select optimal checkpoints.
|
| 497 |
</Note>
|
| 498 |
|
| 499 |
### Calibration and confidence
|
| 500 |
|
| 501 |
When reporting evaluation results, it's critical to include **confidence intervals** alongside point estimates.
|
|
|
|
| 496 |
- Tracking of win rates or probabilities over training, e.g. as in [this](https://arxiv.org/abs/2410.11677v1) paper, can allow you to detect model degradation and select optimal checkpoints.
|
| 497 |
</Note>
|
| 498 |
|
| 499 |
+
### Constraining model outputs
|
| 500 |
+
In a number of cases, we might want the model to output a prediction which follows a very specific format to simplify evaluation.
|
| 501 |
+
|
| 502 |
+
#### Using a prompt
|
| 503 |
+
The easiest way to do this is to add a task prompt which contains very specific instructions as to how the model should answer (`Provide numerical answers in digits.`,`Use no abbreviation.`, etc).
|
| 504 |
+
|
| 505 |
+
It won't necessarily work all the time but should be good enough for high capability models. That's the approach we followed in the [GAIA](https://huggingface.co/papers/2311.12983) paper for example.
|
| 506 |
+
|
| 507 |
+
#### Few shots and in context learning
|
| 508 |
+
The next way to do so is to constrain the model through what is called "in context learning". By providing examples in the prompt (what is called `few-shot prompting`), the model is implicitly biased towards following the repeated prompt shape for the actual sample.
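To make this concrete, here is a minimal sketch of how a k-shot prompt is typically assembled (the example pool and the `Question:`/`Answer:` formatting are placeholders; the exact template varies per benchmark):

```python
def build_few_shot_prompt(examples, query, k=5):
    """Prepend k solved examples so the model imitates their format on the real query."""
    shots = "\n\n".join(
        f"Question: {ex['question']}\nAnswer: {ex['answer']}" for ex in examples[:k]
    )
    return f"{shots}\n\nQuestion: {query}\nAnswer:"

examples = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
]
print(build_few_shot_prompt(examples, "What is 3 + 5?", k=2))
```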
|
| 509 |
+
|
| 510 |
+
<Note>
|
| 511 |
+
It's a method which worked quite well overall until the end of 2023!
|
| 512 |
+
|
| 513 |
+
However, the widespread adoption of instruction-tuning methods and the addition of instruction data in later stages of model pre-training (continuous pre-training) have biased more recent models towards specific output formats (what is called [here](https://arxiv.org/abs/2407.07890) *Training on the test task*, and what I would call *overfitting the prompt format*). Reasoning models also do not play well with few-shot examples because of the reasoning trace.
|
| 514 |
+
|
| 515 |
+
It's also a method which can be limited for older models with smaller context sizes, as the few-shot examples may not fit into the context window.
|
| 516 |
+
</Note>
|
| 517 |
+
|
| 518 |
+
#### Structured text generation
|
| 519 |
+
Structured text generation constrains the outputs to follow a given path, defined by a grammar or by regular expressions, for example. The `outlines` library implements this using finite state machines, which is very neat. (Other approaches exist, such as using interleaved generation for json generation, but the FSM one is my favorite).
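As a rough illustration of what this looks like in practice, here is a sketch in the style of the `outlines` 0.x API (the exact calls may differ in more recent versions, so treat them as indicative rather than canonical):

```python
# Sketch only: outlines 0.x style API, the library has evolved since
from outlines import models, generate

model = models.transformers("mistralai/Mistral-7B-v0.1")  # any causal LM works here

# Constrain the output to one of a fixed set of choices...
choose = generate.choice(model, ["A", "B", "C", "D"])
answer = choose("Question: ...\nAnswer:")

# ...or to anything matching a regular expression (here, an integer)
digits = generate.regex(model, r"[0-9]+")
count = digits("How many legs does a spider have? Answer:")
```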
|
| 520 |
+
|
| 521 |
+
To understand more about what happens when using structured generation, you can check the [blog](https://huggingface.co/blog/evaluation-structured-outputs) we wrote together: structured generation reduces prompt variance in evaluation, and makes results and rankings more stable. You can also check the overall `outlines` [blog](https://blog.dottxt.co/) for interesting implementations and observations linked to structured generation.
|
| 522 |
+
|
| 523 |
+
However, some recent [research](https://arxiv.org/abs/2408.02442) seems to show that structured generation can lower model performance on some tasks (like reasoning), by moving the prior too far away from the expected probability distribution.
|
| 524 |
+
|
| 525 |
+
<Note title="Going further" emoji="📚" variant="warning">
|
| 526 |
+
- ⭐ [Understanding how finite state machines are used in structured generation](https://blog.dottxt.co/coalescence.html), by Outlines. Super clear guide on how their method works!
|
| 527 |
+
- [The outlines method paper](https://arxiv.org/abs/2307.09702), a more academic explanation of the above
|
| 528 |
+
- [Interleaved generation](https://github.com/guidance-ai/guidance?tab=readme-ov-file#guidance-acceleration), another method to constrain generations for some specific output formats
|
| 529 |
+
</Note>
|
| 530 |
+
|
| 531 |
### Calibration and confidence
|
| 532 |
|
| 533 |
When reporting evaluation results, it's critical to include **confidence intervals** alongside point estimates.
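For accuracy-style metrics, a simple way to get such an interval is a normal approximation (or a bootstrap) over the per-sample scores; a minimal sketch:

```python
import math

def accuracy_with_ci(scores: list[int], z: float = 1.96) -> tuple[float, float]:
    """Accuracy and half-width of its ~95% confidence interval,
    using the normal approximation to the binomial."""
    n = len(scores)
    acc = sum(scores) / n
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    return acc, half_width

acc, ci = accuracy_with_ci([1] * 72 + [0] * 28)  # 72 correct out of 100 samples
print(f"accuracy = {acc:.2f} ± {ci:.2f}")
```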
|
app/src/content/chapters/general-knowledge/model-inference-and-evaluation.mdx
CHANGED
|
@@ -10,12 +10,16 @@ import Image from '../../../components/Image.astro';
|
|
| 10 |
import Note from "../../../components/Note.astro";
|
| 11 |
import Sidenote from "../../../components/Sidenote.astro";
|
| 12 |
import Accordion from "../../../components/Accordion.astro";
|
|
|
|
| 13 |
|
| 14 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 15 |
|
| 16 |
-
This is done in two steps.
|
| 17 |
### Tokenization
|
| 18 |
-
The input text (called a *prompt* at inference) is first split into *tokens*, small units of texts (which can be one or several characters, up to the word level) each associated with a number. The whole range of tokens a model can parse is called its *vocabulary*.
|
| 19 |
|
| 20 |
#### Basics of tokenization: Why and how do we tokenize text?
|
| 21 |
Since large language models are actually big mathematical functions, they eat numbers, not text.
|
|
@@ -31,63 +35,27 @@ Some people therefore had the idea to cut words into sub-words, and assign index
|
|
| 31 |
|
| 32 |
This was initially done using morpho-syntactic rules (*morpho-syntax* is like the grammar of word creation). Now most people use byte pair encoding (BPE), a smart statistical method to create the sub-words automatically depending on their frequency in a reference text.
|
| 33 |
|
| 34 |
-
So as a summary: tokenization is a way to map small units of texts (which can be one or several characters, up to the word level) to numbers (similar to an index). When you want to process text, your input text (called a *prompt* at inference) is split into these *tokens* by a tokenizer. The whole range of tokens a model or tokenizer can parse is called its *vocabulary*.
|
| 35 |
|
| 36 |
<Note title="Going further: Understanding tokenization" emoji="📚" variant="warning">
|
| 37 |
- ⭐ [Explanation of different tokenization methods in the 🤗 NLP Course](https://huggingface.co/learn/nlp-course/en/chapter2/4)
|
| 38 |
- ⭐ [Conceptual guide about tokenization in the 🤗 doc](https://huggingface.co/docs/transformers/en/tokenizer_summary)
|
| 39 |
-
- [Course by Jurafsky on tokenization (and other things)](https://web.stanford.edu/~jurafsky/slp3/2.pdf) -
|
| 40 |
</Note>
|
| 41 |
|
| 42 |
<Note title="Going further: Byte Pair Encoding" emoji="📚" variant="warning">
|
| 43 |
-
|
| 44 |
-
- [Paper introducing BPE to NLP](https://aclanthology.org/P16-1162/)
|
| 45 |
-
</Note>
|
| 46 |
-
|
| 47 |
-
#### Using your own tokenizer? Don't forget to consider the following
|
| 48 |
-
I recommend making sure you understand BPE before this section, see above for some references!
|
| 49 |
-
|
| 50 |
-
**Choosing the correct vocabulary size**
|
| 51 |
-
|
| 52 |
-
The size of the vocabulary indicates how many individual tokens (for example, sub-words) the model will have to learn. A vocabulary which is **too big** might contain some very rare words as full tokens (for example: `aardvark`), which can lead to 2 problems. If such a rare word almost never appears in the training data, it can be hard to connect to other concepts, and the model might be unable to infer what it is about. On the other hand, if it appears rarely and only in specific contexts, it can be linked to some very specific other words: for example, if you train on forum data, and your tokenizer mapped a username as one single token in its vocabulary, your model might then associate this token to the specific user's content.
|
| 53 |
-
|
| 54 |
-
A vocabulary which is **too small** will present 2 other problems: worst representation capabilities, and increased cost at inference.
|
| 55 |
-
|
| 56 |
-
Let's go back to our above example, where we tokenized words derived from `similar`. Using a pseudo BPE approach (large vocabulary) to tokenize `similarly` has split the word into 2 tokens (`similar`, `ly`). If we had used instead character level tokenization (therefore with a very small vocabulary, the size of an alphabet), the same word would be cut into 9 tokens (`s`, `i`, `m`, `i`, `l`, `a`, `r`, `l`, `y`). Where the first method splits `similarly` into tokens which have an individual semantic meaning, it's not the case in the second method: with too small a vocabulary, we lost some semantic representation. The difference in representations length also means that it's many times as costly to generate our word with a smaller vocabulary (takes 9 tokens instead of 2, so 5 times more costly!).
|
| 57 |
-
|
| 58 |
-
At the moment, most people seem to use heuristics for vocabulary size, which seems correlated to number of languages covered and model size, so it's likely that using a number of tokens close to the reference models of a similar size could work for you.
|
| 59 |
|
| 60 |
-
|
| 61 |
-
- [
|
| 62 |
-
- [Fishing for Magikarp, paper by Cohere](https://arxiv.org/abs/2405.05417): Follow up work on to detect these tokens
|
| 63 |
</Note>
|
| 64 |
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
When building or choosing your tokenizer, you construct your vocabulary from reference text. This means that your tokenizer will know vocabulary words and characters from this reference text. Usually, it means using data in English, with a Latin script.
|
| 68 |
-
|
| 69 |
-
If you want to add new language, and your new language uses the same script and share some roots, you could theoretically hope that some of your original language semantics transfer to the new language.
|
| 70 |
|
| 71 |
-
|
| 72 |
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
<iframe
|
| 76 |
-
src="https://OpenEvals-tokenizers-languages.hf.space"
|
| 77 |
-
frameborder="0"
|
| 78 |
-
width="850"
|
| 79 |
-
height="450"
|
| 80 |
-
></iframe>
|
| 81 |
|
| 82 |
-
<Note title="Going further: Language and tokenization" emoji="📚" variant="warning">
|
| 83 |
-
- ⭐ [A beautiful breakdown and demo by Yennie Jun on tokenization issues across languages](https://www.artfish.ai/p/all-languages-are-not-created-tokenized): The breakdown in itself is very clear, and the embedded space comes from her work.
|
| 84 |
-
- ⭐ [A demo by Aleksandar Petrov on unfairness of tokenization](https://aleksandarpetrov.github.io/tokenization-fairness/): I recommend looking at `Compare tokenization of sentences` to get a feel for the differences in cost of inference depending on languages
|
| 85 |
-
</Note>
|
| 86 |
-
|
| 87 |
-
**What about numbers?**
|
| 88 |
-
|
| 89 |
-
When building your tokenizer, you need to decide what to do about numbers. Do you only index 0 to 9, and assume all other numbers will be compositions of digits, or do you want to store numbers up to, say, one billion, individually? Current well known models display a range of approaches to this, but it's unclear what works better to allow mathematical reasoning. Maybe new approaches to tokenization, such as hierarchical tokenization, might be needed for this.
|
| 90 |
-
<Note title="Going further: Number tokenization" emoji="📚" variant="warning">
|
| 91 |
- ⭐ [A nice visual demo by Yennie Jun of how tokenizers of Anthropic, Meta, OpenAI, and Mistral models split numbers](https://www.artfish.ai/p/how-would-you-tokenize-or-break-down)
|
| 92 |
- [Small history by Beren Millidge of the evolution of number tokenization through the years](https://www.beren.io/2024-05-11-Integer-tokenization-is-now-much-less-insane/)
|
| 93 |
</Note>
|
|
@@ -95,14 +63,16 @@ When building your tokenizer, you need to decide what to do about numbers. Do yo
|
|
| 95 |
#### How tokenization can mess up your evaluation
|
| 96 |
**Managing fine-tuned models, system prompts and chat templates**
|
| 97 |
|
| 98 |
-
Pre-2022, models used to simply be pretrained: text in, text out, nothing else. Then, we got instruction tuning and chat models in 2023, and in 2025 reasoning models. This means that we went from using text
|
| 99 |
|
| 100 |
-
|
| 101 |
-
1. add their system prompt at the very beginning of inference
|
| 102 |
-
2. prompt them using a chat template if they require it (usually adding `Assistant` and `User` prefixes to the dialogue turns - learn more about this in [this cool guide](https://huggingface.co/docs/transformers/main/en/chat_templating))
|
| 103 |
-
3. remove the thinking trace from the model answer before processing it (you can usually regex to remove what's between the `<think>` tags)
|
| 104 |
|
| 105 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 106 |
<Note title="Critical: Chat templates and tokenization" emoji="⚡" variant="danger">
|
| 107 |
|
| 108 |
<Image src={chatTemplatesTokenisation} alt="Spacing, tokenization and template" />
|
|
@@ -110,40 +80,40 @@ This means a number of models are going to perform terribly if you do not make s
|
|
| 110 |
Different tokenizers behave differently with spacing and special tokens. See this [visualization](https://x.com/danielhanchen/status/1796952220619157694) showing how spacing, tokenization, and templates interact. Never assume tokenizers behave identically!
|
| 111 |
</Note>
|
| 112 |
|
| 113 |
-
**
|
| 114 |
-
|
| 115 |
-
When looking at an MCQA evaluation, in general, you want to tokenize the context together with the choices, as it creates a succession of tokens which is likely/natural for the model.
|
| 116 |
-
|
| 117 |
-
<Note title="Should you tokenize the context with the choices always?">
|
| 118 |
-
Some tokenizers (like the [Llama one](https://github.com/EleutherAI/lm-evaluation-harness/pull/531#issuecomment-1595586257)) do not satisfy `enc(context + choice) = enc(context) + enc(choice)` (and add or remove spacing). This means that comparing the logprobabilities of the choices only is not trivial, as the context tokens can "bleed out" into them, messing up the comparison.
|
| 119 |
|
| 120 |
-
|
| 121 |
|
| 122 |
-
|
| 123 |
-
If you tokenize the context with the choices, you compare `C1C2` (one token) with `C1+C3` (two tokens). Even if you normalize the logprobs by length, you are not comparing the same thing.
|
| 124 |
-
Comparing after tokenizing the context and choices separately means you compare `C1+C2` and `C1+C3`. But since `C1C2` is a token, the occurence of `C1+C2` is likely rare in the data your encoder saw, so it is an unlikely succession for your model, which can mess up your logprobabilities.
|
| 125 |
-
|
| 126 |
-
If this is the case for your model, the solution is usually to go for the least worst option, comparing the comparable: compute the tokens of context and choice separately and then concatenate them after removing the special start/end of sentence tokens which might have been added.
|
| 127 |
-
</Note>
|
| 128 |
|
|
|
|
| 129 |
|
| 130 |
-
|
| 131 |
|
| 132 |
-
|
| 133 |
|
| 134 |
-
|
| 135 |
|
| 136 |
-
|
| 137 |
|
| 138 |
-
|
| 139 |
|
| 140 |
-
**Code evaluations and end of sentence tokens**
|
| 141 |
|
| 142 |
-
|
|
|
|
|
|
|
|
|
|
| 143 |
|
| 144 |
|
| 145 |
### Inference
|
| 146 |
|
|
|
|
|
|
|
| 147 |
From this input text, the LLM generates a probability distribution of the most likely next tokens over all the vocabulary. To get a continued generation, we can take the most probable token (give or take some added randomness to get more interesting outputs) as the next one, then repeat the operation, using the new token as the end of the prompt, etc.
|
| 148 |
|
| 149 |
<Image src={llmTk1} alt="LLM tokenization and prediction process" />
|
|
@@ -177,6 +147,7 @@ This allows us to apply one of the following metrics:
|
|
| 177 |
To learn more about calibration, you can check [this paper](https://arxiv.org/abs/2207.05221) from Anthropic, on what it is, how to detect it, and how to train models to be well calibrated, and [this paper](https://arxiv.org/abs/2311.14648) on some possible limits of calibration).
|
| 178 |
</Sidenote>
|
| 179 |
|
|
|
|
| 180 |
<Note>
|
| 181 |
A multiple choice question answer can be expressed as a free form generative evaluation too! For this reason, you'll sometimes see a mention of the task **formulation**.
|
| 182 |
|
|
@@ -195,6 +166,20 @@ The point at which MMLU MCF becomes non-random depends on the model size and tra
|
|
| 195 |
|
| 196 |
</Note>
|
| 197 |
|
| 198 |
#### Generative evaluations
|
| 199 |
For a generative evaluation, we want the text generated by the model given an input prompt.
|
| 200 |
|
|
@@ -210,34 +195,3 @@ We can then compare this generation with references and score the distance betwe
|
|
| 210 |
</Note>
|
| 211 |
|
| 212 |
|
| 213 |
-
### Constraining model outputs
|
| 214 |
-
In a number of cases, we want the model output to follow a specific format, for example to compare them to a reference.
|
| 215 |
-
|
| 216 |
-
#### Using a prompt
|
| 217 |
-
The easiest way to do this is to add a task prompt which contains very specific instructions as to how the model should answer (`Provide numerical answers in digits.`,`Use no abbreviation.`, etc).
|
| 218 |
-
|
| 219 |
-
It won't necessarily work all the time but should be good enough for high capability models. That's the approach we followed in the [GAIA](https://huggingface.co/papers/2311.12983) paper for example.
|
| 220 |
-
|
| 221 |
-
#### Few shots and in context learning
|
| 222 |
-
The next way to do so is to constrain the model through what is called "in context learning". By providing examples in the prompt (what is called `few-shot prompting`), the model is implicitly biased towards following the repeated prompt shape for the actual sample.
|
| 223 |
-
|
| 224 |
-
<Note>
|
| 225 |
-
It's a method which was overall working quite well until end of 2023!
|
| 226 |
-
|
| 227 |
-
However, the widespread adoption of instruction-tuning methods and the addition of instruction data in later stages of model pre-training (continuous pre-training) has biased more recent models towards specific output formats (what is being called [here](https://arxiv.org/abs/2407.07890) *Training on the test task*, and what I would call *overfitting the prompt format*). Reasoning models are also not playing that well with few shot examples because of the reasoning trace.
|
| 228 |
-
|
| 229 |
-
It's also a method which can be limited for older models with smaller context sizes, as some few-shot examples can not fit into the context window.
|
| 230 |
-
</Note>
|
| 231 |
-
|
| 232 |
-
#### Structured text generation
|
| 233 |
-
Structured text generation constrains the outputs to follow a given path, defined by a grammar or by regular expressions, for example. The `outlines` library implements this using finite state machines, which is very neat. (Other approaches exist, such as using interleaved generation for json generation, but the FSM one is my favorite).
|
| 234 |
-
|
| 235 |
-
To understand more about what happens when using structured generation, you can check the [blog](https://huggingface.co/blog/evaluation-structured-outputs) we wrote together: structured generation reduce prompt variance in evaluation, and make results and rankings more stable. You can also check the overall `outlines` [blog](https://blog.dottxt.co/) for interesting implementations and observations linked to structured generation.
|
| 236 |
-
|
| 237 |
-
However, some recent [research](https://arxiv.org/abs/2408.02442) seems to show that structured generation can lower model performance on some tasks (like reasoning), by moving the prior too far away from the expected probability distribution.
|
| 238 |
-
|
| 239 |
-
<Note title="Going further" emoji="📚" variant="warning">
|
| 240 |
-
- ⭐ [Understanding how Finite State Machine when using structured generation](https://blog.dottxt.co/coalescence.html), by Outlines. Super clear guide on how their method works!
|
| 241 |
-
- [The outlines method paper](https://arxiv.org/abs/2307.09702), a more academic explanation of the above
|
| 242 |
-
- [Interleaved generation](https://github.com/guidance-ai/guidance?tab=readme-ov-file#guidance-acceleration), another method to constrain generations for some specific output formats
|
| 243 |
-
</Note>
|
|
|
|
| 10 |
import Note from "../../../components/Note.astro";
|
| 11 |
import Sidenote from "../../../components/Sidenote.astro";
|
| 12 |
import Accordion from "../../../components/Accordion.astro";
|
| 13 |
+
import HtmlEmbed from "../../../components/HtmlEmbed.astro";
|
| 14 |
|
| 15 |
+
In this section, we'll look at two steps for models: how input is preprocessed to be given to the model (`tokenization`), and how the model generates a prediction from it (`inference`).
|
| 16 |
+
|
| 17 |
+
<Sidenote> If you want to learn more about how to actually train a model, you should go read the [Smol Training Guidebook!](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook)</Sidenote>
|
| 18 |
+
|
| 19 |
+
<HtmlEmbed src="d3-tokenization.html" frameless />
|
| 20 |
|
|
|
|
| 21 |
### Tokenization
|
| 22 |
+
The input text (called a *prompt* at inference) is first split into *tokens*, small units of texts (which can be one or several characters, up to the word level) each associated with a number. The whole range of tokens a model can parse is called its *vocabulary*.
|
| 23 |
|
| 24 |
#### Basics of tokenization: Why and how do we tokenize text?
|
| 25 |
Since large language models are actually big mathematical functions, they eat numbers, not text.
|
|
|
|
| 35 |
|
| 36 |
This was initially done using morpho-syntactic rules (*morpho-syntax* is like the grammar of word creation). Now most people use byte pair encoding (BPE), a smart statistical method to create the sub-words automatically depending on their frequency in a reference text.
|
| 37 |
|
| 38 |
+
So as a summary: tokenization is a way to map small units of texts (which can be one or several characters, up to the word level) to numbers (similar to an index). When you want to process text, your input text (called a *prompt* at inference) is split into these *tokens* by a tokenizer. The whole range of tokens a model or tokenizer can parse is called its *vocabulary*.
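To see this in action, here is a minimal sketch with a 🤗 `transformers` tokenizer (any model name works; `gpt2` is just a small, convenient example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

prompt = "Evaluating LLMs is similar to, yet different from, evaluating classifiers."
tokens = tokenizer.tokenize(prompt)  # the sub-word strings
ids = tokenizer.encode(prompt)       # their indices in the vocabulary

print(tokens)
print(ids)
print(f"vocabulary size: {tokenizer.vocab_size}")
```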
|
| 39 |
|
| 40 |
<Note title="Going further: Understanding tokenization" emoji="📚" variant="warning">
|
| 41 |
- ⭐ [Explanation of different tokenization methods in the 🤗 NLP Course](https://huggingface.co/learn/nlp-course/en/chapter2/4)
|
| 42 |
- ⭐ [Conceptual guide about tokenization in the 🤗 doc](https://huggingface.co/docs/transformers/en/tokenizer_summary)
|
| 43 |
+
- [Course by Jurafsky on tokenization (and other things)](https://web.stanford.edu/~jurafsky/slp3/2.pdf) - skip to 2.5 and 2.6
|
| 44 |
</Note>
|
| 45 |
|
| 46 |
<Note title="Going further: Byte Pair Encoding" emoji="📚" variant="warning">
|
| 47 |
+
I would strongly recommend reading a longer explanation on how BPE works, as it's really a foundation of modern LLMs.
|
| 48 |
|
| 49 |
+
- ⭐ [Explanation of BPE in the 🤗 NLP Course](https://huggingface.co/learn/nlp-course/en/chapter6/5)
|
| 50 |
+
- [BPE Paper (for text, as the method existed before in other fields)](https://aclanthology.org/P16-1162/)
|
|
|
|
| 51 |
</Note>
|
| 52 |
|
| 53 |
+
Building a tokenizer requires making more choices than one would expect. For example, to tokenize numbers, you don't want to use basic BPE: do you only index 0 to 9, and assume all other numbers will be compositions of digits? Or do you want to store numbers up to, say, one billion, individually?
|
|
|
|
|
|
|
|
|
|
|
|
|
| 54 |
|
| 55 |
+
Current well-known models display a range of approaches to this, but it's unclear what works better to allow mathematical reasoning. This will affect some mathematical evaluations (and is the reason why almost no evaluation is pure arithmetic).
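You can observe these diverging choices directly by tokenizing the same expression with different tokenizers; a small sketch (the model names are just examples of publicly available tokenizers, swap in whichever you care about):

```python
from transformers import AutoTokenizer

for name in ["gpt2", "EleutherAI/pythia-70m"]:  # example public tokenizers
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.tokenize("12345 + 67890 = 80235"))
```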
|
| 56 |
|
| 57 |
+
<Note title="Going further: Tokenizing numbers" emoji="📚" variant="warning">
| 59 |
- ⭐ [A nice visual demo by Yennie Jun of how tokenizers of Anthropic, Meta, OpenAI, and Mistral models split numbers](https://www.artfish.ai/p/how-would-you-tokenize-or-break-down)
|
| 60 |
- [Small history by Beren Millidge of the evolution of number tokenization through the years](https://www.beren.io/2024-05-11-Integer-tokenization-is-now-much-less-insane/)
|
| 61 |
</Note>
|
|
|
|
| 63 |
#### How tokenization can mess up your evaluation
|
| 64 |
**Managing fine-tuned models, system prompts and chat templates**
|
| 65 |
|
| 66 |
+
Pre-2022, models used to simply be pretrained: text in, text out, nothing else. Then, we got instruction tuning and chat models in 2023, and in 2025 reasoning models. This means that we went from using raw text to using more and more formatting.
|
| 67 |
|
| 68 |
+
<HtmlEmbed src="d3-tokenization-timeline.html" frameless />
|
|
|
|
|
|
|
|
|
|
| 69 |
|
| 70 |
|
| 71 |
+
This means a number of models are going to perform terribly if you do not make sure to:
|
| 72 |
+
1. respect the format the model expects
|
| 73 |
+
2. add a system prompt at the very beginning of inference if your model requires one
|
| 74 |
+
3. remove the thinking trace from reasoning models' answers before processing them (you can usually use a regex to remove what's between the `<think>` tags)
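A minimal sketch of these steps with `transformers` (the model name and the exact `<think>` tag convention are assumptions; both depend on the model you evaluate):

```python
import re
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # example chat model

messages = [
    {"role": "system", "content": "You are a concise assistant."},  # system prompt if required
    {"role": "user", "content": "What is 2 + 2?"},
]
# Let the tokenizer apply the chat template the model expects
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Strip the reasoning trace before scoring (tag convention varies per model)
raw_answer = "<think>2 plus 2 is 4.</think>The answer is 4."
clean_answer = re.sub(r"<think>.*?</think>", "", raw_answer, flags=re.DOTALL).strip()
```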
|
| 75 |
+
|
| 76 |
<Note title="Critical: Chat templates and tokenization" emoji="⚡" variant="danger">
|
| 77 |
|
| 78 |
<Image src={chatTemplatesTokenisation} alt="Spacing, tokenization and template" />
|
|
|
|
| 80 |
Different tokenizers behave differently with spacing and special tokens. See this [visualization](https://x.com/danielhanchen/status/1796952220619157694) showing how spacing, tokenization, and templates interact. Never assume tokenizers behave identically!
|
| 81 |
</Note>
|
| 82 |
|
| 83 |
+
**Paying attention to start and end of sentence tokens**
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 84 |
|
| 85 |
+
Some pretrained models, like the `Gemma` ones, are extremely sensitive to the [inclusion of start of sentence tokens](https://github.com/EleutherAI/lm-evaluation-harness/pull/1465) at inference. You might need to do a couple of experiments to see if that happens for you, and add these tokens manually when evaluating if they are not in your dataset.
|
| 86 |
|
| 87 |
+
You can also encounter some issues where your model won't stop on an end of sentence token like you would expect. Code models usually have been trained with `\n\t` as a single token. This means that when generating text, they will often generate `\n\t` in one step. A task which defines `\n` as an end of sentence token (= to stop the generation) will let the model continue generating after a `\n\t`, if predicted as one token, since it's not the same as `\n`. But you would actually still want the model to stop. In these cases, you either need to update your end of sentence tokens, or define a mechanism to backtrack on the character representation of the latest tokens to stop (and cut) the generation a posteriori.
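One way to implement that backtracking is to post-process the decoded text with string-level stop sequences, rather than relying only on end-of-sentence token ids; a minimal sketch (the stop strings are illustrative, pick the ones your task defines):

```python
def truncate_at_stop_sequences(text: str, stop_sequences: list[str]) -> str:
    """Cut a generation at the first occurrence of any stop string,
    even if it was produced inside a multi-character token like '\n\t'."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

generation = "def add(a, b):\n\treturn a + b\n\nprint(add(1, 2))"
print(truncate_at_stop_sequences(generation, ["\n\n"]))  # keep only the function body
```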
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 88 |
|
| 89 |
+
**Multilinguality and tokenization**
|
| 90 |
|
| 91 |
+
When looking at multilingual evaluations, you'll encounter two issues.
|
| 92 |
|
| 93 |
+
First, as some languages do not always use spacing as a word separator (Korean, Thai, Japanese, and Chinese, to cite a few), they will require language-specific tokenizers to be split properly, otherwise their scores on metrics such as [BLEU](https://github.com/EleutherAI/lm-evaluation-harness/issues/212), F1, etc will be affected.
|
| 94 |
|
| 95 |
+
Then, tokenizers in general might be unfair to non-English languages. When training a BPE tokenizer, you use data from the different languages you want to cover; most of the time, though, this data is unbalanced between languages (with, for example, an order of magnitude more English than Thai or Burmese). Since BPE tokenizers create their vocabulary tokens based on the most frequent words seen, most of the long tokens will be English words - and most of the words from the less frequent languages will only be split at the character level. This effect leads to an unfairness in multilingual tokenization: some (less frequent, or *lower-resourced*) languages require orders of magnitude more tokens than English to generate a sentence of equivalent length.
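You can measure this effect yourself by counting the tokens a given tokenizer produces for translations of the same sentence; a small sketch (the sentences are rough example translations, and `gpt2` stands in for any English-centric tokenizer):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # an English-centric tokenizer

sentences = {
    "English": "I would like a cup of tea, please.",
    "French": "Je voudrais une tasse de thé, s'il vous plaît.",
    "Thai": "ขอชาหนึ่งถ้วยค่ะ",
}
for lang, sentence in sentences.items():
    print(lang, len(tokenizer.encode(sentence)))
```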
|
| 96 |
|
| 97 |
+
<iframe
|
| 98 |
+
src="https://OpenEvals-tokenizers-languages.hf.space"
|
| 99 |
+
frameborder="0"
|
| 100 |
+
width="850"
|
| 101 |
+
height="450"
|
| 102 |
+
></iframe>
|
| 103 |
|
| 104 |
+
If you are in this case, the number of tokens that the model is allowed to generate for an evaluation should also be language dependent, as not all languages are tokenized into a similar number of tokens.
|
| 105 |
|
|
|
|
| 106 |
|
| 107 |
+
<Note title="Going further: Language and tokenization" emoji="📚" variant="warning">
|
| 108 |
+
- ⭐ [A beautiful breakdown and demo by Yennie Jun on tokenization issues across languages](https://www.artfish.ai/p/all-languages-are-not-created-tokenized): The breakdown in itself is very clear, and the embedded space comes from her work.
|
| 109 |
+
- ⭐ [A demo by Aleksandar Petrov on unfairness of tokenization](https://aleksandarpetrov.github.io/tokenization-fairness/): I recommend looking at `Compare tokenization of sentences` to get a feel for the differences in cost of inference depending on languages
|
| 110 |
+
</Note>
|
| 111 |
|
| 112 |
|
| 113 |
### Inference
|
| 114 |
|
| 115 |
+
Now that we know how to convert our input text into something the LLMs can parse, let's look at how models process this text.
|
| 116 |
+
|
| 117 |
From this input text, the LLM generates a probability distribution of the most likely next tokens over all the vocabulary. To get a continued generation, we can take the most probable token (give or take some added randomness to get more interesting outputs) as the next one, then repeat the operation, using the new token as the end of the prompt, etc.
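Concretely, a greedy decoding loop looks roughly like this with `transformers` (simplified sketch: no sampling, no KV-cache handling, `gpt2` as an arbitrary small model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer.encode("The capital of France is", return_tensors="pt")
with torch.no_grad():
    for _ in range(5):
        logits = model(input_ids).logits     # shape: (batch, sequence, vocabulary)
        next_token = logits[0, -1].argmax()  # greedy pick: most probable next token
        input_ids = torch.cat([input_ids, next_token.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```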
|
| 118 |
|
| 119 |
<Image src={llmTk1} alt="LLM tokenization and prediction process" />
|
|
|
|
| 147 |
To learn more about calibration, you can check [this paper](https://arxiv.org/abs/2207.05221) from Anthropic, on what it is, how to detect it, and how to train models to be well calibrated, and [this paper](https://arxiv.org/abs/2311.14648) on some possible limits of calibration).
|
| 148 |
</Sidenote>
|
| 149 |
|
| 150 |
+
|
| 151 |
<Note>
|
| 152 |
A multiple choice question answer can be expressed as a free form generative evaluation too! For this reason, you'll sometimes see a mention of the task **formulation**.
|
| 153 |
|
|
|
|
| 166 |
|
| 167 |
</Note>
|
| 168 |
|
| 169 |
+
<Accordion title="Should you tokenize the context with the choices always?">
|
| 170 |
+
When looking at a multiple-choice (MCQA) evaluation, in general you want to tokenize the context together with the choices, as this creates a succession of tokens which is likely/natural for the model.
|
| 171 |
+
|
| 172 |
+
However, some tokenizers (like the [Llama one](https://github.com/EleutherAI/lm-evaluation-harness/pull/531#issuecomment-1595586257)) do not satisfy `tok(context + choice) = tok(context) + tok(choice)` (and add or remove spacing). This means that comparing the logprobabilities of the choices only is not trivial, as the context tokens can "bleed out" into them, messing up the comparison.
|
| 173 |
+
|
| 174 |
+
To give a concrete example, say you have characters `C1`, `C2`, and `C3` as base tokens of your vocabulary, and `C1C2` also happens to be a single token learned during BPE.
|
| 175 |
+
|
| 176 |
+
Say your context is `C1`, and the choices `C2` and `C3`.
|
| 177 |
+
If you tokenize the context with the choices, you compare `C1C2` (one token) with `C1+C3` (two tokens). Even if you normalize the logprobs by length, you are not comparing the same thing.
|
| 178 |
+
Comparing after tokenizing the context and choices separately means you compare `C1+C2` and `C1+C3`. But since `C1C2` is a token, the occurrence of `C1+C2` is likely rare in the data your encoder saw, so it is an unlikely succession for your model, which can mess up your logprobabilities.
|
| 179 |
+
|
| 180 |
+
If this is the case for your model, the solution is usually to go for the least worst option, comparing the comparable: compute the tokens of context and choice separately and then concatenate them after removing the special start/end of sentence tokens which might have been added.
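A sketch of that "least worst" option (tokenizer-agnostic; the helper name is made up for illustration):

```python
def encode_choice_pair(tokenizer, context: str, choice: str):
    """Tokenize context and choice separately, drop special tokens,
    and concatenate so every choice is compared on the same context tokens."""
    context_ids = tokenizer.encode(context, add_special_tokens=False)
    choice_ids = tokenizer.encode(choice, add_special_tokens=False)
    # Keep the choice length around for length-normalizing the logprobs later
    return context_ids + choice_ids, len(choice_ids)
```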
|
| 181 |
+
</Accordion>
|
| 182 |
+
|
| 183 |
#### Generative evaluations
|
| 184 |
For a generative evaluation, we want the text generated by the model given an input prompt.
|
| 185 |
|
|
|
|
| 195 |
</Note>
|
| 196 |
|
| 197 |
app/src/content/chapters/general-knowledge/picking-your-evaluation.mdx
CHANGED
|
@@ -6,11 +6,11 @@ import Note from "../../../components/Note.astro";
|
|
| 6 |
import Sidenote from "../../../components/Sidenote.astro";
|
| 7 |
import HtmlEmbed from "../../../components/HtmlEmbed.astro";
|
| 8 |
|
| 9 |
-
|
| 10 |
|
| 11 |
-
|
| 12 |
|
| 13 |
-
Then, we began task selection with two primary goals: ensuring **evaluation diversity**, and making sure each task provided a **reliable signal** during pre-training.
|
| 14 |
|
| 15 |
For evaluation diversity, we aimed to assess a broad range of model capabilities, including:
|
| 16 |
|
|
|
|
| 6 |
import Sidenote from "../../../components/Sidenote.astro";
|
| 7 |
import HtmlEmbed from "../../../components/HtmlEmbed.astro";
|
| 8 |
|
| 9 |
+
In some cases, you don't want to "just" reproduce existing scores a posteriori, but you actually need to understand how well your model is training while it's happening. The evaluations you need then have different properties than evaluations of final model performance, as you need tasks which provide a good signal even when the model is not yet very good.
|
| 10 |
|
| 11 |
+
So the FineWeb team designed a method to select the best evaluations for pre-training ablations, across 9 languages - let's listen to their wise advice.
|
| 12 |
|
| 13 |
+
For these languages, we collected and implemented all available tasks that we could find, a total of **185 tasks**. Then, we began task selection with two primary goals: ensuring **evaluation diversity**, and making sure each task provided a **reliable signal** during pre-training.
|
| 14 |
|
| 15 |
For evaluation diversity, we aimed to assess a broad range of model capabilities, including:
|
| 16 |
|
app/src/content/chapters/intro.mdx
CHANGED
|
@@ -37,40 +37,33 @@ There are 3 main reasons for which people do evaluation, which tend to be confla
|
|
| 37 |
|
| 38 |
When you select a setup to train models, you want to test something very similar, and make sure that your changes (choosing different training data, architecture, parameters, etc) have not "broken" the expected performance for a model of these properties.
|
| 39 |
|
| 40 |
-
In ML,
|
| 41 |
|
| 42 |
-
For
|
| 43 |
|
| 44 |
-
|
| 45 |
|
| 46 |
-
|
| 47 |
|
| 48 |
-
|
|
|
|
|
|
|
| 49 |
|
| 50 |
-
If you have a leaderboard for your domain and task, take the best model, and it's not working for you, it's unlikely the next best model will work.
|
| 51 |
<Sidenote>
|
| 52 |
In [their paper](https://arxiv.org/pdf/2404.02112) about lessons learned on benchmarking and dataset design from the ImageNet era, the authors argue that, since scores are susceptible to instability, the only robust way to evaluate models is through rankings, and more specifically by finding broad groups of evaluations which provide consistent and stable rankings. I believe looking for ranking stability is indeed an extremely interesting approach to model benchmarking, as we have shown that LLMs *scores* on automated benchmarks are extremely susceptible to [minute changes in prompting](https://huggingface.co/blog/evaluation-structured-outputs), and that human evaluations are not more consistent - where *rankings* are actually more stable when using robust evaluation methods.
|
| 53 |
</Sidenote>
|
| 54 |
|
|
|
|
| 55 |
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
<Note>
|
| 59 |
-
"How do you know for sure if models can do X?" is a question which comes up a lot, and it is a very valid one. However, for any complex capability, **we cannot at the moment just say "this model is the best at this", but instead "this model is the best on this task that we hope is a good proxy for this capability, without any guarantee"**.
|
| 60 |
</Note>
|
| 61 |
|
| 62 |
-
|
| 63 |
#### When will we finally reach AGI?
|
| 64 |
|
| 65 |
-
We are strongly missing any kind of good definitions and framework on what intelligence is for machine learning models (though some people have tried, for example [Chollet](https://arxiv.org/abs/1911.01547) in 2019 and [Hendrycks et al](https://www.agidefinition.ai/paper.pdf) this year).
|
| 66 |
-
|
| 67 |
-
To solve this, we should look at social sciences, as in these fields, people are used to thinking seriously about confounding factors in data gathering and results analysis, which I'm not seeing a lot in "intelligence evaluation" in ML for now.
|
| 68 |
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
<Sidenote>
|
| 72 |
-
I also believe that this question is a bad one, as targeting "general intelligence" is much more blurry, risky, and less useful than targetting good tools with specific capabilities for actual problems that humans encounter in their daily life.
|
| 73 |
-
</Sidenote>
|
| 74 |
|
| 75 |
### So how do people evaluate models, then?
|
| 76 |
|
|
@@ -81,22 +74,24 @@ To my knowledge, at the moment, people use 3 main ways to do evaluation: automat
|
|
| 81 |
Automated benchmarking usually works the following way: you'd like to know how well your model performs on something. This something can be a well-defined concrete **task**, such as *How well can my model classify spam from non spam emails?*, or a more abstract and general **capability**, such as *How good is my model at math?*.
|
| 82 |
|
| 83 |
From this, you construct an evaluation, usually made of two things:
|
| 84 |
-
- a collection of *samples*, given as input to the model to see what comes out as output, sometimes coupled with a reference (called gold) to compare with. Samples are usually designed to try to emulate what you want to test the model on: for example, if you are looking at
|
| 85 |
-
- a *metric*, which is a way to compute a score for the model. For example, how accurately can your model classify
|
| 86 |
|
| 87 |
This is more interesting to do on data that was not included in the model training set, because you want to test if it **generalizes** well. You don't want a model which can only classify emails it has already "seen", that would not be very useful!
|
| 88 |
|
| 89 |
<Note>
|
| 90 |
-
A model which can only predict well on its training data (and has not latently learnt more high-level general patterns) is said to be **overfitting**. In less extreme cases, you still want to test if your model is able to generalize to data patterns which were not in the training set's distribution (for example, classify
|
| 91 |
</Note>
|
| 92 |
|
| 93 |
-
This works quite well for very well-defined tasks, where performance is "easy" to assess and measure: when you are literally testing your model on
|
| 94 |
|
| 95 |
-
For capabilities however, it's hard to decompose them into well-defined and precise tasks: what does "good at math" mean? good at arithmetic? at logic? able to reason on mathematical concepts?
|
| 96 |
|
| 97 |
-
In this case, people tend to do more "holistic" evaluations, by not decomposing the capability in actual tasks, but assuming that performance on general samples will be a **good proxy** for what we aim to measure. For example, GSM8K is made of actual high school math problems, which require a whole set of capabilities to solve. It also means that both failure and success are very hard to interpret. Some capabilities or topics, such as "is this model good at writing poetry?" or "are the model outputs helpful?" are even harder to evaluate with automatic metrics - and at the same time, models now seem to have more and more **generalist** capabilities, so we need to evaluate their abilities in a broader manner.
|
| 98 |
|
| 99 |
-
|
|
|
|
|
|
|
| 100 |
|
| 101 |
<Note>
|
| 102 |
The case were an evaluation dataset ends up in the training set is called **contamination**, and a model which was contaminated will have a high benchmark performance that does not generalize well to the underlying task (an extensive description of contamination can be found [here](https://aclanthology.org/2023.findings-emnlp.722/), and here is a fun way to [detect it](https://arxiv.org/abs/2311.06233)). A way to address contamination is to run [**dynamic benchmarks**](https://arxiv.org/abs/2104.14337) (evaluations on datasets which are regularly refreshed to provide scores on systematically unseen new data), but this approach is costly in the long term.
|
|
@@ -110,22 +105,20 @@ This is usually done by tasking humans with first, prompting models, then, gradi
|
|
| 110 |
|
| 111 |
Different approaches exist to evaluate models with humans in the loop.
|
| 112 |
|
| 113 |
-
**Vibes-checks** is the name given to manual evaluations done individually by some members of the community, usually on undisclosed prompts, to get an overall "feeling" of how well models perform on many use cases, which range from coding to quality of smut written. (I've also seen the term "canary-testing" used for this, in reference to high signal canary in a coalmine approach). Often shared on Twitter and Reddit, they mostly constitute anecdotal evidence, and tend to be highly sensitive to confirmation bias (in other words, people tend to find what they look for).
|
| 114 |
|
| 115 |
Using community feedback to establish massive model rankings is what we call an **arena**. A well known example of this is the [LMSYS chatbot arena](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard), where community users are asked to chat with models until they find one is better than the other. Votes are then aggregated in an Elo ranking (a ranking of matches) to select which model is "the best". The obvious problem of such an approach is the high subjectivity - it's hard to enforce a consistent grading from many community members using broad guidelines, especially since annotators preferences tend to be [culturally bound](https://arxiv.org/abs/2404.16019v1) (with different people favoring different discussion topics, for example). One can hope that this effect is smoothed over by the sheer scale of the votes, through a "wisdom of the crowd" effect (this effect was found by a statistician named Galton, who observed that individual answers trying to estimate a numerical value, like the weight of a hog, could be modeled as a probability distribution centered around the actual answer).
|
| 116 |
|
| 117 |
The last approach is **systematic annotations**, where you provide extremely specific guidelines to paid selected annotators, in order to remove as much as the subjectivity bias as possible (this is the approach used by most data annotation companies). However, it can get extremely expensive fast, as you have to keep on doing evaluations in a continuous and non automatic manner for every new model you want to evaluate, and it can still fall prey to human bias (this [study](https://arxiv.org/abs/2205.00501) showed that people with different identities tend to rate model answer toxicity very differently).
|
| 118 |
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
These biases are not unexpected, but they must be taken into account: not all use cases should rely on using human annotators, especially crowdsourced, unexpert ones - any task requiring factuality (such as code writing, evaluation of model knowledge, etc) should include another, more robust, type of evaluation to complete the benchmark.
|
| 122 |
|
| 123 |
#### Model as a judge
|
| 124 |
|
| 125 |
To mitigate the cost of human annotators, some people have looked into using models or derived artifacts (preferably aligned with human preferences) to evaluate models' outputs. This approach is not new, as you can find techniques to measure summarization quality from [model embeddings](https://arxiv.org/abs/1904.09675) in 2019.
|
| 126 |
|
| 127 |
-
Two approach exist for grading: using [generalist, high capability models](https://arxiv.org/abs/2306.05685v4) or using [small specialist models](https://arxiv.org/pdf/2405.01535) trained specifically to discriminate from preference data.
|
| 128 |
|
| 129 |
-
|
| 130 |
|
| 131 |
My main personal gripe with using models as judges is that they introduce very subtle and un-interpretable bias in the answer selection. I feel that, much like when crossbreeding too much in genetics studies, you end up with dysfunctional animals or plants, by using LLMs to select and train LLMs, we are just as likely to introduce minute changes that will have bigger repercussions a couple generations down the line. I believe this type of bias is less likely to occur in smaller and more specialized models as judges (such as toxicity classifiers), but this remains to be rigorously tested and proven.
|
|
|
|
| 37 |
|
| 38 |
When you select a setup to train models, you want to test something very similar, and make sure that your changes (choosing different training data, architecture, parameters, etc) have not "broken" the expected performance for a model of these properties.
|
| 39 |
|
| 40 |
+
In ML, experiments which test the impact of small changes on model performance are referred to as **ablations**, and the core of them is actually having a good set of evaluations: they need to provide a strong enough signal while being relatively cheap to run, as you'll be running them **a lot**.
|
| 41 |
|
| 42 |
+
For ablations, you also need to look at both **trajectories** (is the performance better now than when starting training?) and score **ranges** (is the performance within what's expected?). You actually... don't really care about the precise scores themselves! This evaluation is therefore not here to tell you anything about actual model capabilities, but instead just here to confirm that your training approach is "as sound as or better than" the other training approach, and that your model behaves in similar ways.
|
| 43 |
|
| 44 |
+
#### Which model is the best on \<task\>?
|
| 45 |
|
| 46 |
+
The next role of evaluation is simply to sort models to find and select the best model for a given use case.
|
| 47 |
|
| 48 |
+
For common topics like math, code, or knowledge, there are likely several leaderboards comparing and ranking models using different datasets, and you usually just have to test the top contenders to find the best model for you (if they are not working for you, it's unlikely the next best models will work).
|
| 49 |
+
|
| 50 |
+
You might want to run the evaluation and comparison yourself (by reusing existing benchmarks) to get more detail on the model's successes and failures, which we will cover below.
|
| 51 |
|
|
|
|
| 52 |
<Sidenote>
|
| 53 |
In [their paper](https://arxiv.org/pdf/2404.02112) about lessons learned on benchmarking and dataset design from the ImageNet era, the authors argue that, since scores are susceptible to instability, the only robust way to evaluate models is through rankings, and more specifically by finding broad groups of evaluations which provide consistent and stable rankings. I believe looking for ranking stability is indeed an extremely interesting approach to model benchmarking, as we have shown that LLMs *scores* on automated benchmarks are extremely susceptible to [minute changes in prompting](https://huggingface.co/blog/evaluation-structured-outputs), and that human evaluations are not more consistent - where *rankings* are actually more stable when using robust evaluation methods.
|
| 54 |
</Sidenote>
|
| 55 |
|
| 56 |
+
For less common topics, you might even need to think about designing your own evaluations, which is our last section.
|
| 57 |
|
| 58 |
+
<Note title="Small caveat">
|
| 59 |
+
Despite often grandiose claims, for any complex capability, we cannot at the moment just say "this model is the best at this", but should instead say **"this model is the best on this task that we hope is a good proxy for this capability, without any guarantee"**.
|
|
|
|
|
|
|
| 60 |
</Note>
|
| 61 |
|
|
|
|
| 62 |
#### When will we finally reach AGI?
|
| 63 |
|
| 64 |
+
We are strongly missing any kind of good definitions and framework on what intelligence is for machine learning models, and how to evaluate it (though some people have tried, for example [Chollet](https://arxiv.org/abs/1911.01547) in 2019 and [Hendrycks et al](https://www.agidefinition.ai/paper.pdf) this year). Difficulty in defining intelligence is not a problem specific to machine learning! In human and animal studies, it is also quite hard to define, and metrics which try to provide precise scores (IQ and EQ for example) are hotly debated and controversial, with reason.
|
|
|
|
|
|
|
| 65 |
|
| 66 |
+
There are, however, some issues with focusing on intelligence as a target. 1) Intelligence tends to end up being a moving target, as any time we reach a capability which was thought to be human specific, we redefine the term. 2) Our current frameworks are made with the human (or animal) in mind, and will most likely not transfer well to models, as the underlying behaviors and assumptions are not the same. 3) It is kind of a useless target too - we should target making models good at specific, well defined, purposeful and useful tasks (think accounting, reporting, etc) instead of aiming for AGI for the sake of it.
|
|
|
|
|
|
|
|
|
|
|
|
|
| 67 |
|
| 68 |
### So how do people evaluate models, then?
|
| 69 |
|
|
|
|
| 74 |
Automated benchmarking usually works the following way: you'd like to know how well your model performs on something. This something can be a well-defined concrete **task**, such as *How well can my model classify spam from non spam emails?*, or a more abstract and general **capability**, such as *How good is my model at math?*.
|
| 75 |
|
| 76 |
From this, you construct an evaluation, usually made of two things:
|
| 77 |
+
- a collection of *samples*, given as input to the model to see what comes out as output, sometimes coupled with a reference (called gold) to compare with. Samples are usually designed to try to emulate what you want to test the model on: for example, if you are looking at toxicity classification, you create a dataset of toxic and non toxic sentences, try to include some hard edge cases, etc.
|
| 78 |
+
- a *metric*, which is a way to compute a score for the model. For example, how accurately can your model classify toxicity (score of well classified sample = 1, badly classified = 0).
|
| 79 |
|
| 80 |
This is more interesting to do on data that was not included in the model training set, because you want to test if it **generalizes** well. You don't want a model which can only classify emails it has already "seen", that would not be very useful!
|
| 81 |
|
| 82 |
<Note>
|
| 83 |
+
A model which can only predict well on its training data (and has not latently learnt more high-level general patterns) is said to be **overfitting**. In less extreme cases, you still want to test if your model is able to generalize to data patterns which were not in the training set's distribution (for example, classify toxicity on stack overflow after having seen only toxicity on reddit).
|
| 84 |
</Note>
|
| 85 |
|
| 86 |
+
This works quite well for very well-defined tasks, where performance is "easy" to assess and measure: when you are literally testing your model on classification, you can say "the model classified correctly n% of these samples". For LLMs benchmarks, some issues can arise, such as models [favoring specific choices based on the order in which they have been presented for multi-choice evaluations](https://arxiv.org/abs/2309.03882), and generative evaluations relying on normalisations which can easily [be unfair if not designed well](https://huggingface.co/blog/open-llm-leaderboard-drop), but overall they still provide signal at the task level.
|
| 87 |
|
| 88 |
+
For **capabilities** however, it's hard to decompose them into well-defined and precise tasks: what does "good at math" mean? good at arithmetic? at logic? able to reason on mathematical concepts?
|
| 89 |
|
| 90 |
+
In this case, people tend to do more "holistic" evaluations, by not decomposing the capability in actual tasks, but assuming that performance on general samples will be a **good proxy** for what we aim to measure. For example, GSM8K is made of actual high school math problems, which require a whole set of capabilities to solve. It also means that both failure and success are very hard to interpret. Some capabilities or topics, such as "is this model good at writing poetry?" or "are the model outputs helpful?" are even harder to evaluate with automatic metrics - and at the same time, models now seem to have more and more **generalist** capabilities, so we need to evaluate their abilities in a broader manner.
|
| 91 |
|
| 92 |
+
<Sidenote> For example, there was a debate in the scientific community as to whether LLMs [can draw](https://arxiv.org/abs/2303.12712) unicorns [or not](https://twitter.com/DimitrisPapail/status/1719119242186871275). A year later, seems like most can! </Sidenote>
|
| 93 |
+
|
| 94 |
+
Automatic benchmarks also tend to have another problem: once they are published publicly in plain text, they are very likely to end up (often accidentally) in the training datasets of models. Some benchmarks creators, like the authors of BigBench, have tried to mitigate this by adding a *canary string* (a very specific combination of characters) for people to look for, and remove from training sets, but not everybody is aware of the mechanism nor trying to do this removal. There is also a non negligible quantity of benchmarks, so looking for accidental copies of absolutely all of them in data is costly. Other options include providing benchmarks in an [**encrypted** form](https://arxiv.org/pdf/2309.16575), or behind a [**gating** system](https://huggingface.co/datasets/Idavidrein/gpqa). However, when evaluating closed models (that are behind APIs), there is no guarantee that the prompts you give won’t be later used internally for training or fine-tuning.
|
| 95 |
|
| 96 |
<Note>
|
| 97 |
The case were an evaluation dataset ends up in the training set is called **contamination**, and a model which was contaminated will have a high benchmark performance that does not generalize well to the underlying task (an extensive description of contamination can be found [here](https://aclanthology.org/2023.findings-emnlp.722/), and here is a fun way to [detect it](https://arxiv.org/abs/2311.06233)). A way to address contamination is to run [**dynamic benchmarks**](https://arxiv.org/abs/2104.14337) (evaluations on datasets which are regularly refreshed to provide scores on systematically unseen new data), but this approach is costly in the long term.
|
|
|
|
| 105 |
|
| 106 |
Different approaches exist to evaluate models with humans in the loop.
|
| 107 |
|
| 108 |
+
**Vibes-checks** is the name given to manual evaluations done individually by some members of the community, usually on undisclosed prompts, to get an overall "feeling" of how well models perform on many use cases, which range from coding to quality of smut written. (I've also seen the term "canary-testing" used for this, in reference to high signal canary in a coalmine approach). Often shared on Twitter and Reddit, they mostly constitute anecdotal evidence, and tend to be highly sensitive to confirmation bias (in other words, people tend to find what they look for).
|
| 109 |
|
| 110 |
Using community feedback to establish massive model rankings is what we call an **arena**. A well known example of this is the [LMSYS chatbot arena](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard), where community users are asked to chat with models until they find one is better than the other. Votes are then aggregated in an Elo ranking (a ranking of matches) to select which model is "the best". The obvious problem of such an approach is the high subjectivity - it's hard to enforce a consistent grading from many community members using broad guidelines, especially since annotators preferences tend to be [culturally bound](https://arxiv.org/abs/2404.16019v1) (with different people favoring different discussion topics, for example). One can hope that this effect is smoothed over by the sheer scale of the votes, through a "wisdom of the crowd" effect (this effect was found by a statistician named Galton, who observed that individual answers trying to estimate a numerical value, like the weight of a hog, could be modeled as a probability distribution centered around the actual answer).
|
| 111 |
|
| 112 |
The last approach is **systematic annotations**, where you provide extremely specific guidelines to paid selected annotators, in order to remove as much as the subjectivity bias as possible (this is the approach used by most data annotation companies). However, it can get extremely expensive fast, as you have to keep on doing evaluations in a continuous and non automatic manner for every new model you want to evaluate, and it can still fall prey to human bias (this [study](https://arxiv.org/abs/2205.00501) showed that people with different identities tend to rate model answer toxicity very differently).
|
| 113 |
|
| 114 |
+
However, humans can be biased: for example, they tend to estimate the quality of answers [based on first impressions](https://arxiv.org/pdf/2309.16349), instead of actual factuality or faithfulness, and are very sensitive to tone, and underestimate the number of factual or logical errors in an assertive answer. These are only some of the many biases that human judges can fall prey to (as we'll see below). They are not unexpected, but they must be taken into account: not all use cases should rely on using human annotators, especially crowdsourced, unexpert ones - any task requiring factuality (such as code writing, evaluation of model knowledge, etc) should include another, more robust, type of evaluation to complete the benchmark.
|
|
|
|
|
|
|
| 115 |
|
| 116 |
#### Model as a judge
|
| 117 |
|
| 118 |
To mitigate the cost of human annotators, some people have looked into using models or derived artifacts (preferably aligned with human preferences) to evaluate models' outputs. This approach is not new, as you can find techniques to measure summarization quality from [model embeddings](https://arxiv.org/abs/1904.09675) in 2019.
|
| 119 |
|
| 120 |
+
Two approach exist for grading: using [generalist, high capability models](https://arxiv.org/abs/2306.05685v4) or using [small specialist models](https://arxiv.org/pdf/2405.01535) trained specifically to discriminate from preference data.
|
| 121 |
|
| 122 |
+
Model as judges have several strong limitations, because they are as biased as humans but along different axes (they can't [provide consistent score ranges](https://twitter.com/aparnadhinak/status/1748368364395721128), are actually not that consistent [with human rankings](https://arxiv.org/pdf/2308.15812), etc, as we'll see below).
|
| 123 |
|
| 124 |
My main personal gripe with using models as judges is that they introduce very subtle and un-interpretable bias in the answer selection. I feel that, much like when crossbreeding too much in genetics studies, you end up with dysfunctional animals or plants, by using LLMs to select and train LLMs, we are just as likely to introduce minute changes that will have bigger repercussions a couple generations down the line. I believe this type of bias is less likely to occur in smaller and more specialized models as judges (such as toxicity classifiers), but this remains to be rigorously tested and proven.
|
app/src/content/chapters/troubleshooting/troubleshooting-inference.mdx
DELETED
|
@@ -1,81 +0,0 @@
|
|
| 1 |
-
---
|
| 2 |
-
title: "Troubleshooting inference"
|
| 3 |
-
---
|
| 4 |
-
|
| 5 |
-
import Note from "../../../components/Note.astro";
|
| 6 |
-
import Sidenote from "../../../components/Sidenote.astro";
|
| 7 |
-
|
| 8 |
-
## Troubleshooting inference
|
| 9 |
-
|
| 10 |
-
### My results are very bad
|
| 11 |
-
|
| 12 |
-
The first thing to do is always to inspect your model generations in detail.
|
| 13 |
-
|
| 14 |
-
Some frequent problems you should look for when troubleshooting are:
|
| 15 |
-
- Is your model output parsing too strict before computing the metric? It can lead to the answer being lost (obvious fix is to make it less strict, but you'll get more false positives!)
|
| 16 |
-
- Is your model struggling to follow your output format in few shot? This frequently happens in recent models trained on too specific evaluation formats, and you can either adapt your prompt format, or just state that models should be able to follow it and that the ones struggling are not good enough for the task you are considering.
|
| 17 |
-
- Is your model exceedingly verbose? In this case, it likely never gets to the correct answer - this is more frequent in long context models (we observed it with Qwen and Command R models in 2024) and reasoning models, especially if the tasks stops generation too soon. You can either increase the allowed context length, add instructions to be concise in the task prompt, or just assume that models should be able to answer succinctly.
|
| 18 |
-
|
| 19 |
-
### My model is very slow!
|
| 20 |
-
➡️ Changing the batch size
|
| 21 |
-
|
| 22 |
-
If you want absolute reproducibility (given a specific hardware and a specific evaluation prompt), you're probably using a batch size of one. However, moving to higher batch sizes will likely make your evaluation faster (given that it fits within the memory requirements of your hardware)
|
| 23 |
-
|
| 24 |
-
➡️ Data parallelism
|
| 25 |
-
|
| 26 |
-
You can also duplicate your model on several GPUs instead of loading it on one single GPU, and provide subsets of the data to each GPU copy, then aggregate the computation results.
|
| 27 |
-
This means that each data stream will be handled in parallel, at the same time as the others, which divides your total execution time by the number of GPUs.
|
| 28 |
-
However, if you can, all GPUs should be on a single node to avoid inter-node bottlenecks.
|
| 29 |
-
|
| 30 |
-
➡️ Changing the inference code
|
| 31 |
-
|
| 32 |
-
Not all inference libraries run at the same speed, and some code is more optimized than other. You'll need to experiment a bit to find which libraries have the fastest inference, and if you are using pytorch, I recommend looking at the model inference optimization checklist [here](https://pytorch.org/serve/performance_checklist.html).
|
| 33 |
-
|
| 34 |
-
➡️ Changing the precision
|
| 35 |
-
|
| 36 |
-
If your model is very slow, you can reduce its size by reducing the precision of the computations. A model stored in float32 does very precise computations (using 32bits per number stored!) that are also very memory and compute heavy - moving to `blfoat16` or `float16` (half the precision) should make the model twice as fast at a loss of precision which should almost not matter. If you want bumps in speed, you can quantize it even more, to 8 or 4 bits (using `gptq` or `bitsandbytes` for example), as n-bit matrix computations should be faster and your model will take even less space in memory (however, some quantization libraries might be a bit slow, so test things out for your use cases!).
|
| 37 |
-
|
| 38 |
-
### My model is very big!
|
| 39 |
-
You can estimate the minimal theoretical memory required to load a given model (and therefore hardware) with the **following formula**:
|
| 40 |
-
|
| 41 |
-
`<memory (in GB)> = <number of parameters (in G)> * <precision factor>`
|
| 42 |
-
|
| 43 |
-
Since you can store 8 bits in a Byte, the memory required is the total number of parameters times the number of Bytes required to store one parameter. The precision factor is therefore 4 for `float32`, 2 for `float16` or `bfoat16`, 1 for `8bit`, and 0.5 for `4bit` models, etc.
|
| 44 |
-
|
| 45 |
-
And that's it!
|
| 46 |
-
|
| 47 |
-
I would actually recommend using `<memory (in GB)> = <number of parameters (in G)> * (<precision factor> * 110%)`, to be on the safer side, as inference will require a bit more memory than just loading the model (you'll also need to load the batches).
|
| 48 |
-
|
| 49 |
-
### My model does not fit on a GPU
|
| 50 |
-
➡️ Quantization
|
| 51 |
-
|
| 52 |
-
The first obvious thing is to play on the `<precision factor>` above: going from float32 to 4 bits reduces memory requirements by 8!
|
| 53 |
-
However, using too low a precision can give worse results, so for some models (especially medium range), you might want to stay in float16 or 8bit. (Quantization seems to affect very big models performance less, possibly because of information redundancy).
|
| 54 |
-
|
| 55 |
-
➡️ Model parallelism
|
| 56 |
-
|
| 57 |
-
Model parallelism includes a range of techniques which cut your model in smaller sub-model pieces, to load and run each of these smaller pieces on a single different GPU. This requires less memory since you never load the full model at once, but can be slower.
|
| 58 |
-
|
| 59 |
-
<Note title="Model parallelism strategies" emoji="🔀" variant="info">
|
| 60 |
-
|
| 61 |
-
The 2 main types of model parallelism are
|
| 62 |
-
- **Pipeline parallelism**, where the model is split at the whole layer level, and the layers are dispatched on different GPUs. Since layer 1's output is layer 2's input, this leads to a slower execution, as GPUs will be idle while waiting, which is called a "bubble" (and data must be transferred from one GPU to the next). The bubble can be reduced by splitting the inputs into smaller batches. It's being natively added to PyTorch with the `PiPPy` [lib](https://github.com/pytorch/PiPPy), and this is what `accelerate` uses under the hood for parallelism.
|
| 63 |
-
- **Tensor parallelism**, where the model is split at the matrix computation level. This means that the matrices will be split on rows or columns, and the total result aggregated. This is incredibly efficient as long as all GPUs are on the same node (to avoid inter node network bottlenecks), but can be hard to code. You'll find cool implementations of this in the `vllm` lib. It provides **insane speedups**.
|
| 64 |
-
</Note>
|
| 65 |
-
|
| 66 |
-
The best document on the different kinds of parallelism (including data parallelism, for speedups) is [here](https://huggingface.co/docs/transformers/v4.15.0/en/parallelism).
|
| 67 |
-
|
| 68 |
-
➡️ CPU offloading
|
| 69 |
-
|
| 70 |
-
CPU offloading moves some of the computations and models parts to CPU, in order to reduce GPU memory usage. It's **considerably slower** than any other method here, mostly because you need to move data from one device to another all the time.
|
| 71 |
-
|
| 72 |
-
An example of this is [ZeRO-Offload](https://arxiv.org/abs/2101.06840) by Deepspeed, which distributes parameters between CPU and GPU (on top of using other optimization described in the ZeRO-2 paper). On CPU are passed gradients, optimizer states and fp32 model parameter computations during optimisation, whereas on GPU, you'll find fp16 parameters and forward/backward pass, to leverage CPU memory used and GPU computations while minimizing communication between both.
|
| 73 |
-
|
| 74 |
-
➡️ My model fits on a GPU but I still get OOMs!
|
| 75 |
-
|
| 76 |
-
You likely have a problem with your context size, then.
|
| 77 |
-
|
| 78 |
-
I recommend:
|
| 79 |
-
1) testing if your model truly does fit on a GPU with some dummy inference data loaded. This dummy inference data should have a big enough context size (representative of your task)
|
| 80 |
-
2) lowering the batch size, or removing the auto-batch size search which could lead to an accidental OOM error, if you have this enabled
|
| 81 |
-
3) more generally, making sure that samples are presented to your model in inverse context size order, to be sure that your model will fail directly if the context size is too big, and not after having run for X hours.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
app/src/content/chapters/troubleshooting/troubleshooting-reproducibility.mdx
CHANGED
|
@@ -17,8 +17,6 @@ Usually, this means either using the evaluation default code as provided by the
|
|
| 17 |
|
| 18 |
If you want to easily understand what kind of discrepancies happen when using different implementations, you can explore [this blog](https://huggingface.co/blog/open-llm-leaderboard-mmlu) (⭐) we wrote with the eval team at HuggingFace. It studies the differences we observed between 3 common implementations of the MMLU evaluation (in `lm_eval`, `helm`, and in the original author implementation), and how they change model scores.
|
| 19 |
|
| 20 |
-
*Note: This is precisely for this reason that a Hugging Face team decided to launch the [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), to get unified and homogeneous comparisons of models scores in order to compare them to internal experiments.*
|
| 21 |
-
|
| 22 |
#### Subtle implementation or loading difference
|
| 23 |
We've observed that the following were easy things to mess up, even when using the same code base:
|
| 24 |
- **Different random seeds.**
|
|
|
|
| 17 |
|
| 18 |
If you want to easily understand what kind of discrepancies happen when using different implementations, you can explore [this blog](https://huggingface.co/blog/open-llm-leaderboard-mmlu) (⭐) we wrote with the eval team at HuggingFace. It studies the differences we observed between 3 common implementations of the MMLU evaluation (in `lm_eval`, `helm`, and in the original author implementation), and how they change model scores.
|
| 19 |
|
|
|
|
|
|
|
| 20 |
#### Subtle implementation or loading difference
|
| 21 |
We've observed that the following were easy things to mess up, even when using the same code base:
|
| 22 |
- **Different random seeds.**
|
app/src/content/embeds/d3-tokenization-timeline.html
ADDED
|
@@ -0,0 +1,266 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
<div class="d3-prompt-evolution">
|
| 2 |
+
<svg viewBox="0 0 900 500" xmlns="http://www.w3.org/2000/svg">
|
| 3 |
+
<defs>
|
| 4 |
+
<marker id="arrowhead-prompt" markerWidth="10" markerHeight="10" refX="9" refY="3" orient="auto">
|
| 5 |
+
<polygon points="0 0, 10 3, 0 6" fill="currentColor" />
|
| 6 |
+
</marker>
|
| 7 |
+
</defs>
|
| 8 |
+
|
| 9 |
+
<!-- Stage 1: Raw Text -->
|
| 10 |
+
<g class="stage">
|
| 11 |
+
<rect x="30" y="40" width="240" height="140" rx="8" class="stage-box stage-1"/>
|
| 12 |
+
<text x="150" y="65" text-anchor="middle" class="stage-title">Raw Text</text>
|
| 13 |
+
<text x="150" y="85" text-anchor="middle" class="stage-subtitle">Early Models (before 2022)</text>
|
| 14 |
+
|
| 15 |
+
<foreignObject x="45" y="95" width="210" height="80">
|
| 16 |
+
<div xmlns="http://www.w3.org/1999/xhtml" class="code-example">
|
| 17 |
+
<div class="code-line">Translate to French:</div>
|
| 18 |
+
<div class="code-line">Hello world</div>
|
| 19 |
+
</div>
|
| 20 |
+
</foreignObject>
|
| 21 |
+
</g>
|
| 22 |
+
|
| 23 |
+
<!-- Arrow 1 -->
|
| 24 |
+
<path d="M 270 110 L 320 110" class="arrow" marker-end="url(#arrowhead-prompt)"/>
|
| 25 |
+
<text x="295" y="100" text-anchor="middle" class="arrow-label">Evolution</text>
|
| 26 |
+
|
| 27 |
+
<!-- Stage 2: JSON + Chat Templates -->
|
| 28 |
+
<g class="stage">
|
| 29 |
+
<rect x="320" y="20" width="260" height="180" rx="8" class="stage-box stage-2"/>
|
| 30 |
+
<text x="450" y="45" text-anchor="middle" class="stage-title">Chat Templates (in JSON)</text>
|
| 31 |
+
<text x="450" y="65" text-anchor="middle" class="stage-subtitle">Chat Models (2022-2025)</text>
|
| 32 |
+
|
| 33 |
+
<foreignObject x="335" y="75" width="230" height="120">
|
| 34 |
+
<div xmlns="http://www.w3.org/1999/xhtml" class="code-example">
|
| 35 |
+
<div class="code-line json-brace">{</div>
|
| 36 |
+
<div class="code-line indent">"role": "system",</div>
|
| 37 |
+
<div class="code-line indent">"content": "You are..."</div>
|
| 38 |
+
<div class="code-line json-brace">},</div>
|
| 39 |
+
<div class="code-line json-brace">{</div>
|
| 40 |
+
<div class="code-line indent">"role": "user",</div>
|
| 41 |
+
<div class="code-line indent">"content": "Hello"</div>
|
| 42 |
+
<div class="code-line json-brace">}</div>
|
| 43 |
+
</div>
|
| 44 |
+
</foreignObject>
|
| 45 |
+
</g>
|
| 46 |
+
|
| 47 |
+
<!-- Arrow 2 -->
|
| 48 |
+
<path d="M 580 110 L 630 110" class="arrow" marker-end="url(#arrowhead-prompt)"/>
|
| 49 |
+
<text x="605" y="100" text-anchor="middle" class="arrow-label">Evolution</text>
|
| 50 |
+
|
| 51 |
+
<!-- Stage 3: JSON + XML (Reasoning) -->
|
| 52 |
+
<g class="stage">
|
| 53 |
+
<rect x="630" y="10" width="240" height="200" rx="8" class="stage-box stage-3"/>
|
| 54 |
+
<text x="750" y="35" text-anchor="middle" class="stage-title">JSON + XML</text>
|
| 55 |
+
<text x="750" y="55" text-anchor="middle" class="stage-subtitle">Reasoning Models (2025+)</text>
|
| 56 |
+
|
| 57 |
+
<foreignObject x="645" y="65" width="210" height="140">
|
| 58 |
+
<div xmlns="http://www.w3.org/1999/xhtml" class="code-example">
|
| 59 |
+
<div class="code-line json-brace">{</div>
|
| 60 |
+
<div class="code-line indent">"role": "assistant",</div>
|
| 61 |
+
<div class="code-line indent">"content": [</div>
|
| 62 |
+
<div class="code-line indent2 xml-tag"><thinking></div>
|
| 63 |
+
<div class="code-line indent2">reasoning...</div>
|
| 64 |
+
<div class="code-line indent2 xml-tag"></thinking></div>
|
| 65 |
+
<div class="code-line indent2 xml-tag"><output></div>
|
| 66 |
+
<div class="code-line indent2">response</div>
|
| 67 |
+
<div class="code-line indent2 xml-tag"></output></div>
|
| 68 |
+
<div class="code-line indent">]</div>
|
| 69 |
+
<div class="code-line json-brace">}</div>
|
| 70 |
+
</div>
|
| 71 |
+
</foreignObject>
|
| 72 |
+
</g>
|
| 73 |
+
|
| 74 |
+
<!-- Key Features Labels -->
|
| 75 |
+
<g class="features">
|
| 76 |
+
<text x="150" y="200" text-anchor="middle" class="feature-label">• Simple prompts</text>
|
| 77 |
+
<text x="150" y="220" text-anchor="middle" class="feature-label">• Generally no structure</text>
|
| 78 |
+
<text x="150" y="240" text-anchor="middle" class="feature-label">• Completion-based</text>
|
| 79 |
+
</g>
|
| 80 |
+
|
| 81 |
+
<g class="features">
|
| 82 |
+
<text x="450" y="220" text-anchor="middle" class="feature-label">• Role separation</text>
|
| 83 |
+
<text x="450" y="240" text-anchor="middle" class="feature-label">• Chat/Turn-based</text>
|
| 84 |
+
</g>
|
| 85 |
+
|
| 86 |
+
<g class="features">
|
| 87 |
+
<text x="750" y="230" text-anchor="middle" class="feature-label">• Chat/Turn-based</text>
|
| 88 |
+
<text x="750" y="250" text-anchor="middle" class="feature-label">with added tags for control</text>
|
| 89 |
+
</g>
|
| 90 |
+
|
| 91 |
+
<!-- Timeline -->
|
| 92 |
+
<line x1="50" y1="320" x2="850" y2="320" class="timeline"/>
|
| 93 |
+
<circle cx="150" cy="320" r="6" class="timeline-dot"/>
|
| 94 |
+
<circle cx="450" cy="320" r="6" class="timeline-dot"/>
|
| 95 |
+
<circle cx="750" cy="320" r="6" class="timeline-dot"/>
|
| 96 |
+
|
| 97 |
+
<text x="150" y="345" text-anchor="middle" class="timeline-label">Before 2022</text>
|
| 98 |
+
<text x="450" y="345" text-anchor="middle" class="timeline-label">2022-2025</text>
|
| 99 |
+
<text x="750" y="345" text-anchor="middle" class="timeline-label">2025+</text>
|
| 100 |
+
</svg>
|
| 101 |
+
</div>
|
| 102 |
+
<style>
|
| 103 |
+
.d3-prompt-evolution {
|
| 104 |
+
position: relative;
|
| 105 |
+
width: 100%;
|
| 106 |
+
}
|
| 107 |
+
.d3-prompt-evolution svg {
|
| 108 |
+
display: block;
|
| 109 |
+
width: 100%;
|
| 110 |
+
height: auto;
|
| 111 |
+
}
|
| 112 |
+
|
| 113 |
+
/* Stage boxes */
|
| 114 |
+
.d3-prompt-evolution .stage-box {
|
| 115 |
+
stroke-width: 2;
|
| 116 |
+
}
|
| 117 |
+
.d3-prompt-evolution .stage-1 {
|
| 118 |
+
fill: #e3f2fd;
|
| 119 |
+
stroke: #1976d2;
|
| 120 |
+
}
|
| 121 |
+
.d3-prompt-evolution .stage-2 {
|
| 122 |
+
fill: #f3e5f5;
|
| 123 |
+
stroke: #7b1fa2;
|
| 124 |
+
}
|
| 125 |
+
.d3-prompt-evolution .stage-3 {
|
| 126 |
+
fill: #e8f5e9;
|
| 127 |
+
stroke: #388e3c;
|
| 128 |
+
}
|
| 129 |
+
|
| 130 |
+
[data-theme="dark"] .d3-prompt-evolution .stage-1 {
|
| 131 |
+
fill: rgba(25, 118, 210, 0.15);
|
| 132 |
+
}
|
| 133 |
+
[data-theme="dark"] .d3-prompt-evolution .stage-2 {
|
| 134 |
+
fill: rgba(123, 31, 162, 0.15);
|
| 135 |
+
}
|
| 136 |
+
[data-theme="dark"] .d3-prompt-evolution .stage-3 {
|
| 137 |
+
fill: rgba(56, 142, 60, 0.15);
|
| 138 |
+
}
|
| 139 |
+
|
| 140 |
+
/* Text styles */
|
| 141 |
+
.d3-prompt-evolution .stage-title {
|
| 142 |
+
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
|
| 143 |
+
font-size: 16px;
|
| 144 |
+
font-weight: 700;
|
| 145 |
+
fill: var(--text-color, #333);
|
| 146 |
+
}
|
| 147 |
+
.d3-prompt-evolution .stage-subtitle {
|
| 148 |
+
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
|
| 149 |
+
font-size: 11px;
|
| 150 |
+
fill: var(--muted-color, #666);
|
| 151 |
+
}
|
| 152 |
+
|
| 153 |
+
/* Code examples */
|
| 154 |
+
.d3-prompt-evolution .code-example {
|
| 155 |
+
font-family: 'Monaco', 'Courier New', monospace;
|
| 156 |
+
font-size: 9px;
|
| 157 |
+
line-height: 1.4;
|
| 158 |
+
color: var(--text-color, #333);
|
| 159 |
+
padding: 4px;
|
| 160 |
+
}
|
| 161 |
+
.d3-prompt-evolution .code-line {
|
| 162 |
+
margin: 1px 0;
|
| 163 |
+
}
|
| 164 |
+
.d3-prompt-evolution .indent {
|
| 165 |
+
padding-left: 12px;
|
| 166 |
+
}
|
| 167 |
+
.d3-prompt-evolution .indent2 {
|
| 168 |
+
padding-left: 24px;
|
| 169 |
+
}
|
| 170 |
+
.d3-prompt-evolution .json-brace {
|
| 171 |
+
color: var(--primary-color, #1976d2);
|
| 172 |
+
font-weight: 600;
|
| 173 |
+
}
|
| 174 |
+
.d3-prompt-evolution .xml-tag {
|
| 175 |
+
color: #d32f2f;
|
| 176 |
+
font-weight: 600;
|
| 177 |
+
}
|
| 178 |
+
|
| 179 |
+
/* Arrows */
|
| 180 |
+
.d3-prompt-evolution .arrow {
|
| 181 |
+
fill: none;
|
| 182 |
+
stroke: var(--muted-color, #999);
|
| 183 |
+
stroke-width: 2;
|
| 184 |
+
color: var(--muted-color, #999);
|
| 185 |
+
}
|
| 186 |
+
.d3-prompt-evolution .arrow-label {
|
| 187 |
+
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
|
| 188 |
+
font-size: 10px;
|
| 189 |
+
font-style: italic;
|
| 190 |
+
fill: var(--muted-color, #666);
|
| 191 |
+
}
|
| 192 |
+
|
| 193 |
+
/* Feature labels */
|
| 194 |
+
.d3-prompt-evolution .feature-label {
|
| 195 |
+
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
|
| 196 |
+
font-size: 11px;
|
| 197 |
+
fill: var(--text-color, #555);
|
| 198 |
+
}
|
| 199 |
+
|
| 200 |
+
/* Timeline */
|
| 201 |
+
.d3-prompt-evolution .timeline {
|
| 202 |
+
stroke: var(--border-color, #ddd);
|
| 203 |
+
stroke-width: 2;
|
| 204 |
+
}
|
| 205 |
+
.d3-prompt-evolution .timeline-dot {
|
| 206 |
+
fill: var(--primary-color, #1976d2);
|
| 207 |
+
}
|
| 208 |
+
.d3-prompt-evolution .timeline-label {
|
| 209 |
+
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
|
| 210 |
+
font-size: 12px;
|
| 211 |
+
font-weight: 600;
|
| 212 |
+
fill: var(--text-color, #333);
|
| 213 |
+
}
|
| 214 |
+
.d3-prompt-evolution .model-example {
|
| 215 |
+
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
|
| 216 |
+
font-size: 11px;
|
| 217 |
+
font-style: italic;
|
| 218 |
+
fill: var(--muted-color, #666);
|
| 219 |
+
}
|
| 220 |
+
|
| 221 |
+
/* Benefits section */
|
| 222 |
+
.d3-prompt-evolution .benefits-box {
|
| 223 |
+
fill: var(--surface-bg, #fafafa);
|
| 224 |
+
stroke: var(--border-color, #e0e0e0);
|
| 225 |
+
stroke-width: 1.5;
|
| 226 |
+
}
|
| 227 |
+
.d3-prompt-evolution .benefits-title {
|
| 228 |
+
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
|
| 229 |
+
font-size: 13px;
|
| 230 |
+
font-weight: 700;
|
| 231 |
+
fill: var(--text-color, #333);
|
| 232 |
+
}
|
| 233 |
+
.d3-prompt-evolution .benefit-text {
|
| 234 |
+
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
|
| 235 |
+
font-size: 11px;
|
| 236 |
+
fill: var(--text-color, #555);
|
| 237 |
+
}
|
| 238 |
+
|
| 239 |
+
[data-theme="dark"] .d3-prompt-evolution .benefits-box {
|
| 240 |
+
fill: rgba(255, 255, 255, 0.05);
|
| 241 |
+
}
|
| 242 |
+
</style>
|
| 243 |
+
<script>
|
| 244 |
+
(() => {
|
| 245 |
+
const bootstrap = () => {
|
| 246 |
+
const scriptEl = document.currentScript;
|
| 247 |
+
let container = scriptEl ? scriptEl.previousElementSibling : null;
|
| 248 |
+
if (!(container && container.classList && container.classList.contains('d3-prompt-evolution'))) {
|
| 249 |
+
const candidates = Array.from(document.querySelectorAll('.d3-prompt-evolution'))
|
| 250 |
+
.filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
|
| 251 |
+
container = candidates[candidates.length - 1] || null;
|
| 252 |
+
}
|
| 253 |
+
if (!container) return;
|
| 254 |
+
if (container.dataset) {
|
| 255 |
+
if (container.dataset.mounted === 'true') return;
|
| 256 |
+
container.dataset.mounted = 'true';
|
| 257 |
+
}
|
| 258 |
+
};
|
| 259 |
+
|
| 260 |
+
if (document.readyState === 'loading') {
|
| 261 |
+
document.addEventListener('DOMContentLoaded', bootstrap, { once: true });
|
| 262 |
+
} else {
|
| 263 |
+
bootstrap();
|
| 264 |
+
}
|
| 265 |
+
})();
|
| 266 |
+
</script>
|
app/src/content/embeds/d3-tokenization.html
ADDED
|
@@ -0,0 +1,168 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
<div class="d3-tokenization">
|
| 2 |
+
<svg viewBox="0 0 800 400" xmlns="http://www.w3.org/2000/svg">
|
| 3 |
+
<defs>
|
| 4 |
+
<marker id="arrowhead-tok" markerWidth="10" markerHeight="10" refX="9" refY="3" orient="auto">
|
| 5 |
+
<polygon points="0 0, 10 3, 0 6" fill="currentColor" />
|
| 6 |
+
</marker>
|
| 7 |
+
</defs>
|
| 8 |
+
|
| 9 |
+
<!-- Input Text -->
|
| 10 |
+
<rect x="50" y="50" width="200" height="80" rx="5" class="box"/>
|
| 11 |
+
<text x="150" y="75" text-anchor="middle" class="text title">Input Text</text>
|
| 12 |
+
<text x="150" y="100" text-anchor="middle" class="text label">"Hello, world!"</text>
|
| 13 |
+
|
| 14 |
+
<!-- Arrow 1 -->
|
| 15 |
+
<path d="M 250 90 L 290 90" class="arrow" marker-end="url(#arrowhead-tok)"/>
|
| 16 |
+
|
| 17 |
+
<!-- Tokenizer -->
|
| 18 |
+
<rect x="290" y="60" width="120" height="60" rx="5" class="process"/>
|
| 19 |
+
<text x="350" y="85" text-anchor="middle" class="text title">Tokenizer</text>
|
| 20 |
+
<text x="350" y="105" text-anchor="middle" class="text label" font-size="10">Split into tokens</text>
|
| 21 |
+
|
| 22 |
+
<!-- Arrow 2 -->
|
| 23 |
+
<path d="M 410 90 L 450 90" class="arrow" marker-end="url(#arrowhead-tok)"/>
|
| 24 |
+
|
| 25 |
+
<!-- Tokens -->
|
| 26 |
+
<rect x="450" y="30" width="280" height="120" rx="5" class="box"/>
|
| 27 |
+
<text x="590" y="55" text-anchor="middle" class="text title">Tokens</text>
|
| 28 |
+
|
| 29 |
+
<!-- Token boxes -->
|
| 30 |
+
<rect x="470" y="70" width="60" height="30" rx="3" class="token-box"/>
|
| 31 |
+
<text x="500" y="90" text-anchor="middle" class="text token">Hello</text>
|
| 32 |
+
|
| 33 |
+
<rect x="540" y="70" width="40" height="30" rx="3" class="token-box"/>
|
| 34 |
+
<text x="560" y="90" text-anchor="middle" class="text token">,</text>
|
| 35 |
+
|
| 36 |
+
<rect x="590" y="70" width="60" height="30" rx="3" class="token-box"/>
|
| 37 |
+
<text x="620" y="90" text-anchor="middle" class="text token">world</text>
|
| 38 |
+
|
| 39 |
+
<rect x="660" y="70" width="40" height="30" rx="3" class="token-box"/>
|
| 40 |
+
<text x="680" y="90" text-anchor="middle" class="text token">!</text>
|
| 41 |
+
|
| 42 |
+
<!-- Token IDs -->
|
| 43 |
+
<text x="500" y="125" text-anchor="middle" class="text token-id">[5425]</text>
|
| 44 |
+
<text x="560" y="125" text-anchor="middle" class="text token-id">[11]</text>
|
| 45 |
+
<text x="620" y="125" text-anchor="middle" class="text token-id">[1917]</text>
|
| 46 |
+
<text x="680" y="125" text-anchor="middle" class="text token-id">[0]</text>
|
| 47 |
+
|
| 48 |
+
<!-- Arrow 3 -->
|
| 49 |
+
<path d="M 590 150 L 590 190" class="arrow" marker-end="url(#arrowhead-tok)"/>
|
| 50 |
+
|
| 51 |
+
<!-- Model -->
|
| 52 |
+
<rect x="480" y="190" width="220" height="100" rx="5" class="model"/>
|
| 53 |
+
<text x="590" y="215" text-anchor="middle" class="text title">Language Model</text>
|
| 54 |
+
|
| 55 |
+
<!-- Model internal representation -->
|
| 56 |
+
<g transform="translate(520, 230)">
|
| 57 |
+
<circle cx="20" cy="15" r="8" class="node-circle"/>
|
| 58 |
+
<circle cx="50" cy="15" r="8" class="node-circle"/>
|
| 59 |
+
<circle cx="80" cy="15" r="8" class="node-circle"/>
|
| 60 |
+
<circle cx="110" cy="15" r="8" class="node-circle"/>
|
| 61 |
+
<circle cx="140" cy="15" r="8" class="node-circle"/>
|
| 62 |
+
</g>
|
| 63 |
+
<text x="590" y="275" text-anchor="middle" class="text label" font-size="10">Process & Generate</text>
|
| 64 |
+
|
| 65 |
+
<!-- Arrow 4 -->
|
| 66 |
+
<path d="M 590 290 L 590 330" class="arrow" marker-end="url(#arrowhead-tok)"/>
|
| 67 |
+
|
| 68 |
+
<!-- Output -->
|
| 69 |
+
<rect x="490" y="330" width="200" height="50" rx="5" class="box"/>
|
| 70 |
+
<text x="590" y="360" text-anchor="middle" class="text label">Output / Prediction</text>
|
| 71 |
+
</svg>
|
| 72 |
+
</div>
|
| 73 |
+
<style>
|
| 74 |
+
.d3-tokenization {
|
| 75 |
+
position: relative;
|
| 76 |
+
width: 100%;
|
| 77 |
+
}
|
| 78 |
+
.d3-tokenization svg {
|
| 79 |
+
display: block;
|
| 80 |
+
width: 100%;
|
| 81 |
+
height: auto;
|
| 82 |
+
}
|
| 83 |
+
.d3-tokenization .box {
|
| 84 |
+
fill: var(--surface-bg, #f0f4ff);
|
| 85 |
+
stroke: var(--primary-color, #4169e1);
|
| 86 |
+
stroke-width: 2;
|
| 87 |
+
}
|
| 88 |
+
.d3-tokenization .process {
|
| 89 |
+
fill: #fff8e1;
|
| 90 |
+
stroke: #ff9800;
|
| 91 |
+
stroke-width: 2;
|
| 92 |
+
}
|
| 93 |
+
.d3-tokenization .model {
|
| 94 |
+
fill: #e8f5e9;
|
| 95 |
+
stroke: #4caf50;
|
| 96 |
+
stroke-width: 2;
|
| 97 |
+
}
|
| 98 |
+
.d3-tokenization .text {
|
| 99 |
+
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
|
| 100 |
+
fill: var(--text-color, #333);
|
| 101 |
+
}
|
| 102 |
+
.d3-tokenization .title {
|
| 103 |
+
font-size: 14px;
|
| 104 |
+
font-weight: 600;
|
| 105 |
+
}
|
| 106 |
+
.d3-tokenization .label {
|
| 107 |
+
font-size: 12px;
|
| 108 |
+
}
|
| 109 |
+
.d3-tokenization .token {
|
| 110 |
+
font-size: 11px;
|
| 111 |
+
font-family: 'Monaco', 'Courier New', monospace;
|
| 112 |
+
}
|
| 113 |
+
.d3-tokenization .token-id {
|
| 114 |
+
font-size: 9px;
|
| 115 |
+
fill: var(--muted-color, #666);
|
| 116 |
+
}
|
| 117 |
+
.d3-tokenization .arrow {
|
| 118 |
+
fill: none;
|
| 119 |
+
stroke: var(--muted-color, #666);
|
| 120 |
+
stroke-width: 2;
|
| 121 |
+
color: var(--muted-color, #666);
|
| 122 |
+
}
|
| 123 |
+
.d3-tokenization .token-box {
|
| 124 |
+
fill: white;
|
| 125 |
+
stroke: var(--primary-color, #4169e1);
|
| 126 |
+
stroke-width: 1.5;
|
| 127 |
+
}
|
| 128 |
+
.d3-tokenization .node-circle {
|
| 129 |
+
fill: #81c784;
|
| 130 |
+
opacity: 0.7;
|
| 131 |
+
}
|
| 132 |
+
[data-theme="dark"] .d3-tokenization .box {
|
| 133 |
+
fill: rgba(65, 105, 225, 0.1);
|
| 134 |
+
}
|
| 135 |
+
[data-theme="dark"] .d3-tokenization .token-box {
|
| 136 |
+
fill: var(--surface-bg, #1a1a1a);
|
| 137 |
+
}
|
| 138 |
+
[data-theme="dark"] .d3-tokenization .process {
|
| 139 |
+
fill: rgba(255, 152, 0, 0.15);
|
| 140 |
+
}
|
| 141 |
+
[data-theme="dark"] .d3-tokenization .model {
|
| 142 |
+
fill: rgba(76, 175, 80, 0.15);
|
| 143 |
+
}
|
| 144 |
+
</style>
|
| 145 |
+
<script>
|
| 146 |
+
(() => {
|
| 147 |
+
const bootstrap = () => {
|
| 148 |
+
const scriptEl = document.currentScript;
|
| 149 |
+
let container = scriptEl ? scriptEl.previousElementSibling : null;
|
| 150 |
+
if (!(container && container.classList && container.classList.contains('d3-tokenization'))) {
|
| 151 |
+
const candidates = Array.from(document.querySelectorAll('.d3-tokenization'))
|
| 152 |
+
.filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
|
| 153 |
+
container = candidates[candidates.length - 1] || null;
|
| 154 |
+
}
|
| 155 |
+
if (!container) return;
|
| 156 |
+
if (container.dataset) {
|
| 157 |
+
if (container.dataset.mounted === 'true') return;
|
| 158 |
+
container.dataset.mounted = 'true';
|
| 159 |
+
}
|
| 160 |
+
};
|
| 161 |
+
|
| 162 |
+
if (document.readyState === 'loading') {
|
| 163 |
+
document.addEventListener('DOMContentLoaded', bootstrap, { once: true });
|
| 164 |
+
} else {
|
| 165 |
+
bootstrap();
|
| 166 |
+
}
|
| 167 |
+
})();
|
| 168 |
+
</script>
|