Clémentine committed · Commit 427b813 · Parent(s): 6b1cc14
fixed text

app/src/content/assets/image/env.png
ADDED (Git LFS)

app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx
CHANGED
@@ -5,7 +5,9 @@ title: "Designing your automatic evaluation"
 import Note from "../../../components/Note.astro";
 import Sidenote from "../../../components/Sidenote.astro";
 import HtmlEmbed from "../../../components/HtmlEmbed.astro";
+import Image from "../../../components/Image.astro";
 import UsingHumanAnnotators from "../human-evaluation/using-human-annotators.mdx";
+import envImage from '../../assets/image/env.png';
 
 ### Dataset
 
@@ -109,7 +111,7 @@ If you are looking at **log-probabilities**, your metrics are going to be easy:
 
 If you're looking at **generative** evaluations, this is where it gets trickyyy, so the next chapter is specifically on this!
 
-##
+## Evaluation's main challenge: Scoring free-form text
 
 Scoring free-form text is tricky because there are typically many different ways to express the same correct answer, which makes it hard to determine semantic equivalence through simple string matching, and output variations can make two semantically identical answers look completely different. Responses can be partially correct or contain a mix of accurate and inaccurate information. There can even be no single ground truth for the problem at hand, for example for tasks that require judging coherence, helpfulness, and style, which are inherently subjective and context-dependent.
 
@@ -548,9 +550,9 @@ However, some recent [research](https://arxiv.org/abs/2408.02442) seems to show
 - [Interleaved generation](https://github.com/guidance-ai/guidance?tab=readme-ov-file#guidance-acceleration), another method to constrain generations for some specific output formats
 </Note>
 
-
+## The forgotten children of evaluation
 
-
+### Confidence intervals and statistical validity
 
 When reporting evaluation results, it's critical to include **confidence intervals** alongside point estimates.
 
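As an editorial aside (not part of the diff itself), the confidence-interval recommendation above can be made concrete with a minimal sketch. It assumes you already have per-sample scores (0/1 correctness here; all names and numbers are placeholders) and computes both a standard-error interval and a bootstrap interval; the same resampling also works on scores collected from prompt-variation re-runs.

```python
import random
import statistics

def normal_approx_ci(scores, z=1.96):
    """~95% CI for the mean score from its standard error (normal approximation)."""
    mean = statistics.mean(scores)
    stderr = statistics.stdev(scores) / len(scores) ** 0.5
    return mean - z * stderr, mean + z * stderr

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """~95% CI by resampling the per-sample scores with replacement."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Placeholder per-sample scores (1.0 = correct, 0.0 = incorrect) for 200 questions.
rng = random.Random(1)
scores = [1.0 if rng.random() < 0.7 else 0.0 for _ in range(200)]
print(normal_approx_ci(scores))  # lower and upper bound on the mean score
print(bootstrap_ci(scores))      # similar bounds, without the normality assumption
```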
@@ -558,5 +560,19 @@ These confidence intervals from the raw scores can be obtained from standard dev
 
 You can also compute these with prompt variations, by asking the same questions in slightly different ways, or re-running on the same samples with different prompt formats.
 
-
+### Cost and efficiency
+
+When designing and reporting evaluation results, we need to start collectively reporting results against model running costs! A reasoning model that requires 10 minutes of thinking and 10K tokens to answer 10 + 20 (because it decides to make an entire segue on binary vs decimal arithmetic) is considerably less efficient than a smol model answering 30 in a handful of tokens.
+
+We suggest you report the following:
+- **Token consumption**: Report the total number of output tokens used during evaluation. This is particularly important for estimating **efficiency**, and it will affect the cost of model-as-judge evaluations. Token counts directly impact monetary costs and help others estimate the computational requirements. **Monetary cost** can also be a good proxy for efficiency.
+
+<Sidenote>These cost metrics can also be critical when comparing evaluation methods. For instance, while using a powerful LLM as a judge might provide better signal than automatic metrics, the 100x cost increase may not be justified for all use cases. Similarly, sampling-based metrics (pass@k, maj@n) multiply costs by the number of samples, which should be weighed against the improved signal they provide.</Sidenote>
+
+- **Time**: Document the inference time required by the model to complete the evaluation. This includes both the actual inference time and any overhead from API rate limits. This is particularly important for any time-sensitive applications (like some agentic tool use, as in GAIA2).
+
+<Image src={envImage} alt="Environmental impact metrics for model evaluation" width="400px" />
+
+Last but not least, reporting the environmental footprint of the models you are running is becoming increasingly important given the overall state of resources available on Earth. This includes carbon emissions from training and energy consumption at inference, which will depend on the model size, the hardware (if you know it), and the number of tokens generated. Some smaller or quantized models reach a very interesting performance-to-consumption ratio.
+
 
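As another editorial aside on the token, cost, and time bullets added in the hunk above: a minimal sketch of the bookkeeping an evaluation loop can report, assuming a placeholder `generate()` call and an illustrative per-token price (neither comes from the commit).

```python
import time

PRICE_PER_1K_OUTPUT_TOKENS = 0.002  # placeholder rate, not a real provider price


def generate(prompt: str) -> tuple[str, int]:
    """Stand-in for your model call; returns (answer, number_of_output_tokens)."""
    answer = "30"
    return answer, len(answer.split())


def run_eval(prompts: list[str]) -> tuple[list[str], dict]:
    """Run the evaluation and report output tokens, wall-clock time, and estimated cost."""
    answers, total_tokens = [], 0
    start = time.perf_counter()
    for prompt in prompts:
        answer, n_tokens = generate(prompt)
        answers.append(answer)
        total_tokens += n_tokens
    elapsed = time.perf_counter() - start
    report = {
        "output_tokens": total_tokens,
        "wall_clock_seconds": round(elapsed, 2),
        "estimated_cost_usd": round(total_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS, 4),
    }
    return answers, report


answers, report = run_eval(["What is 10 + 20?"])
print(report)
```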
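For the environmental-footprint point, one practical option is wrapping the evaluation run in an emissions tracker. The sketch below assumes the `codecarbon` package and its `EmissionsTracker` start/stop API, reusing `run_eval` from the previous sketch; verify the exact arguments and return value against the library's documentation before relying on them.

```python
# Assumes `pip install codecarbon`; constructor arguments and units should be
# double-checked against the codecarbon documentation.
from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="my-eval-run")
tracker.start()

answers, report = run_eval(["What is 10 + 20?"])  # evaluation loop from the previous sketch

emissions_kg = tracker.stop()  # estimated emissions for the run, in kg CO2eq
print(f"Estimated emissions: {emissions_kg} kg CO2eq over {report['output_tokens']} output tokens")
```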