Clémentine committed · Commit 427b813 · Parent(s): 6b1cc14
fixed text

app/src/content/assets/image/env.png
ADDED (Git LFS)

app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx
CHANGED
@@ -5,7 +5,9 @@ title: "Designing your automatic evaluation"
 import Note from "../../../components/Note.astro";
 import Sidenote from "../../../components/Sidenote.astro";
 import HtmlEmbed from "../../../components/HtmlEmbed.astro";
+import Image from "../../../components/Image.astro";
 import UsingHumanAnnotators from "../human-evaluation/using-human-annotators.mdx";
+import envImage from '../../assets/image/env.png';
 
 ### Dataset
 
@@ -109,7 +111,7 @@ If you are looking at **log-probabilities**, your metrics are going to be easy:
 
 If you're looking at **generative** evaluations, this is where it gets trickyyy, so the next chapter is specifically on this!
 
-##
+## Evaluation's main challenge: Scoring free-form text
 
 Scoring free-form text is tricky because there are typically many different ways to express the same correct answer, which makes it hard to determine semantic equivalence through simple string matching, and output variations can make two semantically identical answers look completely different. Responses can be partially correct or contain a mix of accurate and inaccurate information. There can even be no single ground truth for the problem at hand, for example for tasks that require judging coherence, helpfulness, and style, which are inherently subjective and context-dependent.
 
@@ -548,9 +550,9 @@ However, some recent [research](https://arxiv.org/abs/2408.02442) seems to show
 - [Interleaved generation](https://github.com/guidance-ai/guidance?tab=readme-ov-file#guidance-acceleration), another method to constrain generations for some specific output formats
 </Note>
 
-
+## The forgotten children of evaluation
 
-
+### Confidence intervals and statistical validity
 
 When reporting evaluation results, it's critical to include **confidence intervals** alongside point estimates.
 
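As an editorial aside (not part of the diff itself), the confidence-interval recommendation above can be made concrete with a minimal sketch. It assumes you already have per-sample scores (0/1 correctness here; all names and numbers are placeholders) and computes both a standard-error interval and a bootstrap interval; the same resampling also works on scores collected from prompt-variation re-runs.

```python
import random
import statistics

def normal_approx_ci(scores, z=1.96):
    """~95% CI for the mean score from its standard error (normal approximation)."""
    mean = statistics.mean(scores)
    stderr = statistics.stdev(scores) / len(scores) ** 0.5
    return mean - z * stderr, mean + z * stderr

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """~95% CI by resampling the per-sample scores with replacement."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Placeholder per-sample scores (1.0 = correct, 0.0 = incorrect) for 200 questions.
rng = random.Random(1)
scores = [1.0 if rng.random() < 0.7 else 0.0 for _ in range(200)]
print(normal_approx_ci(scores))  # lower and upper bound on the mean score
print(bootstrap_ci(scores))      # similar bounds, without the normality assumption
```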
@@ -558,5 +560,19 @@ These confidence intervals from the raw scores can be obtained from standard dev
 
 You can also compute these with prompt variations, by asking the same questions in slightly different ways, or re-running on the same samples with different prompt formats.
 
-
+### Cost and efficiency
+
+When designing and reporting evaluation results, we need to start collectively reporting results against model running costs! A reasoning model that requires 10 minutes of thinking and 10K tokens to answer 10 + 20 (because it decides to make an entire segue on binary vs decimal arithmetic) is considerably less efficient than a smol model answering 30 in a handful of tokens.
+
+We suggest you report the following:
+- **Token consumption**: Report the total number of output tokens used during evaluation. This is particularly important for estimating **efficiency**, and it will affect the cost of model-as-judge evaluations. Token counts directly impact monetary costs and help others estimate the computational requirements. **Monetary cost** can also be a good proxy for efficiency.
+
+<Sidenote>These cost metrics can also be critical when comparing evaluation methods. For instance, while using a powerful LLM as a judge might provide better signal than automatic metrics, the 100x cost increase may not be justified for all use cases. Similarly, sampling-based metrics (pass@k, maj@n) multiply costs by the number of samples, which should be weighed against the improved signal they provide.</Sidenote>
+
+- **Time**: Document the inference time required by the model to complete the evaluation. This includes both the actual inference time and any overhead from API rate limits. This is particularly important for any time-sensitive applications (like some agentic tool use, as in GAIA2).
+
+<Image src={envImage} alt="Environmental impact metrics for model evaluation" width="400px" />
+
+Last but not least, reporting the environmental footprint of the models you are running is becoming increasingly important given the overall state of resources available on Earth. This includes carbon emissions from training and energy consumption at inference, which will depend on the model size, the hardware (if you know it), and the number of tokens generated. Some smaller or quantized models reach a very interesting performance-to-consumption ratio.
+
 
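As another editorial aside on the token, cost, and time bullets added in the hunk above: a minimal sketch of the bookkeeping an evaluation loop can report, assuming a placeholder `generate()` call and an illustrative per-token price (neither comes from the commit).

```python
import time

PRICE_PER_1K_OUTPUT_TOKENS = 0.002  # placeholder rate, not a real provider price


def generate(prompt: str) -> tuple[str, int]:
    """Stand-in for your model call; returns (answer, number_of_output_tokens)."""
    answer = "30"
    return answer, len(answer.split())


def run_eval(prompts: list[str]) -> tuple[list[str], dict]:
    """Run the evaluation and report output tokens, wall-clock time, and estimated cost."""
    answers, total_tokens = [], 0
    start = time.perf_counter()
    for prompt in prompts:
        answer, n_tokens = generate(prompt)
        answers.append(answer)
        total_tokens += n_tokens
    elapsed = time.perf_counter() - start
    report = {
        "output_tokens": total_tokens,
        "wall_clock_seconds": round(elapsed, 2),
        "estimated_cost_usd": round(total_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS, 4),
    }
    return answers, report


answers, report = run_eval(["What is 10 + 20?"])
print(report)
```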
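For the environmental-footprint point, one practical option is wrapping the evaluation run in an emissions tracker. The sketch below assumes the `codecarbon` package and its `EmissionsTracker` start/stop API, reusing `run_eval` from the previous sketch; verify the exact arguments and return value against the library's documentation before relying on them.

```python
# Assumes `pip install codecarbon`; constructor arguments and units should be
# double-checked against the codecarbon documentation.
from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="my-eval-run")
tracker.start()

answers, report = run_eval(["What is 10 + 20?"])  # evaluation loop from the previous sketch

emissions_kg = tracker.stop()  # estimated emissions for the run, in kg CO2eq
print(f"Estimated emissions: {emissions_kg} kg CO2eq over {report['output_tokens']} output tokens")
```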