Clémentine committed
Commit 427b813 · 1 Parent(s): 6b1cc14

fixed text

app/src/content/assets/image/env.png ADDED

Git LFS Details

  • SHA256: fbcef0931fde4128efb8451e71e3c1bdbd663e2eb8e82f2bf982a3f048f376a7
  • Pointer size: 131 Bytes
  • Size of remote file: 659 kB
app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx CHANGED
@@ -5,7 +5,9 @@ title: "Designing your automatic evaluation"
 import Note from "../../../components/Note.astro";
 import Sidenote from "../../../components/Sidenote.astro";
 import HtmlEmbed from "../../../components/HtmlEmbed.astro";
+ import Image from "../../../components/Image.astro";
 import UsingHumanAnnotators from "../human-evaluation/using-human-annotators.mdx";
+ import envImage from '../../assets/image/env.png';

 ### Dataset

@@ -109,7 +111,7 @@ If you are looking at **log-probabilities**, your metrics are going to be easy:

 If you're looking at **generative** evaluations, this is where it gets trickyyy, so the next chapter is specifically on this!

- ## The hardest part of evaluation: Scoring free form text
+ ## Evaluation's main challenge: Scoring free form text

 Scoring free-form text is tricky because there are typically many different ways to express the same correct answer, making it hard to determine semantic equivalence through simple string matching, and output variations can make two semantically identical answers look completely different. Responses can be partially correct or contain a mix of accurate and inaccurate information. There can even be no single ground truth for the problem at hand, for example for tasks requiring to judge coherence, helpfulness, and style, which are inherently subjective and context-dependent.

@@ -548,9 +550,9 @@ However, some recent [research](https://arxiv.org/abs/2408.02442) seems to show
 - [Interleaved generation](https://github.com/guidance-ai/guidance?tab=readme-ov-file#guidance-acceleration), another method to constrain generations for some specific output formats
 </Note>

- ### The forgotten children of evaluation
+ ## The forgotten children of evaluation

- #### Confidence
+ ### Confidence interval and statistical validity

 When reporting evaluation results, it's critical to include **confidence intervals** alongside point estimates.

@@ -558,5 +560,19 @@ These confidence intervals from the raw scores can be obtained from standard dev

 You can also compute these with prompt variations, by asking the same questions in slightly different ways, or re-running on the same samples with different prompt formats.

- #### Cost
+ ### Cost and efficiency
+
+ When designing and reporting evaluation results, we need to start collectively reporting results against model running costs! A reasoning model which requires 10 minutes of thinking and 10K tokens to answer 10 + 20 (because it decides to make an entire segue on binary vs decimal arithmetics) is considerably less efficient than a smol model answering 30 in a handful of tokens.
+
+ We suggest you report the following:
+ - **Token consumption**: Report the total number of output tokens used during evaluation. This is particularly important to estimate **efficiency**, and it will affect the cost of model as judge evaluations. Token counts directly impact monetary costs and help others estimate the computational requirements. **Monetary cost** can also be a good proxy for efficiency.
+
+ <Sidenote> These cost metrics can also be critical when comparing evaluation methods. For instance, while using a powerful LLM as a judge might provide better signal than automatic metrics, the 100x cost increase may not be justified for all use cases. Similarly, sampling-based metrics (pass@k, maj@n) multiply costs with the number of samples, which should be weighed against the improved signal they provide.</Sidenote>
+
+ - **Time**: Document the inference time required by the model to complete the evaluation. This includes both the actual inference time and any overhead from API rate limits. This is particularly important for any time-sensitive applications (like some agentic tool use, as in GAIA2).
+
+ <Image src={envImage} alt="Environmental impact metrics for model evaluation" width="400px" />
+
+ Last but not least, reporting the environmental footprint of the models you are running is becoming increasingly important with the overall state of resources available on earth. This includes carbon emissions from training and energy consumption at inference, and these will depend on the model size, hardware (if you know it) and the tokens generated. Some smaller or quantized models reach a very interesting performance to consumption ratio
+
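To make the "Confidence interval and statistical validity" section added above concrete, here is a minimal sketch, not taken from the diff itself, of how a 95% interval can be derived from the raw per-sample scores. It assumes 0/1 exact-match scores and shows both a normal approximation on the standard error and bootstrap resampling:

```python
import numpy as np

def normal_ci(scores, z=1.96):
    """Mean score with an approximate 95% interval from the standard error."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    stderr = scores.std(ddof=1) / np.sqrt(len(scores))
    return mean, (mean - z * stderr, mean + z * stderr)

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Mean score with a bootstrap percentile interval over resampled means."""
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    boot_means = np.array([
        rng.choice(scores, size=len(scores), replace=True).mean()
        for _ in range(n_boot)
    ])
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lower, upper)

# Hypothetical per-sample exact-match scores (1 = correct, 0 = incorrect)
scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1] * 20
print(normal_ci(scores))     # mean accuracy and its interval
print(bootstrap_ci(scores))
```

The same helpers can be reused across prompt-variation reruns: collect one mean per variation and report the spread across variations as well.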
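For the "Cost and efficiency" section, a rough sketch of what tracking could look like around an evaluation loop; `model_fn`, the price constants, and the returned token counts are hypothetical placeholders rather than any actual API:

```python
import time

# Hypothetical prices in USD per million tokens; substitute your provider's rates.
PRICE_PER_M_INPUT = 0.50
PRICE_PER_M_OUTPUT = 1.50

def run_eval_with_cost_tracking(model_fn, prompts):
    """Runs the model on each prompt while tallying tokens, wall-clock time, and cost.

    `model_fn` is assumed to return (generated_text, n_input_tokens, n_output_tokens).
    """
    totals = {"input_tokens": 0, "output_tokens": 0, "seconds": 0.0}
    predictions = []
    for prompt in prompts:
        start = time.perf_counter()
        text, n_in, n_out = model_fn(prompt)
        totals["seconds"] += time.perf_counter() - start  # includes any API wait time
        totals["input_tokens"] += n_in
        totals["output_tokens"] += n_out
        predictions.append(text)
    totals["usd"] = (
        totals["input_tokens"] / 1e6 * PRICE_PER_M_INPUT
        + totals["output_tokens"] / 1e6 * PRICE_PER_M_OUTPUT
    )
    return predictions, totals
```

Reporting these totals alongside the scores makes it easy to weigh, say, a pass@16 run against a single-sample run on both signal and cost.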
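For the environmental-footprint point, one possible approach (our suggestion, not something the commit prescribes) is to wrap the evaluation run in a tracker such as the `codecarbon` package, which estimates energy use and CO2-equivalent emissions from the detected hardware; a minimal sketch, assuming the package is installed:

```python
from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="my-eval-run")  # hypothetical project name
tracker.start()
# ... run the evaluation loop here ...
emissions_kg_co2eq = tracker.stop()  # estimated kg of CO2-equivalent emitted
print(f"Estimated emissions: {emissions_kg_co2eq:.4f} kg CO2eq")
```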