Clémentine committed · Commit f69c124 · 1 Parent(s): 427b813
app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx CHANGED
@@ -552,7 +552,7 @@ However, some recent [research](https://arxiv.org/abs/2408.02442) seems to show
 
 ## The forgotten children of evaluation
 
-### Confidence interval and statistical validity
+### Statistical validity
 
 When reporting evaluation results, it's critical to include **confidence intervals** alongside point estimates.
 
@@ -562,8 +562,11 @@ You can also compute these with prompt variations, by asking the same questions
 
 ### Cost and efficiency
 
+
 When designing and reporting evaluation results, we need to start collectively reporting results against model running costs! A reasoning model which requires 10 minutes of thinking and 10K tokens to answer 10 + 20 (because it decides to make an entire segue on binary vs decimal arithmetic) is considerably less efficient than a smol model answering 30 in a handful of tokens.
 
+<img src={envImage.src} alt="Environmental impact metrics for model evaluation" style="max-width: 400px; height: auto; display: block; margin: 0 auto;" />
+
 We suggest you report the following:
 - **Token consumption**: Report the total number of output tokens used during evaluation. This is particularly important to estimate **efficiency**, and it will affect the cost of model-as-judge evaluations. Token counts directly impact monetary costs and help others estimate the computational requirements. **Monetary cost** can also be a good proxy for efficiency.
 
@@ -571,8 +574,8 @@ We suggest you report the following:
 
 - **Time**: Document the inference time required by the model to complete the evaluation. This includes both the actual inference time and any overhead from API rate limits. This is particularly important for any time-sensitive applications (like some agentic tool use, as in GAIA2).
 
-<Image src={envImage} alt="Environmental impact metrics for model evaluation" width="400px" />
-
 Last but not least, reporting the environmental footprint of the models you are running is becoming increasingly important given the overall state of resources available on Earth. This includes carbon emissions from training and energy consumption at inference, which will depend on the model size, the hardware (if you know it), and the tokens generated. Some smaller or quantized models reach a very interesting performance-to-consumption ratio.
 
 
+
+
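The confidence-interval advice in the changed section can be sketched with a percentile bootstrap over per-sample scores. This is a minimal illustration, not code from the chapter: the function name, resample count, and the 200-sample 0/1 accuracy data are all made up for the example.

```python
import random

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a mean eval score.

    Resamples the per-question scores with replacement, recomputes the
    mean each time, and reads the CI off the sorted resampled means.
    """
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(scores) / n, (lo, hi)

# Hypothetical eval: 140 correct out of 200 binary-scored questions
scores = [1] * 140 + [0] * 60
mean, (lo, hi) = bootstrap_ci(scores)  # report "70% (95% CI [lo, hi])"
```

Reporting the interval rather than the point estimate makes it obvious when two models' scores are statistically indistinguishable on a benchmark of that size.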
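The token/time/cost bullets added in this section can be wired into an evaluation loop with a tiny accumulator. A sketch under stated assumptions: the class name is invented, the per-1K-token price is a placeholder you would replace with your provider's actual rate, and the commented-out `model.generate` call stands in for whatever inference API you use.

```python
import time
from dataclasses import dataclass

@dataclass
class EvalCostReport:
    """Accumulates the efficiency metrics the section suggests reporting."""
    output_tokens: int = 0
    wall_seconds: float = 0.0
    price_per_1k_tokens: float = 0.002  # placeholder rate, not a real price

    def record(self, n_tokens: int, seconds: float) -> None:
        self.output_tokens += n_tokens
        self.wall_seconds += seconds

    @property
    def monetary_cost(self) -> float:
        # Monetary cost as a proxy for efficiency, as suggested above
        return self.output_tokens / 1000 * self.price_per_1k_tokens

# Hypothetical usage around one model call:
report = EvalCostReport()
start = time.perf_counter()
# answer = model.generate(prompt)        # your inference call here
n_tokens = 42                            # e.g. len(tokenizer.encode(answer))
report.record(n_tokens, time.perf_counter() - start)
```

Timing with `perf_counter` around the whole call deliberately includes API-rate-limit overhead, which matters for the time-sensitive applications mentioned above.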
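For the environmental-footprint paragraph, a back-of-the-envelope estimate from hardware power and run time can be sketched as below. Both inputs are assumptions for illustration: the 300 W draw is a ballpark GPU figure and the 400 gCO2e/kWh grid intensity is a rough placeholder; use your measured power and your grid's actual carbon intensity (or a dedicated tool) for real reporting.

```python
def inference_footprint(gpu_power_watts, seconds, grid_gco2_per_kwh=400.0):
    """Rough energy (kWh) and carbon (gCO2e) estimate for an inference run.

    All inputs are assumptions to be measured or looked up, not defaults
    endorsed by the chapter.
    """
    kwh = gpu_power_watts * seconds / 3_600_000  # watt-seconds -> kWh
    return kwh, kwh * grid_gco2_per_kwh

# e.g. a hypothetical 300 W GPU running for 10 minutes
kwh, gco2 = inference_footprint(300, 600)
```

Ratios like score per kWh make the "performance to consumption" comparison between small quantized models and large reasoning models concrete.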