Clémentine committed · Commit f69c124 · 1 Parent(s): 427b813
app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx CHANGED
@@ -552,7 +552,7 @@ However, some recent [research](https://arxiv.org/abs/2408.02442) seems to show
 
 ## The forgotten children of evaluation
 
-### Confidence interval and statistical validity
+### Statistical validity
 
 When reporting evaluation results, it's critical to include **confidence intervals** alongside point estimates.
 
@@ -562,8 +562,11 @@ You can also compute these with prompt variations, by asking the same questions
 
 ### Cost and efficiency
 
+
 When designing and reporting evaluation results, we need to start collectively reporting results against model running costs! A reasoning model which requires 10 minutes of thinking and 10K tokens to answer 10 + 20 (because it decides to make an entire segue on binary vs decimal arithmetic) is considerably less efficient than a smol model answering 30 in a handful of tokens.
 
+<img src={envImage.src} alt="Environmental impact metrics for model evaluation" style="max-width: 400px; height: auto; display: block; margin: 0 auto;" />
+
 We suggest you report the following:
 - **Token consumption**: Report the total number of output tokens used during evaluation. This is particularly important to estimate **efficiency**, and it will affect the cost of model-as-judge evaluations. Token counts directly impact monetary costs and help others estimate the computational requirements. **Monetary cost** can also be a good proxy for efficiency.
 
@@ -571,8 +574,8 @@ We suggest you report the following:
 
 - **Time**: Document the inference time required by the model to complete the evaluation. This includes both the actual inference time and any overhead from API rate limits. This is particularly important for any time-sensitive applications (like some agentic tool use, as in GAIA2).
 
-<Image src={envImage} alt="Environmental impact metrics for model evaluation" width="400px" />
-
 Last but not least, reporting the environmental footprint of the models you are running is becoming increasingly important given the overall state of resources available on Earth. This includes carbon emissions from training and energy consumption at inference, which will depend on the model size, the hardware (if you know it), and the tokens generated. Some smaller or quantized models reach a very interesting performance-to-consumption ratio.
 
 
+
+
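The confidence-interval advice in the changed section can be sketched with a percentile bootstrap over per-sample scores. This is a minimal illustration, not code from the chapter: the function name, resample count, and the 200-sample 0/1 accuracy data are all made up for the example.

```python
import random

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a mean eval score.

    Resamples the per-question scores with replacement, recomputes the
    mean each time, and reads the CI off the sorted resampled means.
    """
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(scores) / n, (lo, hi)

# Hypothetical eval: 140 correct out of 200 binary-scored questions
scores = [1] * 140 + [0] * 60
mean, (lo, hi) = bootstrap_ci(scores)  # report "70% (95% CI [lo, hi])"
```

Reporting the interval rather than the point estimate makes it obvious when two models' scores are statistically indistinguishable on a benchmark of that size.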
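The token/time/cost bullets added in this section can be wired into an evaluation loop with a tiny accumulator. A sketch under stated assumptions: the class name is invented, the per-1K-token price is a placeholder you would replace with your provider's actual rate, and the commented-out `model.generate` call stands in for whatever inference API you use.

```python
import time
from dataclasses import dataclass

@dataclass
class EvalCostReport:
    """Accumulates the efficiency metrics the section suggests reporting."""
    output_tokens: int = 0
    wall_seconds: float = 0.0
    price_per_1k_tokens: float = 0.002  # placeholder rate, not a real price

    def record(self, n_tokens: int, seconds: float) -> None:
        self.output_tokens += n_tokens
        self.wall_seconds += seconds

    @property
    def monetary_cost(self) -> float:
        # Monetary cost as a proxy for efficiency, as suggested above
        return self.output_tokens / 1000 * self.price_per_1k_tokens

# Hypothetical usage around one model call:
report = EvalCostReport()
start = time.perf_counter()
# answer = model.generate(prompt)        # your inference call here
n_tokens = 42                            # e.g. len(tokenizer.encode(answer))
report.record(n_tokens, time.perf_counter() - start)
```

Timing with `perf_counter` around the whole call deliberately includes API-rate-limit overhead, which matters for the time-sensitive applications mentioned above.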
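For the environmental-footprint paragraph, a back-of-the-envelope estimate from hardware power and run time can be sketched as below. Both inputs are assumptions for illustration: the 300 W draw is a ballpark GPU figure and the 400 gCO2e/kWh grid intensity is a rough placeholder; use your measured power and your grid's actual carbon intensity (or a dedicated tool) for real reporting.

```python
def inference_footprint(gpu_power_watts, seconds, grid_gco2_per_kwh=400.0):
    """Rough energy (kWh) and carbon (gCO2e) estimate for an inference run.

    All inputs are assumptions to be measured or looked up, not defaults
    endorsed by the chapter.
    """
    kwh = gpu_power_watts * seconds / 3_600_000  # watt-seconds -> kWh
    return kwh, kwh * grid_gco2_per_kwh

# e.g. a hypothetical 300 W GPU running for 10 minutes
kwh, gco2 = inference_footprint(300, 600)
```

Ratios like score per kWh make the "performance to consumption" comparison between small quantized models and large reasoning models concrete.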