Clémentine commited on
Commit
62c875d
·
1 Parent(s): 57240ca
app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx CHANGED
@@ -184,9 +184,9 @@ Common sampling-based metrics are:
184
  When you use sampling evaluations, make sure to always report all sampling parameters (temperature, top-p, k value) as they significantly affect results.
185
 
186
  <Note title="When can you use sampling and when shouldn't you?">
187
- **For training evaluation/ablations**: ❌ Generally avoid sampling metrics as they're expensive and add variance. Stick to greedy decoding with a fixed seed.
188
- **For post-training evaluation**: ✅ Sampling metrics can reveal capabilities that greedy decoding misses (especially for more complex tasks requiring reasoning, math or code).
189
- **At inference**: ✅ These metrics help estimate how much improvement you can get from sampling multiple times at inference. It's particularly cool when you want to study how far you can push small models with test time compute.
190
 
191
  However, keep in mind that sampling k times multiplies your evaluation cost by k. For expensive models or large datasets, this adds up very quickly!
192
  </Note>
 
184
  When you use sampling evaluations, make sure to always report all sampling parameters (temperature, top-p, k value) as they significantly affect results.
185
 
186
  <Note title="When can you use sampling and when shouldn't you?">
187
+ - **For training evaluation/ablations**: ❌ Generally avoid sampling metrics as they're expensive and add variance. Stick to greedy decoding with a fixed seed.
188
+ - **For post-training evaluation**: ✅ Sampling metrics can reveal capabilities that greedy decoding misses (especially for more complex tasks requiring reasoning, math or code).
189
+ - **At inference**: ✅ These metrics help estimate how much improvement you can get from sampling multiple times at inference. It's particularly cool when you want to study how far you can push small models with test time compute.
190
 
191
  However, keep in mind that sampling k times multiplies your evaluation cost by k. For expensive models or large datasets, this adds up very quickly!
192
  </Note>