Clémentine
committed on
Commit
·
62c875d
1
Parent(s):
57240ca
wip
app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx
CHANGED
|
@@ -184,9 +184,9 @@ Common sampling-based metrics are:
|
|
| 184 |
When you use sampling evaluations, make sure to always report all sampling parameters (temperature, top-p, k value) as they significantly affect results.
|
| 185 |
|
| 186 |
<Note title="When can you use sampling and when shouldn't you?">
|
| 187 |
-
**For training evaluation/ablations**: ❌ Generally avoid sampling metrics as they're expensive and add variance. Stick to greedy decoding with a fixed seed.
|
| 188 |
-
**For post-training evaluation**: ✅ Sampling metrics can reveal capabilities that greedy decoding misses (especially for more complex tasks requiring reasoning, math or code).
|
| 189 |
-
**At inference**: ✅ These metrics help estimate how much improvement you can get from sampling multiple times at inference. It's particularly cool when you want to study how far you can push small models with test time compute.
|
| 190 |
|
| 191 |
However, keep in mind that sampling k times multiplies your evaluation cost by k. For expensive models or large datasets, this adds up very quickly!
|
| 192 |
</Note>
|
|
|
|
| 184 |
When you use sampling evaluations, make sure to always report all sampling parameters (temperature, top-p, k value) as they significantly affect results.
|
| 185 |
|
| 186 |
<Note title="When can you use sampling and when shouldn't you?">
|
| 187 |
+
- **For training evaluation/ablations**: ❌ Generally avoid sampling metrics as they're expensive and add variance. Stick to greedy decoding with a fixed seed.
|
| 188 |
+
- **For post-training evaluation**: ✅ Sampling metrics can reveal capabilities that greedy decoding misses (especially for more complex tasks requiring reasoning, math or code).
|
| 189 |
+
- **At inference**: ✅ These metrics help estimate how much improvement you can get from sampling multiple times at inference. It's particularly cool when you want to study how far you can push small models with test time compute.
|
| 190 |
|
| 191 |
However, keep in mind that sampling k times multiplies your evaluation cost by k. For expensive models or large datasets, this adds up very quickly!
|
| 192 |
</Note>
|
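The cost/benefit trade-off described in the note can be made concrete with a small sketch. It is not part of this commit. It assumes the sampling metric is a pass@k-style score and uses the standard unbiased estimator `1 - C(n-c, k) / C(n, k)`; the function names `pass_at_k` and `total_generations` are hypothetical and chosen only for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct.

    Assumption: the metric is pass@k-style, i.e. a prompt counts as solved
    if at least one of k sampled generations is correct.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def total_generations(num_prompts: int, samples_per_prompt: int) -> int:
    """Generations needed when each prompt is sampled several times."""
    return num_prompts * samples_per_prompt


if __name__ == "__main__":
    # Hypothetical numbers: 16 samples per prompt, 3 of them correct.
    for k in (1, 4, 16):
        print(f"pass@{k}: {pass_at_k(n=16, c=3, k=k):.3f}")
    # Cost grows linearly with the number of samples per prompt.
    print(total_generations(num_prompts=1000, samples_per_prompt=16))
```

Comparing the estimate at `k=1` with larger `k` gives a rough idea of how much headroom repeated sampling offers, while `total_generations` shows the matching linear growth in evaluation cost.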