Commit a940a62 · Clémentine committed · 1 parent: 5737cc1
fix
app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx
CHANGED
@@ -322,8 +322,8 @@ Once you've selected your model, you need to define what is the best possible pr
 
 <Note title="Prompt design guidelines" emoji="📝" variant="info">
 Provide a clear description of the task at hand:
-- *Your task is to do X
-- *You will be provided with Y
+- *Your task is to do X.*
+- *You will be provided with Y.*
 
 Provide clear instructions on the evaluation criteria, including a detailed scoring system if needed:
 - *You should evaluate property Z on a scale of 1 - 5, where 1 means ...*
@@ -333,7 +333,7 @@ Provide some additional "reasoning" evaluation steps:
 - *To judge this task, you must first make sure to read sample Y carefully to identify ..., then ...*
 
 Specify the desired output format (adding fields will help consistency)
-- *Your answer should be provided in JSON, with the following format {"Score": Your score, "Reasoning": The reasoning which led you to this score}*
+- *Your answer should be provided in JSON, with the following format \{"Score": Your score, "Reasoning": The reasoning which led you to this score\}*
 </Note>
 
 You can and should take inspiration from [MixEval](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended/mix_eval/judge_prompts.pyy) or [MTBench](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended/mt_bench/judge_prompt_templates.py) prompt templates.
@@ -341,7 +341,7 @@ You can and should take inspiration from [MixEval](https://github.com/huggingfac
 <Note title="To remember when doing model as judge" emoji="⚠️" variant="warning">
 Pairwise comparison [correlates better with human preference](https://arxiv.org/abs/2403.16950) than scoring, and is more robust generally.
 
-If you really want a score, use an integer scale make sure you provide a detailed explanation for what [each score represents](https://x.com/seungonekim/status/1749289437165769177), or an additive prompt (
+If you really want a score, use an integer scale make sure you provide a detailed explanation for what [each score represents](https://x.com/seungonekim/status/1749289437165769177), or an additive prompt (*provide 1 point for this characteristic of the answer, 1 additional point if ...* etc)
 
 Using one prompt per capability to score tends to give better and more robust results
 </Note>
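For concreteness, here is a minimal sketch of a scoring judge prompt that follows the "Prompt design guidelines" Note in the diff above. The task (factual accuracy), the criteria wording, and the placeholder names are illustrative assumptions, not taken from the chapter or from lighteval.

```python
# A hypothetical judge prompt following the guidelines: task description,
# scoring criteria, reasoning steps, and a structured JSON output format.
JUDGE_PROMPT_TEMPLATE = """Your task is to evaluate the factual accuracy of a model answer.
You will be provided with a question, a reference answer, and the model answer.

You should evaluate factual accuracy on a scale of 1 - 5, where 1 means the model answer
contradicts the reference and 5 means it is fully consistent with it.

To judge this task, you must first read the reference answer carefully to identify the key
facts, then check each one against the model answer before giving your score.

Your answer should be provided in JSON, with the following format
{{"Score": your score, "Reasoning": the reasoning which led you to this score}}

Question: {question}
Reference answer: {reference}
Model answer: {answer}"""


def build_judge_prompt(question: str, reference: str, answer: str) -> str:
    """Fill the template for one sample before sending it to the judge model."""
    return JUDGE_PROMPT_TEMPLATE.format(question=question, reference=reference, answer=answer)
```

Keeping the score and the reasoning in separate JSON fields, as the guidelines suggest, makes the judge output easier to parse and audit.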
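And a similarly hypothetical sketch of the additive prompt variant mentioned in the warning Note, where each satisfied criterion adds one point; the criteria themselves are made up for illustration.

```python
# An additive-scale judge prompt: one point per criterion is usually easier for a
# judge model to apply consistently than an abstract wide scale.
ADDITIVE_JUDGE_PROMPT = """You will be provided with a user question and a model answer.

Award points additively:
- Provide 1 point if the answer addresses the user question.
- Provide 1 additional point if the answer is factually correct.
- Provide 1 additional point if the answer is clear and well organized.

Your answer should be provided in JSON, with the following format
{{"Score": total points from 0 to 3, "Reasoning": the reasoning which led you to this score}}

Question: {question}
Model answer: {answer}"""
```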