Clémentine committed
Commit a940a62 · 1 Parent(s): 5737cc1
app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx CHANGED
@@ -322,8 +322,8 @@ Once you've selected your model, you need to define what is the best possible pr
 
 <Note title="Prompt design guidelines" emoji="📝" variant="info">
 Provide a clear description of the task at hand:
- - *Your task is to do X*.
- - *You will be provided with Y*.
+ - *Your task is to do X.*
+ - *You will be provided with Y.*
 
 Provide clear instructions on the evaluation criteria, including a detailed scoring system if needed:
 - *You should evaluate property Z on a scale of 1 - 5, where 1 means ...*
@@ -333,7 +333,7 @@ Provide some additional "reasoning" evaluation steps:
 - *To judge this task, you must first make sure to read sample Y carefully to identify ..., then ...*
 
 Specify the desired output format (adding fields will help consistency)
- - *Your answer should be provided in JSON, with the following format {"Score": Your score, "Reasoning": The reasoning which led you to this score}*
+ - *Your answer should be provided in JSON, with the following format \{"Score": Your score, "Reasoning": The reasoning which led you to this score\}*
 </Note>
 
 You can and should take inspiration from [MixEval](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended/mix_eval/judge_prompts.pyy) or [MTBench](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended/mt_bench/judge_prompt_templates.py) prompt templates.
@@ -341,7 +341,7 @@ You can and should take inspiration from [MixEval](https://github.com/huggingfac
 <Note title="To remember when doing model as judge" emoji="⚠️" variant="warning">
 Pairwise comparison [correlates better with human preference](https://arxiv.org/abs/2403.16950) than scoring, and is more robust generally.
 
- If you really want a score, use an integer scale make sure you provide a detailed explanation for what [each score represents](https://x.com/seungonekim/status/1749289437165769177), or an additive prompt (`provide 1 point for this characteristic of the answer, 1 additional point if ...` etc)
+ If you really want a score, use an integer scale and make sure you provide a detailed explanation for what [each score represents](https://x.com/seungonekim/status/1749289437165769177), or an additive prompt (*provide 1 point for this characteristic of the answer, 1 additional point if ...* etc)
 
 Using one prompt per capability to score tends to give better and more robust results
 </Note>
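
To make the prompt design guidelines in the note above concrete, here is a minimal sketch of a pointwise judge prompt in Python. The criterion (factual accuracy), the scale descriptions, and the helper names are illustrative assumptions, not taken from the lighteval templates linked in the diff.

```python
# Minimal sketch of a pointwise judge prompt following the guidelines above:
# task description, evaluation criterion with a described 1-5 scale, a reasoning
# step, and a fixed JSON output format. Wording and names are illustrative.

import json

JUDGE_PROMPT_TEMPLATE = """Your task is to evaluate the factual accuracy of a model answer.
You will be provided with a question, a reference answer, and the model answer.

You should evaluate factual accuracy on a scale of 1 - 5, where:
1 means the answer contradicts the reference,
3 means the answer is partially correct but omits or distorts key facts,
5 means the answer is fully consistent with the reference.

To judge this task, you must first read the reference answer carefully to identify
the key facts, then check each of them against the model answer before scoring.

Your answer should be provided in JSON, with the following format:
{{"Score": your score, "Reasoning": the reasoning which led you to this score}}

Question: {question}
Reference answer: {reference}
Model answer: {answer}
"""


def build_judge_prompt(question: str, reference: str, answer: str) -> str:
    """Fill the template with one evaluation sample."""
    return JUDGE_PROMPT_TEMPLATE.format(question=question, reference=reference, answer=answer)


def parse_judge_output(raw_output: str) -> dict:
    """Parse the judge's JSON reply; fall back to a null score if it is malformed."""
    try:
        parsed = json.loads(raw_output)
        return {"score": int(parsed["Score"]), "reasoning": parsed["Reasoning"]}
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return {"score": None, "reasoning": raw_output}
```

The `parse_judge_output` helper shows why pinning down the output format pays off: the judge's reply can be parsed mechanically, and malformed replies are easy to flag instead of silently corrupting scores.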
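
Along the same lines, the warning note's two recommendations, pairwise comparison and additive scoring, can also be sketched as prompt templates. Again, the wording, rubric items, and field names below are illustrative assumptions rather than templates from the chapter.

```python
# Illustrative sketches of a pairwise comparison prompt and an additive scoring
# prompt. Both are meant to be filled with str.format, hence the doubled braces
# around the JSON examples.

PAIRWISE_JUDGE_PROMPT = """Your task is to decide which of two model answers better answers the question.
You will be provided with the question and the two answers, labeled A and B.

To judge this task, first read the question carefully to identify what is being asked,
then compare how completely and correctly each answer addresses it.

Your answer should be provided in JSON, with the following format:
{{"Winner": "A" or "B" or "Tie", "Reasoning": the reasoning which led you to this verdict}}

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
"""

ADDITIVE_JUDGE_PROMPT = """Your task is to score a model answer to a question.
You will be provided with the question and the answer.

Start from 0 points and apply the following additive rubric:
- Provide 1 point if the answer is relevant to the question.
- Provide 1 additional point if the answer is factually correct.
- Provide 1 additional point if the answer addresses every part of the question.
- Provide 1 additional point if the answer is clear and concise.

Your answer should be provided in JSON, with the following format:
{{"Score": total points from 0 to 4, "Reasoning": the reasoning which led you to this score}}

Question: {question}
Answer: {answer}
"""

if __name__ == "__main__":
    # Usage example with placeholder inputs.
    print(PAIRWISE_JUDGE_PROMPT.format(question="What is 2+2?", answer_a="4", answer_b="5"))
    print(ADDITIVE_JUDGE_PROMPT.format(question="What is 2+2?", answer="4"))
```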