Clémentine committed
Commit a940a62 · 1 Parent(s): 5737cc1
app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx CHANGED
@@ -322,8 +322,8 @@ Once you've selected your model, you need to define what is the best possible pr
 
 <Note title="Prompt design guidelines" emoji="📝" variant="info">
 Provide a clear description of the task at hand:
- - *Your task is to do X*.
- - *You will be provided with Y*.
+ - *Your task is to do X.*
+ - *You will be provided with Y.*
 
 Provide clear instructions on the evaluation criteria, including a detailed scoring system if needed:
 - *You should evaluate property Z on a scale of 1 - 5, where 1 means ...*
@@ -333,7 +333,7 @@ Provide some additional "reasoning" evaluation steps:
 - *To judge this task, you must first make sure to read sample Y carefully to identify ..., then ...*
 
 Specify the desired output format (adding fields will help consistency)
- - *Your answer should be provided in JSON, with the following format {"Score": Your score, "Reasoning": The reasoning which led you to this score}*
+ - *Your answer should be provided in JSON, with the following format \{"Score": Your score, "Reasoning": The reasoning which led you to this score\}*
 </Note>
 
 You can and should take inspiration from [MixEval](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended/mix_eval/judge_prompts.pyy) or [MTBench](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/extended/mt_bench/judge_prompt_templates.py) prompt templates.
@@ -341,7 +341,7 @@ You can and should take inspiration from [MixEval](https://github.com/huggingfac
 <Note title="To remember when doing model as judge" emoji="⚠️" variant="warning">
 Pairwise comparison [correlates better with human preference](https://arxiv.org/abs/2403.16950) than scoring, and is more robust generally.
 
- If you really want a score, use an integer scale make sure you provide a detailed explanation for what [each score represents](https://x.com/seungonekim/status/1749289437165769177), or an additive prompt (`provide 1 point for this characteristic of the answer, 1 additional point if ...` etc)
+ If you really want a score, use an integer scale and make sure you provide a detailed explanation for what [each score represents](https://x.com/seungonekim/status/1749289437165769177), or an additive prompt (*provide 1 point for this characteristic of the answer, 1 additional point if ...* etc)
 
 Using one prompt per capability to score tends to give better and more robust results
 </Note>
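
To make the prompt design guidelines in the note above concrete, here is a minimal sketch of a pointwise judge prompt in Python. The criterion (factual accuracy), the scale descriptions, and the helper names are illustrative assumptions, not taken from the lighteval templates linked in the diff.

```python
# Minimal sketch of a pointwise judge prompt following the guidelines above:
# task description, evaluation criterion with a described 1-5 scale, a reasoning
# step, and a fixed JSON output format. Wording and names are illustrative.

import json

JUDGE_PROMPT_TEMPLATE = """Your task is to evaluate the factual accuracy of a model answer.
You will be provided with a question, a reference answer, and the model answer.

You should evaluate factual accuracy on a scale of 1 - 5, where:
1 means the answer contradicts the reference,
3 means the answer is partially correct but omits or distorts key facts,
5 means the answer is fully consistent with the reference.

To judge this task, you must first read the reference answer carefully to identify
the key facts, then check each of them against the model answer before scoring.

Your answer should be provided in JSON, with the following format:
{{"Score": your score, "Reasoning": the reasoning which led you to this score}}

Question: {question}
Reference answer: {reference}
Model answer: {answer}
"""


def build_judge_prompt(question: str, reference: str, answer: str) -> str:
    """Fill the template with one evaluation sample."""
    return JUDGE_PROMPT_TEMPLATE.format(question=question, reference=reference, answer=answer)


def parse_judge_output(raw_output: str) -> dict:
    """Parse the judge's JSON reply; fall back to a null score if it is malformed."""
    try:
        parsed = json.loads(raw_output)
        return {"score": int(parsed["Score"]), "reasoning": parsed["Reasoning"]}
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return {"score": None, "reasoning": raw_output}
```

The `parse_judge_output` helper shows why pinning down the output format pays off: the judge's reply can be parsed mechanically, and malformed replies are easy to flag instead of silently corrupting scores.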
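
Along the same lines, the warning note's two recommendations, pairwise comparison and additive scoring, can also be sketched as prompt templates. Again, the wording, rubric items, and field names below are illustrative assumptions rather than templates from the chapter.

```python
# Illustrative sketches of a pairwise comparison prompt and an additive scoring
# prompt. Both are meant to be filled with str.format, hence the doubled braces
# around the JSON examples.

PAIRWISE_JUDGE_PROMPT = """Your task is to decide which of two model answers better answers the question.
You will be provided with the question and the two answers, labeled A and B.

To judge this task, first read the question carefully to identify what is being asked,
then compare how completely and correctly each answer addresses it.

Your answer should be provided in JSON, with the following format:
{{"Winner": "A" or "B" or "Tie", "Reasoning": the reasoning which led you to this verdict}}

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
"""

ADDITIVE_JUDGE_PROMPT = """Your task is to score a model answer to a question.
You will be provided with the question and the answer.

Start from 0 points and apply the following additive rubric:
- Provide 1 point if the answer is relevant to the question.
- Provide 1 additional point if the answer is factually correct.
- Provide 1 additional point if the answer addresses every part of the question.
- Provide 1 additional point if the answer is clear and concise.

Your answer should be provided in JSON, with the following format:
{{"Score": total points from 0 to 4, "Reasoning": the reasoning which led you to this score}}

Question: {question}
Answer: {answer}
"""

if __name__ == "__main__":
    # Usage example with placeholder inputs.
    print(PAIRWISE_JUDGE_PROMPT.format(question="What is 2+2?", answer_a="4", answer_b="5"))
    print(ADDITIVE_JUDGE_PROMPT.format(question="What is 2+2?", answer="4"))
```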