Clémentine committed dc8d285 (parent: 81747d0): rephrase
app/src/content/chapters/troubleshooting/troubleshooting-inference.mdx
### My results are very bad

The first thing to do is always to inspect your model generations in detail.
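A minimal sketch of what "inspecting generations" can look like in practice, assuming a hypothetical JSONL details file with `prompt`, `generation`, `prediction`, and `gold` fields per sample (adapt the field names to whatever your evaluation harness actually writes out):

```python
import json

def show_samples(path: str, n: int = 5) -> None:
    """Print the first n generations next to their gold answers.

    Assumes one JSON record per line with hypothetical keys:
    prompt, generation, prediction (parsed answer), gold.
    """
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= n:
                break
            record = json.loads(line)
            print(f"--- sample {i} ---")
            print("prompt:    ", record["prompt"][:200])
            print("generation:", record["generation"][:200])
            print("parsed:    ", record["prediction"])
            print("gold:      ", record["gold"])
```

Reading a handful of samples side by side like this is usually enough to spot whether the problem is in the model's output or in your parsing of it.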
Some frequent problems you should look for when troubleshooting are:
- Is your model output parsing too strict before computing the metric? Overly strict parsing can silently lose correct answers (the obvious fix is to relax it, but you'll then get more false positives!)
- Is your model struggling to follow your output format in few-shot settings? This frequently happens with recent models trained on overly specific evaluation formats. You can either adapt your prompt format, or state that models should be able to follow it, and that the ones which struggle are simply not good enough for the task you are considering.
- Is your model exceedingly verbose? If so, it likely never gets to the correct answer. This is more frequent in long-context models (we observed it with Qwen and Command R models in 2024) and in reasoning models, especially if the task stops generation too soon. You can either increase the allowed context length, add instructions to be concise in the task prompt, or just assume that models should be able to answer succinctly.
### My model is very slow!

➡️ Changing the batch size