zhilinw committed (verified)
Commit 6e03470 · 1 Parent(s): d43cc64

Upload app.py

Files changed (1):
  app.py +5 -4
app.py CHANGED
@@ -154,7 +154,7 @@ with gr.Blocks(theme=theme) as app:
 
     with gr.TabItem("LLM Judge"):
         with gr.Row():
-            gr.Markdown("LLM Judge Leaderboard: LLM Judges are evaluated based on whether they can accurately predict the human-labelled criterion fulfilment across 3 different models (o3, Grok4, R1-0528). We consider not only macro-F1 but also whether LLM-Judge display bias towards/against any models using a Bias Index. The Overall score is calculated based on Overall F1 - Bias Index.")
+            gr.Markdown("LLM Judge Leaderboard: LLM Judges are evaluated based on whether they can accurately predict the human-labelled criterion fulfilment across 3 different models (o3, Grok4, R1-0528). We consider not only macro-F1 across 3486 samples but also whether LLM-Judge display bias towards/against any models using a Bias Index. The Overall score is calculated based on Overall F1 - Bias Index.")
         with gr.Tabs(elem_id="inner-tabs", elem_classes="tabs-small") as tabs:
             with gr.TabItem("Leaderboard"):
                 with gr.Row():
@@ -260,14 +260,15 @@ with gr.Blocks(theme=theme) as app:
     with gr.Row():
         with gr.Accordion("📚 Frequently Asked Questions", open=False):
             citation_button = gr.Textbox(
-                value=r"""1. How is the cost calculated?: We use the token cost from https://openrouter.ai/models multipled by the total input/output tokens in each evaluation.""",
-                lines=1,
+                value=r"""1. How is the cost calculated?: We use the token cost from https://openrouter.ai/models multipled by the total input/output tokens in each evaluation.
+2. How can I run Report Generation Leaderboard with Grounding Documents: This benchmark is unable to be run externally at the moment since we are unable to release the required grounding documents. We are working on it.""",
+                lines=2,
                 label="FAQ",
                 elem_id="faq_box",
             )
 
     with gr.Row():
-        with gr.Accordion("📚 Understand our metrics", open=False):
+        with gr.Accordion("📚 Understand the Metrics", open=False):
             citation_button = gr.Textbox(
                 value=r"""Response Generation (w Docs): We first generate the response. Then we grade the response against the human-annotated rubrics. Finally, we calculate the proportion of rubrics satisfied by each response, weighted by their criterion-weight to derive a score for each response.
 LLM Judge: We calculate macro-F1 of the LLM-judge predicted criteria-fulfillment against the human-labelled criterion fulfillment to get Overall F1. We then calculate the bias for each model by taking mean of predicted fulfilment minus mean of human-labelled fulfilment. We calculate Bias Index by taking max(bias) - min(bias) across models. Overall is calculated by Overall F1 - Bias Index.""",
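The FAQ entry touched by this commit states the cost formula only in words. As a rough illustration (not part of app.py), the arithmetic it describes would look like the sketch below; the function name, price arguments, and example numbers are hypothetical, with per-million-token prices read manually from https://openrouter.ai/models.

# Illustrative sketch of the FAQ's cost formula (not leaderboard code):
# cost = OpenRouter per-token price multiplied by total input/output tokens per evaluation.
def evaluation_cost(input_tokens: int, output_tokens: int,
                    input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Estimated USD cost of one evaluation run (prices quoted per million tokens)."""
    return (input_tokens / 1e6) * input_price_per_mtok + (output_tokens / 1e6) * output_price_per_mtok

# Example with made-up prices and token counts:
print(round(evaluation_cost(2_000_000, 500_000, input_price_per_mtok=2.0, output_price_per_mtok=8.0), 2))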
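The "Understand the Metrics" text in the diff defines both scores only in prose. A minimal sketch of how those definitions could be computed, assuming per-model lists of 0/1 criterion-fulfilment labels; the data layout, function names, and the use of scikit-learn's f1_score are assumptions for illustration, not the leaderboard's actual implementation.

# Illustrative sketch of the quoted metric definitions (assumed data layout):
# - Response Generation (w Docs): criterion-weighted proportion of rubrics satisfied.
# - LLM Judge: Overall = macro-F1 (predicted vs. human fulfilment) - Bias Index,
#   where per-model bias = mean(predicted) - mean(human) and
#   Bias Index = max(bias) - min(bias) across the graded models.
from sklearn.metrics import f1_score


def response_score(satisfied: list[bool], weights: list[float]) -> float:
    """Criterion-weighted proportion of rubric criteria satisfied by one response."""
    return sum(w for ok, w in zip(satisfied, weights) if ok) / sum(weights)


def judge_scores(pred: dict[str, list[int]], human: dict[str, list[int]]) -> dict[str, float]:
    """Overall F1, Bias Index, and Overall for one LLM judge, per the quoted definitions."""
    all_pred = [p for model in pred for p in pred[model]]
    all_human = [h for model in human for h in human[model]]
    overall_f1 = f1_score(all_human, all_pred, average="macro")

    bias = {m: sum(pred[m]) / len(pred[m]) - sum(human[m]) / len(human[m]) for m in pred}
    bias_index = max(bias.values()) - min(bias.values())
    return {"overall_f1": overall_f1, "bias_index": bias_index, "overall": overall_f1 - bias_index}

Under these assumptions, a judge that over-grades one model and under-grades another is penalised through the Bias Index even when its Overall F1 is high.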