Update src/about.py
src/about.py  +6 -6
@@ -23,7 +23,7 @@ NUM_FEWSHOT = 0 # Change with your few shot
 
 
 # Your leaderboard name
-TITLE = """<h1 align="center" id="space-title">
+TITLE = """<h1 align="center" id="space-title">Open Taiwan LLM leaderboard</h1>"""
 
 # What does your leaderboard evaluate?
 INTRODUCTION_TEXT = """
@@ -36,7 +36,7 @@ LLM_BENCHMARKS_TEXT = f"""
 The leaderboard evaluates LLMs on the following benchmarks:
 
 1. TMLU (Taiwanese Mandarin Language Understanding): Measures the model's ability to understand Taiwanese Mandarin text across various domains.
-2. TW Truthful QA: Assesses the model's capability to provide truthful answers to questions in Taiwanese Mandarin, with a focus on Taiwan-specific context.
+2. TW Truthful QA: Assesses the model's capability to provide truthful and localized answers to questions in Taiwanese Mandarin, with a focus on Taiwan-specific context.
 3. TW Legal Eval: Evaluates the model's understanding of legal terminology and concepts in Taiwanese Mandarin, using questions from the Taiwanese bar exam for lawyers.
 4. MMLU (Massive Multitask Language Understanding): Tests the model's performance on a wide range of tasks in English.
 
@@ -44,10 +44,10 @@ To reproduce our results, please follow the instructions in the provided GitHub
 
 該排行榜在以下考題上評估 LLMs:
 
-1. TMLU(
-2. TW Truthful QA
-3. TW Legal Eval
-4. MMLU(
+1. [TMLU(臺灣中文大規模多任務語言理解)](https://huggingface.co/datasets/miulab/tmlu):衡量模型理解各個領域(國中、高中、大學、國考)的能力。
+2. TW Truthful QA:評估模型以臺灣特定的背景來回答問題,測試模型的在地化能力。
+3. [TW Legal Eval](https://huggingface.co/datasets/lianghsun/tw-legal-benchmark-v1):使用臺灣律師資格考試的問題,評估模型對臺灣法律術語和概念的理解。
+4. [MMLU(英文大規模多任務語言理解)](https://huggingface.co/datasets/cais/mmlu):測試模型在英語中各種任務上的表現。
 
 要重現我們的結果,請按照:https://github.com/adamlin120/lm-evaluation-harness/blob/main/run_all.sh
 """
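The reproduction pointer in the text above is run_all.sh in the adamlin120/lm-evaluation-harness fork, which stays authoritative. As a rough illustration only, a programmatic run with the lm-evaluation-harness Python API might look like the sketch below; the task names "tmlu", "tw_truthful_qa", and "tw_legal_eval" are assumptions about how that fork registers these benchmarks (only "mmlu" is a standard task id), the model checkpoint is a placeholder, and the snippet assumes a recent lm-evaluation-harness release that exposes lm_eval.simple_evaluate.

```python
# Minimal sketch, NOT the leaderboard's actual pipeline (see run_all.sh in the fork).
# Assumed task ids and a placeholder checkpoint are used for illustration only.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face transformers backend
    model_args="pretrained=your-org/your-model",   # placeholder model checkpoint
    tasks=["tmlu", "tw_truthful_qa", "tw_legal_eval", "mmlu"],  # assumed task ids
    num_fewshot=0,                                 # matches NUM_FEWSHOT = 0 in the hunk header
    batch_size="auto",
)

# Print one metrics dict per task (e.g. accuracy) for a quick comparison with the board.
for task, metrics in results["results"].items():
    print(task, metrics)
```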