updated benchmark description.
README.md
CHANGED
@@ -144,7 +144,7 @@ We used llm-jp-eval(v1.3.0), JP Language Model Evaluation Harness(commit #9b42d4
 - Automatic summarization (XL-Sum [Hasan et al., 2021])
 - Machine translation (WMT2020 ja-en [Barrault et al., 2020])
 - Machine translation (WMT2020 en-ja [Barrault et al., 2020])
--
+- Arithmetic reasoning (MGSM [Shi et al., 2023])
 - Academic exams (JMMLU [Yin et al., 2024])
 - Code generation (JHumanEval [Sato et al., 2024])
 
@@ -157,7 +157,7 @@ We used the Language Model Evaluation Harness(v.0.4.2) and Code Generation LM Evaluation Harness
 - Machine reading comprehension (SQuAD2 [Rajpurkar et al., 2018])
 - Commonsense reasoning (XWINO [Tikhonov and Ryabinin, 2021])
 - Natural language inference (HellaSwag [Zellers et al., 2019])
--
+- Arithmetic reasoning (GSM8K [Cobbe et al., 2021])
 - Mathematical reasoning (MATH [Hendrycks et al., 2022][Lightman et al., 2024])
 - Reasoning (BBH (BIG-Bench-Hard) [Suzgun et al., 2023])
 - Academic exams (MMLU [Hendrycks et al., 2021])
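
For reference, a minimal sketch of how the newly added GSM8K task can be run with the Language Model Evaluation Harness (v0.4.x) named in the second hunk. The checkpoint, few-shot count, and batch size below are illustrative assumptions, not settings taken from this repository:

```python
# Minimal sketch: evaluating GSM8K with EleutherAI's lm-evaluation-harness
# (v0.4.x Python API). The pretrained model, num_fewshot, and batch_size
# are illustrative assumptions, not values from this commit.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-1b",  # hypothetical checkpoint
    tasks=["gsm8k"],                               # benchmark added in this commit
    num_fewshot=5,
    batch_size=8,
)
print(results["results"]["gsm8k"])  # per-task accuracy metrics
```

The MGSM entry in the first hunk belongs to the Japanese suite, which is evaluated with llm-jp-eval and the pinned JP Language Model Evaluation Harness commit rather than this harness, so it is not shown here.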