add dataset card

index.html CHANGED (+31 -0)
@@ -54,6 +54,30 @@
 
 <script>
 const markdown = `
+BaseBench: A Foundational Language Model Evaluation Framework
+
+**Description**:
+BaseBench is a targeted evaluation framework designed to assess the fundamental capabilities of large language models across a spectrum of basic yet crucial tasks. This suite focuses on core competencies that serve as building blocks for more complex language understanding and generation.
+
+**Features**:
+
+1. Encoding/Decoding Proficiency: Tests the model's ability to work with common encoding schemes such as Base64 and ROT13, evaluating its understanding of data representation and transformation.
+
+2. Basic Mathematical Reasoning: Assesses the model's capacity to perform simple arithmetic operations and mathematical problem-solving, gauging its numerical processing capabilities.
+
+3. Linguistic Analysis: Examines the model's grasp of fundamental language properties such as character counting and frequency analysis, probing its understanding of word structure and composition.
+
+4. Error Detection and Correction: Challenges the model to identify and rectify typographical errors, testing its language pattern recognition and error-handling abilities (tokenization).
+
+**Purpose**:
+BaseBench aims to provide a clear, quantifiable measure of a language model's proficiency in these foundational areas. By focusing on these essential skills, the benchmark offers:
+
+1. A standardized baseline for comparing different models or versions.
+2. Insight into a model's fundamental processing capabilities.
+3. A tool for identifying potential gaps in basic language and data handling skills.
+4. A means to track incremental improvements in core model competencies.
+5. Enough difficulty to avoid saturation.
+
 | Rank | Model                              | Accuracy           | Time  | Speed     |
 |------|------------------------------------|--------------------|-------|-----------|
 | 1    | openai/gpt-4o                      | 59.00% (1475/2500) | 03:17 | 12.66it/s |
@@ -67,6 +91,13 @@
 | 9    | 01-ai/yi-large                     | 20.68% (517/2500)  | 02:37 | 15.83it/s |
 | 10   | mistralai/mixtral-8x22b-instruct   | 19.60% (490/2500)  | 04:32 | 9.18it/s  |
 | 11   | meta-llama/llama-3.1-70b-instruct  | 19.04% (476/2500)  | 18:01 | 2.31it/s  |
+
+**Insights**:
+- GPT models lead (only Anthropic's flagship manages to beat 4o-mini)
+- Mistral Large is an outlier: it beats GPT-4o-mini easily, which also matches its MMLU-Pro score
+- Llama models score fairly low
+- Closed-source/proprietary models tend to score better (e.g. Mistral Large), possibly due to training differences
+- Gemini is fast, but its quality is comparable to Gemma
 `;
 
 document.addEventListener('DOMContentLoaded', function() {
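The four task categories are easiest to picture as concrete prompt/answer pairs. The card doesn't publish the harness itself, so the sketch below is illustrative only (the item wording, the `items` shape, and the `score` helper are all assumptions): one plausible item per category, plus the exact-match scoring that the `correct/total` accuracy format suggests, written in the same browser JavaScript as the surrounding index.html.

```js
// Illustrative sketch only: BaseBench's actual items and scorer are not shown
// in this card, so everything below is an assumption about their shape.

// ROT13: rotate each letter 13 places within its case; leave other characters alone.
function rot13(text) {
  return text.replace(/[a-zA-Z]/g, (c) => {
    const base = c <= 'Z' ? 65 : 97; // char code of 'A' or 'a'
    return String.fromCharCode(((c.charCodeAt(0) - base + 13) % 26) + base);
  });
}

// One hypothetical item per category from the Features list.
const items = [
  // 1. Encoding/decoding: Base64 (btoa is the standard browser encoder)
  { prompt: `Decode this Base64 string: ${btoa('hello world')}`, answer: 'hello world' },
  // 1. Encoding/decoding: ROT13
  { prompt: `Decode this ROT13 string: ${rot13('hello world')}`, answer: 'hello world' },
  // 2. Basic mathematical reasoning
  { prompt: 'What is 17 * 24?', answer: '408' },
  // 3. Linguistic analysis: character counting
  { prompt: "How many times does the letter 'r' appear in 'strawberry'?", answer: '3' },
  // 4. Error detection and correction
  { prompt: "Correct the typo in: 'the quick brwon fox'", answer: 'the quick brown fox' },
];

// Exact-match scoring, formatted like the leaderboard's Accuracy column.
function score(responses) {
  const correct = items.filter((item, i) => responses[i].trim() === item.answer).length;
  return `${((100 * correct) / items.length).toFixed(2)}% (${correct}/${items.length})`;
}
```

Strict string matching is only one plausible reading of the `(1475/2500)`-style reporting; the real scorer may well be more lenient about case or whitespace.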
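The leaderboard columns are also internally consistent: Accuracy is correct answers over the 2500 items, and Speed is approximately items divided by wall-clock Time (assuming `it/s` means items per elapsed second; the small gap is presumably rounding or per-batch overhead). A quick check against the openai/gpt-4o row:

```js
// Cross-checking the openai/gpt-4o row of the leaderboard above.
const correct = 1475, total = 2500;
console.log(`${((100 * correct) / total).toFixed(2)}%`);
// -> "59.00%", matching the Accuracy column

const elapsed = 3 * 60 + 17; // Time "03:17" is 197 seconds
console.log(`${(total / elapsed).toFixed(2)}it/s`);
// -> "12.69it/s", close to the reported 12.66it/s
```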