<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>BaseBench</title>
  <script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
  <style>
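    /* Centered single-column layout; dark mode is toggled by adding .dark-mode to <body> */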
    body {
      font-family: Arial, sans-serif;
      max-width: 800px;
      margin: 0 auto;
      padding: 20px;
      transition: background-color 0.3s, color 0.3s;
    }
    body.dark-mode {
      background-color: #1a1a1a;
      color: #f0f0f0;
    }
    h1 {
      text-align: center;
    }
    #content {
      margin-top: 20px;
    }
    #theme-toggle {
      position: absolute;
      top: 10px;
      right: 10px;
      padding: 5px 10px;
      background-color: #4CAF50;
      color: white;
      border: none;
      cursor: pointer;
    }
    table {
      border-collapse: collapse;
      width: 100%;
    }
    th, td {
      border: 1px solid #ddd;
      padding: 8px;
      text-align: left;
    }
    .dark-mode th, .dark-mode td {
      border-color: #444;
    }
  </style>
</head>
<body>
  <h1>BaseBench</h1>
  <button id="theme-toggle">Dark/Light Theme</button>
  <div id="content"></div>
  <script>
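    // Page copy: the benchmark description and leaderboard below are plain Markdown,
    // rendered into #content by marked.js at load time.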
    const markdown = `
BaseBench: A Foundational Language Model Evaluation Framework

**Description**:

BaseBench is a targeted evaluation framework designed to assess the fundamental capabilities of large language models across a spectrum of basic yet crucial tasks. This suite focuses on core competencies that serve as building blocks for more complex language understanding and generation.

**Features**:

1. Encoding/Decoding Proficiency: Tests the model's ability to work with common encoding schemes like Base64 and ROT13, evaluating its understanding of data representation and transformation (see the sketch after this list).
2. Basic Mathematical Reasoning: Assesses the model's capacity to perform simple arithmetic operations and mathematical problem-solving, gauging its numerical processing capabilities.
3. Linguistic Analysis: Examines the model's grasp of fundamental language properties such as character counting and frequency analysis, probing its understanding of word structure and composition.
4. Error Detection and Correction: Challenges the model to identify and rectify typographical errors, testing its language pattern recognition and error handling abilities (skills closely tied to tokenization).
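
For illustration, a check for a single encoding or character-counting item might look like the minimal sketch below (hypothetical helper and field names, not the actual BaseBench harness):

    // Illustrative only: toy items and an exact-match scorer.
    function rot13(text) {
      return text.replace(/[a-z]/gi, function(c) {
        var base = c <= 'Z' ? 65 : 97;
        return String.fromCharCode((c.charCodeAt(0) - base + 13) % 26 + base);
      });
    }
    function countChar(word, ch) {
      return word.split('').filter(function(c) { return c === ch; }).length;
    }
    // Each item pairs a prompt with its expected answer; scoring is exact match.
    var items = [
      { prompt: 'Decode the ROT13 string "Onfrorapu"', expected: rot13('Onfrorapu') },
      { prompt: 'How many times does "r" appear in "strawberry"?', expected: String(countChar('strawberry', 'r')) }
    ];
    function score(item, modelAnswer) {
      return modelAnswer.trim() === item.expected ? 1 : 0;
    }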

**Purpose**:

BaseBench aims to provide a clear, quantifiable measure of a language model's proficiency in these foundational areas. By focusing on these essential skills, the benchmark offers:

1. A standardized baseline for comparing different models or versions.
2. Insight into a model's fundamental processing capabilities.
3. A tool for identifying potential gaps in basic language and data handling skills.
4. A means to track incremental improvements in core model competencies.
5. Sufficient difficulty to avoid benchmark saturation.

| Rank | Model | Accuracy | Time (mm:ss) | Speed |
|------|------------------------------------|--------------------|--------------|-----------|
| 1 | openai/gpt-4o | 59.00% (1475/2500) | 03:17 | 12.66it/s |
| 2 | anthropic/claude-3.5-sonnet:beta | 52.56% (1314/2500) | 14:44 | 2.83it/s |
| 3 | mistralai/mistral-large-2407 | 37.20% (930/2500) | 05:13 | 7.96it/s |
| 4 | openai/gpt-4o-mini | 36.92% (923/2500) | 08:28 | 4.91it/s |
| 5 | anthropic/claude-3-haiku:beta | 36.72% (918/2500) | 06:20 | 6.57it/s |
| 6 | google/gemini-pro-1.5 | 26.92% (673/2500) | 03:05 | 13.51it/s |
| 7 | google/gemma-2-27b-it | 25.24% (631/2500) | 05:52 | 7.08it/s |
| 8 | meta-llama/llama-3.1-405b-instruct | 24.24% (606/2500) | 07:19 | 5.69it/s |
| 9 | 01-ai/yi-large | 20.68% (517/2500) | 02:37 | 15.83it/s |
| 10 | mistralai/mixtral-8x22b-instruct | 19.60% (490/2500) | 04:32 | 9.18it/s |
| 11 | meta-llama/llama-3.1-70b-instruct | 19.04% (476/2500) | 18:01 | 2.31it/s |

**Insights**:

- GPT models lead overall; among non-OpenAI models, only Anthropic's flagship (Claude 3.5 Sonnet) beats GPT-4o-mini by a clear margin
- Mistral Large is an outlier, narrowly edging out GPT-4o-mini (in line with its MMLU-Pro score)
- Llama models score fairly low
- Closed-source/proprietary models (e.g., Mistral Large) tend to score better, possibly due to training differences?
- Gemini Pro 1.5 is fast, but its quality here is comparable to Gemma 2's
`;
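    // Render the Markdown into #content and wire up the dark/light theme toggle.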
    document.addEventListener('DOMContentLoaded', function() {
      const content = document.getElementById('content');
      content.innerHTML = marked.parse(markdown);
      const themeToggle = document.getElementById('theme-toggle');
      themeToggle.addEventListener('click', function() {
        document.body.classList.toggle('dark-mode');
      });
    });
  </script>
</body>
</html>