<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>BaseBench</title>
  <script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
  <style>
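    /* Centered single-column layout; dark mode is toggled by adding .dark-mode to <body> */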
    body {
      font-family: Arial, sans-serif;
      max-width: 800px;
      margin: 0 auto;
      padding: 20px;
      transition: background-color 0.3s, color 0.3s;
    }
    body.dark-mode {
      background-color: #1a1a1a;
      color: #f0f0f0;
    }
    h1 {
      text-align: center;
    }
    #content {
      margin-top: 20px;
    }
    #theme-toggle {
      position: absolute;
      top: 10px;
      right: 10px;
      padding: 5px 10px;
      background-color: #4CAF50;
      color: white;
      border: none;
      cursor: pointer;
    }
    table {
      border-collapse: collapse;
      width: 100%;
    }
    th, td {
      border: 1px solid #ddd;
      padding: 8px;
      text-align: left;
    }
    .dark-mode th, .dark-mode td {
      border-color: #444;
    }
  </style>
</head>
<body>
  <h1>BaseBench</h1>
  <button id="theme-toggle">Dark/Light Theme</button>
  <div id="content"></div>
  <script>
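    // Page copy: the benchmark description and leaderboard below are plain Markdown,
    // rendered into #content by marked.js at load time.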
    const markdown = `
BaseBench: A Foundational Language Model Evaluation Framework

**Description**:

BaseBench is a targeted evaluation framework designed to assess the fundamental capabilities of large language models across a spectrum of basic yet crucial tasks. This suite focuses on core competencies that serve as building blocks for more complex language understanding and generation.

**Features**:

1. Encoding/Decoding Proficiency: Tests the model's ability to work with common encoding schemes like Base64 and ROT13, evaluating its understanding of data representation and transformation (see the sketch after this list).
2. Basic Mathematical Reasoning: Assesses the model's capacity to perform simple arithmetic operations and mathematical problem-solving, gauging its numerical processing capabilities.
3. Linguistic Analysis: Examines the model's grasp of fundamental language properties such as character counting and frequency analysis, probing its understanding of word structure and composition.
4. Error Detection and Correction: Challenges the model to identify and rectify typographical errors, testing its language pattern recognition and error handling abilities (skills closely tied to tokenization).
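
For illustration, a check for a single encoding or character-counting item might look like the minimal sketch below (hypothetical helper and field names, not the actual BaseBench harness):

    // Illustrative only: toy items and an exact-match scorer.
    function rot13(text) {
      return text.replace(/[a-z]/gi, function(c) {
        var base = c <= 'Z' ? 65 : 97;
        return String.fromCharCode((c.charCodeAt(0) - base + 13) % 26 + base);
      });
    }
    function countChar(word, ch) {
      return word.split('').filter(function(c) { return c === ch; }).length;
    }
    // Each item pairs a prompt with its expected answer; scoring is exact match.
    var items = [
      { prompt: 'Decode the ROT13 string "Onfrorapu"', expected: rot13('Onfrorapu') },
      { prompt: 'How many times does "r" appear in "strawberry"?', expected: String(countChar('strawberry', 'r')) }
    ];
    function score(item, modelAnswer) {
      return modelAnswer.trim() === item.expected ? 1 : 0;
    }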

**Purpose**:

BaseBench aims to provide a clear, quantifiable measure of a language model's proficiency in these foundational areas. By focusing on these essential skills, the benchmark offers:

1. A standardized baseline for comparing different models or versions.
2. Insight into a model's fundamental processing capabilities.
3. A tool for identifying potential gaps in basic language and data handling skills.
4. A means to track incremental improvements in core model competencies.
5. Sufficient difficulty to avoid benchmark saturation.

| Rank | Model | Accuracy | Time (mm:ss) | Speed |
|------|------------------------------------|--------------------|--------------|-----------|
| 1 | openai/gpt-4o | 59.00% (1475/2500) | 03:17 | 12.66it/s |
| 2 | anthropic/claude-3.5-sonnet:beta | 52.56% (1314/2500) | 14:44 | 2.83it/s |
| 3 | mistralai/mistral-large-2407 | 37.20% (930/2500) | 05:13 | 7.96it/s |
| 4 | openai/gpt-4o-mini | 36.92% (923/2500) | 08:28 | 4.91it/s |
| 5 | anthropic/claude-3-haiku:beta | 36.72% (918/2500) | 06:20 | 6.57it/s |
| 6 | google/gemini-pro-1.5 | 26.92% (673/2500) | 03:05 | 13.51it/s |
| 7 | google/gemma-2-27b-it | 25.24% (631/2500) | 05:52 | 7.08it/s |
| 8 | meta-llama/llama-3.1-405b-instruct | 24.24% (606/2500) | 07:19 | 5.69it/s |
| 9 | 01-ai/yi-large | 20.68% (517/2500) | 02:37 | 15.83it/s |
| 10 | mistralai/mixtral-8x22b-instruct | 19.60% (490/2500) | 04:32 | 9.18it/s |
| 11 | meta-llama/llama-3.1-70b-instruct | 19.04% (476/2500) | 18:01 | 2.31it/s |

**Insights**:

- GPT models lead overall; among non-OpenAI models, only Anthropic's flagship (Claude 3.5 Sonnet) beats GPT-4o-mini by a clear margin
- Mistral Large is an outlier, narrowly edging out GPT-4o-mini (in line with its MMLU-Pro score)
- Llama models score fairly low
- Closed-source/proprietary models (e.g., Mistral Large) tend to score better, possibly due to training differences?
- Gemini Pro 1.5 is fast, but its quality here is comparable to Gemma 2's
`;
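    // Render the Markdown into #content and wire up the dark/light theme toggle.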
    document.addEventListener('DOMContentLoaded', function() {
      const content = document.getElementById('content');
      content.innerHTML = marked.parse(markdown);
      const themeToggle = document.getElementById('theme-toggle');
      themeToggle.addEventListener('click', function() {
        document.body.classList.toggle('dark-mode');
      });
    });
  </script>
</body>
</html>