---
title: SAGE Benchmark
emoji: 🧪
colorFrom: green
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: true
license: apache-2.0
short_description: SAGE Scientific Reasoning Benchmark Leaderboard
sdk_version: 5.43.1
hf_oauth: true
tags:
  - leaderboard
  - science
  - benchmark
  - evaluation
---
# SAGE: Scientific Advanced General Evaluation Benchmark

SAGE (Scientific Advanced General Evaluation) is a large-scale, high-difficulty, cross-disciplinary benchmark developed by Shanghai AI Laboratory for evaluating the frontier scientific reasoning capabilities of Large Language Models (LLMs).
## Benchmark Overview

SAGE evaluates models across seven core scientific fields covering the key domains of AI for Science (AI4S):

- **Mathematics** - Abstract algebra, analysis, differential equations, and computational mathematics
- **Physics** - Classical mechanics, electrodynamics, quantum mechanics, thermodynamics, and astrophysics
- **Chemistry** - Physical chemistry, inorganic chemistry, organic chemistry, and analytical chemistry
- **Biology** - Genetics, immunology, molecular biology, biophysics, and ecology
- **Computer Science** - Computer architecture, artificial intelligence, and software fundamentals
- **Earth Science** - Geography, geodesy, atmospheric chemistry, marine science, and geology
- **Materials Science** - Composite materials, metal materials, organic polymer materials, and material synthesis
## Submission Format

Submit your evaluation results as JSON files with the following format:

```json
{
  "submission_org": "Your Organization",
  "submission_email": "contact@example.com",
  "predictions": [
    {
      "original_question_id": 0,
      "content": ["answer1", "answer2", "answer3", "answer4"],
      "reasoning_content": ["reasoning1", "reasoning2", "reasoning3", "reasoning4"]
    }
  ]
}
```
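As an illustration, a submission file in this format can be assembled and sanity-checked with a short script. The field names (`submission_org`, `submission_email`, `predictions`, `original_question_id`, `content`, `reasoning_content`) come from the format above; the `build_submission` helper and the length check are an illustrative sketch, not part of the SAGE evaluation code.

```python
import json

def build_submission(org, email, predictions):
    """Assemble and minimally validate a SAGE submission dict (illustrative helper)."""
    for pred in predictions:
        # Assumption: each prediction pairs N answers with N reasoning traces.
        if len(pred["content"]) != len(pred["reasoning_content"]):
            raise ValueError(
                f"question {pred['original_question_id']}: "
                "content and reasoning_content lengths differ"
            )
    return {
        "submission_org": org,
        "submission_email": email,
        "predictions": predictions,
    }

submission = build_submission(
    "Your Organization",
    "contact@example.com",
    [
        {
            "original_question_id": 0,
            "content": ["answer1", "answer2", "answer3", "answer4"],
            "reasoning_content": ["reasoning1", "reasoning2", "reasoning3", "reasoning4"],
        }
    ],
)

# Write the file that gets uploaded through the Space interface.
with open("submission.json", "w", encoding="utf-8") as f:
    json.dump(submission, f, ensure_ascii=False, indent=2)
```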
## Key Features

- **Simplified Interface**: Clean, easy-to-use interface focused on SAGE benchmark results
- **Real-time Evaluation**: Immediate processing and scoring of submissions
- **Multi-domain Analysis**: Detailed breakdown across scientific domains
- **Persistent Leaderboard**: Results are automatically saved and persist across sessions
## Code Structure

- `src/about.py` - SAGE-specific task definitions and content
- `src/leaderboard/sage_eval.py` - SAGE evaluation logic and result processing
- `src/submission/sage_submit.py` - Simplified submission processing
- `initial_sage_results.json` - Benchmark results from major models