---
title: SAGE Benchmark
emoji: 🧪
colorFrom: green
colorTo: indigo
sdk: gradio
sdk_version: 5.43.1
app_file: app.py
pinned: true
license: apache-2.0
short_description: SAGE Scientific Reasoning Benchmark Leaderboard
hf_oauth: true
tags:
- leaderboard
- science
- benchmark
- evaluation
---
# SAGE: Scientific Advanced General Evaluation Benchmark

SAGE (Scientific Advanced General Evaluation) is a large-scale, high-difficulty, cross-disciplinary benchmark developed by Shanghai AI Laboratory for evaluating the frontier scientific reasoning capabilities of Large Language Models (LLMs).
## Benchmark Overview
SAGE evaluates models across seven core scientific fields covering the key domains of AI for Science (AI4S):
- Mathematics - Abstract algebra, analysis, differential equations, and computational mathematics
- Physics - Classical mechanics, electrodynamics, quantum mechanics, thermodynamics, and astrophysics
- Chemistry - Physical chemistry, inorganic chemistry, organic chemistry, and analytical chemistry
- Biology - Genetics, immunology, molecular biology, biophysics, and ecology
- Computer Science - Computer architecture, artificial intelligence, and software fundamentals
- Earth Science - Geography, geodesy, atmospheric chemistry, marine science, and geology
- Materials Science - Composite materials, metal materials, organic polymer materials, and material synthesis
## Submission Format

Submit your evaluation results as a JSON file in the following format:
```json
{
  "submission_org": "Your Organization",
  "submission_email": "contact@example.com",
  "predictions": [
    {
      "original_question_id": 0,
      "content": ["answer1", "answer2", "answer3", "answer4"],
      "reasoning_content": ["reasoning1", "reasoning2", "reasoning3", "reasoning4"]
    }
  ]
}
```
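Before uploading, it can help to assemble and sanity-check the file programmatically. The sketch below builds a submission in the format above; the equal-length check on `content` and `reasoning_content` is an assumption drawn from the example, not a documented rule.

```python
import json

# Keys taken from the example submission above.
REQUIRED_KEYS = {"original_question_id", "content", "reasoning_content"}

def build_submission(org, email, predictions):
    """Assemble and lightly sanity-check a SAGE submission payload."""
    for pred in predictions:
        missing = REQUIRED_KEYS - pred.keys()
        if missing:
            raise ValueError(f"prediction missing keys: {missing}")
        # Assumption: answers and their reasoning traces align one-to-one.
        if len(pred["content"]) != len(pred["reasoning_content"]):
            raise ValueError("content and reasoning_content must have equal length")
    return {"submission_org": org, "submission_email": email, "predictions": predictions}

submission = build_submission(
    "Your Organization",
    "contact@example.com",
    [{
        "original_question_id": 0,
        "content": ["answer1", "answer2", "answer3", "answer4"],
        "reasoning_content": ["reasoning1", "reasoning2", "reasoning3", "reasoning4"],
    }],
)

with open("sage_submission.json", "w", encoding="utf-8") as f:
    json.dump(submission, f, ensure_ascii=False, indent=2)
```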
## Key Features
- Simplified Interface: Clean, easy-to-use interface focused on SAGE benchmark results
- Real-time Evaluation: Immediate processing and scoring of submissions
- Multi-domain Analysis: Detailed breakdown across scientific domains
- Persistent Leaderboard: Results are automatically saved and persist across sessions
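To make the multi-domain breakdown and persistence concrete, here is a minimal sketch assuming a gold map from question id to domain and reference answer. The names `aggregate_scores`, `persist_entry`, and `leaderboard.json`, and the exact-match grading, are illustrative assumptions, not the actual implementation in this Space.

```python
import json
from collections import defaultdict
from pathlib import Path

LEADERBOARD_PATH = Path("leaderboard.json")  # hypothetical storage file

def aggregate_scores(predictions, gold):
    """Per-domain exact-match accuracy.

    `gold` maps question id -> {"domain": ..., "answer": ...};
    exact match is a stand-in for the real grading logic.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for pred in predictions:
        ref = gold[pred["original_question_id"]]
        total[ref["domain"]] += 1
        if ref["answer"] in pred["content"]:
            correct[ref["domain"]] += 1
    return {domain: correct[domain] / total[domain] for domain in total}

def persist_entry(org, scores):
    """Append a scored entry so results survive app restarts."""
    board = json.loads(LEADERBOARD_PATH.read_text()) if LEADERBOARD_PATH.exists() else []
    board.append({"org": org, "scores": scores})
    LEADERBOARD_PATH.write_text(json.dumps(board, indent=2))
```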
## Code Structure

- `src/about.py` - SAGE-specific task definitions and content
- `src/leaderboard/sage_eval.py` - SAGE evaluation logic and result processing
- `src/submission/sage_submit.py` - Simplified submission processing
- `initial_sage_results.json` - Benchmark results from major models
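As a rough sketch of how these pieces could be wired together in `app.py`: the function names and JSON fields below are assumptions standing in for the actual code in the modules listed above.

```python
import json
from pathlib import Path

import gradio as gr

RESULTS_PATH = Path("initial_sage_results.json")

def load_rows():
    """Read stored results into table rows; the field names
    ("model", "overall") are assumptions about the JSON layout."""
    if not RESULTS_PATH.exists():
        return []
    return [[entry.get("model", "?"), entry.get("overall", 0.0)]
            for entry in json.loads(RESULTS_PATH.read_text())]

def handle_submission(filepath):
    """Stand-in for src/submission/sage_submit.py: parse the upload,
    then refresh the table (scoring and persistence omitted here)."""
    _submission = json.loads(Path(filepath).read_text())
    return load_rows()

with gr.Blocks(title="SAGE Benchmark") as demo:
    gr.Markdown("# SAGE Scientific Reasoning Benchmark Leaderboard")
    table = gr.Dataframe(value=load_rows(), headers=["Model", "Overall"])
    upload = gr.File(label="Submission JSON", file_types=[".json"])
    gr.Button("Submit").click(handle_submission, inputs=upload, outputs=table)

demo.launch()
```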