---
title: SAGE Benchmark
emoji: 🧪
colorFrom: green
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: true
license: apache-2.0
short_description: SAGE Scientific Reasoning Benchmark Leaderboard
sdk_version: 5.43.1
hf_oauth: true
tags:
- leaderboard
- science
- benchmark
- evaluation
---
# SAGE: Scientific Advanced General Evaluation Benchmark
SAGE (Scientific Advanced General Evaluation) is a large-scale, high-difficulty, cross-disciplinary benchmark developed by Shanghai AI Laboratory for evaluating frontier scientific reasoning capabilities of Large Language Models (LLMs).
## Benchmark Overview
SAGE evaluates models across seven core scientific fields covering the key domains of AI for Science (AI4S):
- **Mathematics** - Abstract algebra, analysis, differential equations, and computational mathematics
- **Physics** - Classical mechanics, electrodynamics, quantum mechanics, thermodynamics, and astrophysics
- **Chemistry** - Physical chemistry, inorganic chemistry, organic chemistry, and analytical chemistry
- **Biology** - Genetics, immunology, molecular biology, biophysics, and ecology
- **Computer Science** - Computer architecture, artificial intelligence, and software fundamentals
- **Earth Science** - Geography, geodesy, atmospheric chemistry, marine science, and geology
- **Materials Science** - Composite materials, metal materials, organic polymer materials, and material synthesis
## Submission Format
Submit your model's predictions as a JSON file in the following format:
```json
{
  "submission_org": "Your Organization",
  "submission_email": "contact@example.com",
  "predictions": [
    {
      "original_question_id": 0,
      "content": ["answer1", "answer2", "answer3", "answer4"],
      "reasoning_content": ["reasoning1", "reasoning2", "reasoning3", "reasoning4"]
    }
  ]
}
```
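Below is a minimal Python sketch for assembling and sanity-checking a submission file before uploading. The `build_submission` helper and the output file name are illustrative, not part of the benchmark's tooling, and the equal-length check on `content` and `reasoning_content` is an assumption based on the example above (four answers paired with four reasoning traces).

```python
import json

# Hypothetical helper: assemble a submission dict in the format shown above.
def build_submission(org: str, email: str, predictions: list[dict]) -> dict:
    return {
        "submission_org": org,
        "submission_email": email,
        "predictions": predictions,
    }

predictions = [
    {
        "original_question_id": 0,
        # One entry per sampled response for this question (illustrative data).
        "content": ["answer1", "answer2", "answer3", "answer4"],
        "reasoning_content": ["reasoning1", "reasoning2", "reasoning3", "reasoning4"],
    }
]

submission = build_submission("Your Organization", "contact@example.com", predictions)

# Basic sanity checks before uploading (assumed, not mandated by the benchmark).
for p in submission["predictions"]:
    assert isinstance(p["original_question_id"], int)
    assert len(p["content"]) == len(p["reasoning_content"])

with open("sage_submission.json", "w", encoding="utf-8") as f:
    json.dump(submission, f, ensure_ascii=False, indent=2)
```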
## Key Features
- **Simplified Interface**: A clean, easy-to-use UI focused on SAGE benchmark results
- **Real-time Evaluation**: Immediate processing and scoring of submissions
- **Multi-domain Analysis**: Detailed breakdown across scientific domains
- **Persistent Leaderboard**: Results are automatically saved and persist across sessions
## Code Structure
- `src/about.py` - SAGE-specific task definitions and content
- `src/leaderboard/sage_eval.py` - SAGE evaluation logic and result processing
- `src/submission/sage_submit.py` - Simplified submission processing
- `initial_sage_results.json` - Benchmark results from major models