---
title: SAGE Benchmark
emoji: 🧪
colorFrom: green
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: true
license: apache-2.0
short_description: SAGE Scientific Reasoning Benchmark Leaderboard
sdk_version: 5.43.1
hf_oauth: true
tags:
- leaderboard
- science
- benchmark
- evaluation
---

# SAGE: Science AGent Evaluation Benchmark

SAGE (Scientific Advanced General Evaluation) is a large-scale, high-difficulty, cross-disciplinary benchmark developed by Shanghai AI Laboratory for evaluating frontier scientific reasoning capabilities of Large Language Models (LLMs).

## Benchmark Overview

SAGE evaluates models across seven core scientific fields covering the key domains of AI for Science (AI4S):
- **Mathematics** - Abstract algebra, analysis, differential equations, and computational mathematics
- **Physics** - Classical mechanics, electrodynamics, quantum mechanics, thermodynamics, and astrophysics
- **Chemistry** - Physical chemistry, inorganic chemistry, organic chemistry, and analytical chemistry
- **Biology** - Genetics, immunology, molecular biology, biophysics, and ecology
- **Computer Science** - Computer architecture, artificial intelligence, and software fundamentals
- **Earth Science** - Geography, geodesy, atmospheric chemistry, marine science, and geology
- **Materials Science** - Composite materials, metal materials, organic polymer materials, and material synthesis

## Submission Format

Submit your evaluation results as a JSON file in the following format:

```json
{
    "submission_org": "Your Organization",
    "submission_email": "contact@example.com",
    "predictions": [
        {
            "original_question_id": 0,
            "content": ["answer1", "answer2", "answer3", "answer4"],
            "reasoning_content": ["reasoning1", "reasoning2", "reasoning3", "reasoning4"]
        }
    ]
}
```
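
Before uploading, it can help to check a submission locally against the structure shown above. The following is a minimal sketch, not the Space's actual validation logic: the helper name `validate_submission` is hypothetical, and the rule that `content` and `reasoning_content` have equal length is an assumption inferred from the example, not a documented requirement.

```python
import json

# Field names are taken from the example submission above.
REQUIRED_TOP_LEVEL = ("submission_org", "submission_email", "predictions")
REQUIRED_PREDICTION = ("original_question_id", "content", "reasoning_content")


def validate_submission(submission):
    """Hypothetical helper: return a list of problems (empty if none were found)."""
    errors = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in submission:
            errors.append(f"missing top-level field: {key}")

    predictions = submission.get("predictions", [])
    if not isinstance(predictions, list):
        errors.append("predictions must be a list")
        return errors

    for i, pred in enumerate(predictions):
        if not isinstance(pred, dict):
            errors.append(f"predictions[{i}] must be an object")
            continue
        for key in REQUIRED_PREDICTION:
            if key not in pred:
                errors.append(f"predictions[{i}] missing field: {key}")
        content = pred.get("content")
        reasoning = pred.get("reasoning_content")
        # Assumption: one reasoning trace per answer, as in the example.
        if isinstance(content, list) and isinstance(reasoning, list):
            if len(content) != len(reasoning):
                errors.append(f"predictions[{i}] has mismatched list lengths")
    return errors


submission = {
    "submission_org": "Your Organization",
    "submission_email": "contact@example.com",
    "predictions": [
        {
            "original_question_id": 0,
            "content": ["answer1", "answer2", "answer3", "answer4"],
            "reasoning_content": ["reasoning1", "reasoning2", "reasoning3", "reasoning4"],
        }
    ],
}

problems = validate_submission(submission)
# Serialize only when the structure checks out.
submission_json = json.dumps(submission, indent=4) if not problems else None
```

The resulting `submission_json` string can be written to a `.json` file for upload. The check covers structure only and says nothing about how answers are scored.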

## Key Features

- **Simplified Interface**: Clean, easy-to-use interface focused on SAGE benchmark results
- **Real-time Evaluation**: Immediate processing and scoring of submissions
- **Multi-domain Analysis**: Detailed breakdown across scientific domains
- **Persistent Leaderboard**: Results are automatically saved and persist across sessions
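
As an illustration of what a per-domain breakdown might look like, the sketch below groups already-scored records by scientific field and reports accuracy for each. It is not the logic in `src/leaderboard/sage_eval.py`: the function name `accuracy_by_domain` and the record fields `domain` and `is_correct` are hypothetical, chosen only for this example.

```python
from collections import defaultdict


def accuracy_by_domain(scored_records):
    """Hypothetical helper: map each domain to its fraction of correct answers."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in scored_records:
        domain = record["domain"]          # assumed field name
        totals[domain] += 1
        correct[domain] += int(record["is_correct"])  # assumed field name
    return {domain: correct[domain] / totals[domain] for domain in totals}


# Toy records for illustration only; these are not benchmark results.
example_records = [
    {"domain": "Physics", "is_correct": True},
    {"domain": "Physics", "is_correct": False},
    {"domain": "Chemistry", "is_correct": True},
]

breakdown = accuracy_by_domain(example_records)
# breakdown == {"Physics": 0.5, "Chemistry": 1.0}
```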

## Code Structure

- `src/about.py` - SAGE-specific task definitions and content
- `src/leaderboard/sage_eval.py` - SAGE evaluation logic and result processing
- `src/submission/sage_submit.py` - Simplified submission processing
- `initial_sage_results.json` - Benchmark results from major models