---
title: SAGE Benchmark
emoji: 🧪
colorFrom: green
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: true
license: apache-2.0
short_description: SAGE Scientific Reasoning Benchmark Leaderboard
sdk_version: 5.43.1
hf_oauth: true
tags:
- leaderboard
- science
- benchmark
- evaluation
---
# SAGE: Science AGent Evaluation Benchmark
SAGE (Scientific Advanced General Evaluation) is a large-scale, high-difficulty, cross-disciplinary benchmark developed by Shanghai AI Laboratory for evaluating the frontier scientific reasoning capabilities of large language models (LLMs).
## Benchmark Overview
SAGE evaluates models across seven core scientific fields covering the key domains of AI for Science (AI4S):
- **Mathematics** - Abstract algebra, analysis, differential equations, and computational mathematics
- **Physics** - Classical mechanics, electrodynamics, quantum mechanics, thermodynamics, and astrophysics
- **Chemistry** - Physical chemistry, inorganic chemistry, organic chemistry, and analytical chemistry
- **Biology** - Genetics, immunology, molecular biology, biophysics, and ecology
- **Computer Science** - Computer architecture, artificial intelligence, and software fundamentals
- **Earth Science** - Geography, geodesy, atmospheric chemistry, marine science, and geology
- **Materials Science** - Composite materials, metal materials, organic polymer materials, and material synthesis
## Submission Format
Submit your evaluation results as JSON files with the following format:
```json
{
  "submission_org": "Your Organization",
  "submission_email": "contact@example.com",
  "predictions": [
    {
      "original_question_id": 0,
      "content": ["answer1", "answer2", "answer3", "answer4"],
      "reasoning_content": ["reasoning1", "reasoning2", "reasoning3", "reasoning4"]
    }
  ]
}
```
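Before uploading, it can help to check a file against this layout locally. The snippet below is a minimal sketch, not the leaderboard's own validation: the field names come from the format above, while the function name `validate_submission` and the specific rules (integer question IDs, string lists, and equal lengths for `content` and `reasoning_content`) are assumptions inferred from the example.

```python
import json

# Top-level fields taken from the submission format shown above.
REQUIRED_TOP_LEVEL = ("submission_org", "submission_email", "predictions")


def _is_str_list(value) -> bool:
    """True if value is a list whose items are all strings."""
    return isinstance(value, list) and all(isinstance(v, str) for v in value)


def validate_submission(submission: dict) -> list[str]:
    """Hypothetical helper: return a list of problems (empty if none found)."""
    problems = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in submission:
            problems.append(f"missing top-level field: {key}")

    predictions = submission.get("predictions", [])
    if not isinstance(predictions, list):
        return problems + ["'predictions' must be a list"]

    for i, pred in enumerate(predictions):
        if not isinstance(pred, dict):
            problems.append(f"predictions[{i}] must be an object")
            continue
        # Assumption: question IDs are integers, as in the example.
        if not isinstance(pred.get("original_question_id"), int):
            problems.append(f"predictions[{i}].original_question_id must be an integer")
        content = pred.get("content")
        reasoning = pred.get("reasoning_content")
        if not _is_str_list(content):
            problems.append(f"predictions[{i}].content must be a list of strings")
        if not _is_str_list(reasoning):
            problems.append(f"predictions[{i}].reasoning_content must be a list of strings")
        # Assumption: one reasoning trace per answer, as the example suggests.
        if _is_str_list(content) and _is_str_list(reasoning) and len(content) != len(reasoning):
            problems.append(f"predictions[{i}]: content and reasoning_content differ in length")
    return problems


EXAMPLE = """
{
  "submission_org": "Your Organization",
  "submission_email": "contact@example.com",
  "predictions": [
    {
      "original_question_id": 0,
      "content": ["answer1", "answer2", "answer3", "answer4"],
      "reasoning_content": ["reasoning1", "reasoning2", "reasoning3", "reasoning4"]
    }
  ]
}
"""

if __name__ == "__main__":
    print(validate_submission(json.loads(EXAMPLE)))  # prints []
```

Returning a list of messages instead of raising on the first error lets you fix every problem in a file in one pass.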
## Key Features
- **Simplified Interface**: Clean, easy-to-use interface focused on SAGE benchmark results
- **Real-time Evaluation**: Immediate processing and scoring of submissions
- **Multi-domain Analysis**: Detailed breakdown across scientific domains
- **Persistent Leaderboard**: Results are automatically saved and persist across sessions
## Code Structure
- `src/about.py` - SAGE-specific task definitions and content
- `src/leaderboard/sage_eval.py` - SAGE evaluation logic and result processing
- `src/submission/sage_submit.py` - Simplified submission processing
- `initial_sage_results.json` - Benchmark results from major models