---
title: SAGE Benchmark
emoji: 🧪
colorFrom: green
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: true
license: apache-2.0
short_description: SAGE Scientific Reasoning Benchmark Leaderboard
sdk_version: 5.43.1
hf_oauth: true
tags:
  - leaderboard
  - science
  - benchmark
  - evaluation
---

# SAGE: Science AGent Evaluation Benchmark

SAGE (Scientific Advanced General Evaluation) is a large-scale, high-difficulty, cross-disciplinary benchmark developed by Shanghai AI Laboratory for evaluating frontier scientific reasoning capabilities of Large Language Models (LLMs).

## Benchmark Overview

SAGE evaluates models across seven core scientific fields covering the key domains of AI for Science (AI4S):

- **Mathematics** - Abstract algebra, analysis, differential equations, and computational mathematics
- **Physics** - Classical mechanics, electrodynamics, quantum mechanics, thermodynamics, and astrophysics
- **Chemistry** - Physical chemistry, inorganic chemistry, organic chemistry, and analytical chemistry
- **Biology** - Genetics, immunology, molecular biology, biophysics, and ecology
- **Computer Science** - Computer architecture, artificial intelligence, and software fundamentals
- **Earth Science** - Geography, geodesy, atmospheric chemistry, marine science, and geology
- **Materials Science** - Composite materials, metal materials, organic polymer materials, and material synthesis

## Submission Format

Submit your evaluation results as JSON files with the following format:

```json
{
    "submission_org": "Your Organization",
    "submission_email": "contact@example.com",
    "predictions": [
        {
            "original_question_id": 0,
            "content": ["answer1", "answer2", "answer3", "answer4"],
            "reasoning_content": ["reasoning1", "reasoning2", "reasoning3", "reasoning4"]
        }
    ]
}
```
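As a concrete illustration, the sketch below assembles and writes a submission file in this format. The `build_submission` helper and its checks are hypothetical (not part of this repository); they only assume the schema shown above.

```python
import json

def build_submission(org: str, email: str, predictions: list[dict]) -> dict:
    """Assemble a submission dict matching the expected JSON schema (illustrative only)."""
    for pred in predictions:
        # Each prediction carries an integer question id and a list of answers;
        # in this sketch we also check that reasoning traces, if present, align with answers.
        assert isinstance(pred["original_question_id"], int)
        assert isinstance(pred["content"], list)
        if "reasoning_content" in pred:
            assert len(pred["reasoning_content"]) == len(pred["content"])
    return {
        "submission_org": org,
        "submission_email": email,
        "predictions": predictions,
    }

if __name__ == "__main__":
    submission = build_submission(
        org="Your Organization",
        email="contact@example.com",
        predictions=[
            {
                "original_question_id": 0,
                "content": ["answer1", "answer2", "answer3", "answer4"],
                "reasoning_content": ["reasoning1", "reasoning2", "reasoning3", "reasoning4"],
            }
        ],
    )
    # Write the submission file that gets uploaded through the leaderboard UI.
    with open("sage_submission.json", "w", encoding="utf-8") as f:
        json.dump(submission, f, ensure_ascii=False, indent=2)
```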

## Key Features

- **Simplified Interface**: Clean, easy-to-use interface focused on SAGE benchmark results
- **Real-time Evaluation**: Immediate processing and scoring of submissions
- **Multi-domain Analysis**: Detailed breakdown across scientific domains
- **Persistent Leaderboard**: Results are automatically saved and persist across sessions

## Code Structure

- `src/about.py` - SAGE-specific task definitions and content
- `src/leaderboard/sage_eval.py` - SAGE evaluation logic and result processing
- `src/submission/sage_submit.py` - Simplified submission processing
- `initial_sage_results.json` - Benchmark results from major models
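For a quick look at the bundled results, the following minimal sketch inspects `initial_sage_results.json` without assuming anything about its internal schema beyond being valid JSON:

```python
import json

# Load the bundled benchmark results and report their top-level shape.
with open("initial_sage_results.json", encoding="utf-8") as f:
    results = json.load(f)

if isinstance(results, dict):
    print("Top-level keys:", list(results.keys()))
elif isinstance(results, list):
    print("Number of entries:", len(results))
```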