---
title: SAGE Benchmark
emoji: 🧪
colorFrom: green
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: true
license: apache-2.0
short_description: SAGE Scientific Reasoning Benchmark Leaderboard
sdk_version: 5.43.1
hf_oauth: true
tags:
  - leaderboard
  - science
  - benchmark
  - evaluation
---

# SAGE: Science AGent Evaluation Benchmark

SAGE (Scientific Advanced General Evaluation) is a large-scale, high-difficulty, cross-disciplinary benchmark developed by Shanghai AI Laboratory for evaluating frontier scientific reasoning capabilities of Large Language Models (LLMs).

## Benchmark Overview

SAGE evaluates models across seven core scientific fields covering the key domains of AI for Science (AI4S):

- **Mathematics** - Abstract algebra, analysis, differential equations, and computational mathematics
- **Physics** - Classical mechanics, electrodynamics, quantum mechanics, thermodynamics, and astrophysics
- **Chemistry** - Physical chemistry, inorganic chemistry, organic chemistry, and analytical chemistry
- **Biology** - Genetics, immunology, molecular biology, biophysics, and ecology
- **Computer Science** - Computer architecture, artificial intelligence, and software fundamentals
- **Earth Science** - Geography, geodesy, atmospheric chemistry, marine science, and geology
- **Materials Science** - Composite materials, metal materials, organic polymer materials, and material synthesis

## Submission Format

Submit your evaluation results as JSON files with the following format:

```json
{
    "submission_org": "Your Organization",
    "submission_email": "contact@example.com",
    "predictions": [
        {
            "original_question_id": 0,
            "content": ["answer1", "answer2", "answer3", "answer4"],
            "reasoning_content": ["reasoning1", "reasoning2", "reasoning3", "reasoning4"]
        }
    ]
}
```
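As a concrete illustration, the sketch below assembles and writes a submission file in this format. The `build_submission` helper and its checks are hypothetical (not part of this repository); they only assume the schema shown above.

```python
import json

def build_submission(org: str, email: str, predictions: list[dict]) -> dict:
    """Assemble a submission dict matching the expected JSON schema (illustrative only)."""
    for pred in predictions:
        # Each prediction carries an integer question id and a list of answers;
        # in this sketch we also check that reasoning traces, if present, align with answers.
        assert isinstance(pred["original_question_id"], int)
        assert isinstance(pred["content"], list)
        if "reasoning_content" in pred:
            assert len(pred["reasoning_content"]) == len(pred["content"])
    return {
        "submission_org": org,
        "submission_email": email,
        "predictions": predictions,
    }

if __name__ == "__main__":
    submission = build_submission(
        org="Your Organization",
        email="contact@example.com",
        predictions=[
            {
                "original_question_id": 0,
                "content": ["answer1", "answer2", "answer3", "answer4"],
                "reasoning_content": ["reasoning1", "reasoning2", "reasoning3", "reasoning4"],
            }
        ],
    )
    # Write the submission file that gets uploaded through the leaderboard UI.
    with open("sage_submission.json", "w", encoding="utf-8") as f:
        json.dump(submission, f, ensure_ascii=False, indent=2)
```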

## Key Features

- **Simplified Interface**: Clean, easy-to-use interface focused on SAGE benchmark results
- **Real-time Evaluation**: Immediate processing and scoring of submissions
- **Multi-domain Analysis**: Detailed breakdown across scientific domains
- **Persistent Leaderboard**: Results are automatically saved and persist across sessions

## Code Structure

- `src/about.py` - SAGE-specific task definitions and content
- `src/leaderboard/sage_eval.py` - SAGE evaluation logic and result processing
- `src/submission/sage_submit.py` - Simplified submission processing
- `initial_sage_results.json` - Benchmark results from major models
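For a quick look at the bundled results, the following minimal sketch inspects `initial_sage_results.json` without assuming anything about its internal schema beyond being valid JSON:

```python
import json

# Load the bundled benchmark results and report their top-level shape.
with open("initial_sage_results.json", encoding="utf-8") as f:
    results = json.load(f)

if isinstance(results, dict):
    print("Top-level keys:", list(results.keys()))
elif isinstance(results, list):
    print("Number of entries:", len(results))
```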