---
title: SAGE Benchmark
emoji: 🧪
colorFrom: green
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: true
license: apache-2.0
short_description: SAGE Scientific Reasoning Benchmark Leaderboard
sdk_version: 5.43.1
hf_oauth: true
tags:
  - leaderboard
  - science
  - benchmark
  - evaluation
---
# SAGE: Scientific Advanced General Evaluation Benchmark

SAGE (Scientific Advanced General Evaluation) is a large-scale, high-difficulty, cross-disciplinary benchmark developed by Shanghai AI Laboratory for evaluating the frontier scientific reasoning capabilities of Large Language Models (LLMs).
## Benchmark Overview

SAGE evaluates models across seven core scientific fields covering the key domains of AI for Science (AI4S):

- **Mathematics** - Abstract algebra, analysis, differential equations, and computational mathematics
- **Physics** - Classical mechanics, electrodynamics, quantum mechanics, thermodynamics, and astrophysics
- **Chemistry** - Physical chemistry, inorganic chemistry, organic chemistry, and analytical chemistry
- **Biology** - Genetics, immunology, molecular biology, biophysics, and ecology
- **Computer Science** - Computer architecture, artificial intelligence, and software fundamentals
- **Earth Science** - Geography, geodesy, atmospheric chemistry, marine science, and geology
- **Materials Science** - Composite materials, metal materials, organic polymer materials, and material synthesis
## Submission Format

Submit your evaluation results as JSON files with the following format:

```json
{
  "submission_org": "Your Organization",
  "submission_email": "contact@example.com",
  "predictions": [
    {
      "original_question_id": 0,
      "content": ["answer1", "answer2", "answer3", "answer4"],
      "reasoning_content": ["reasoning1", "reasoning2", "reasoning3", "reasoning4"]
    }
  ]
}
```
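As an illustration, a submission file in this format can be assembled and sanity-checked with a short script. The field names (`submission_org`, `submission_email`, `predictions`, `original_question_id`, `content`, `reasoning_content`) come from the format above; the `build_submission` helper and the length check are an illustrative sketch, not part of the SAGE evaluation code.

```python
import json

def build_submission(org, email, predictions):
    """Assemble and minimally validate a SAGE submission dict (illustrative helper)."""
    for pred in predictions:
        # Assumption: each prediction pairs N answers with N reasoning traces.
        if len(pred["content"]) != len(pred["reasoning_content"]):
            raise ValueError(
                f"question {pred['original_question_id']}: "
                "content and reasoning_content lengths differ"
            )
    return {
        "submission_org": org,
        "submission_email": email,
        "predictions": predictions,
    }

submission = build_submission(
    "Your Organization",
    "contact@example.com",
    [
        {
            "original_question_id": 0,
            "content": ["answer1", "answer2", "answer3", "answer4"],
            "reasoning_content": ["reasoning1", "reasoning2", "reasoning3", "reasoning4"],
        }
    ],
)

# Write the file that gets uploaded through the Space interface.
with open("submission.json", "w", encoding="utf-8") as f:
    json.dump(submission, f, ensure_ascii=False, indent=2)
```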
## Key Features

- **Simplified Interface**: Clean, easy-to-use interface focused on SAGE benchmark results
- **Real-time Evaluation**: Immediate processing and scoring of submissions
- **Multi-domain Analysis**: Detailed breakdown across scientific domains
- **Persistent Leaderboard**: Results are automatically saved and persist across sessions
## Code Structure

- `src/about.py` - SAGE-specific task definitions and content
- `src/leaderboard/sage_eval.py` - SAGE evaluation logic and result processing
- `src/submission/sage_submit.py` - Simplified submission processing
- `initial_sage_results.json` - Benchmark results from major models