Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis Paper • 2505.13227 • Published May 19 • 45
OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation Paper • 2506.07977 • Published Jun 9 • 41
LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming? Paper • 2506.11928 • Published Jun 13 • 23
SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification Paper • 2506.15569 • Published Jun 18 • 12
MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation Paper • 2506.14028 • Published Jun 16 • 93
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents Paper • 2506.11763 • Published Jun 13 • 72
VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning Paper • 2506.09049 • Published Jun 10 • 37
Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers Paper • 2507.02694 • Published Jul 3 • 19
REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once Paper • 2507.10541 • Published Jul 14 • 29
AgentsNet: Coordination and Collaborative Reasoning in Multi-Agent LLMs Paper • 2507.08616 • Published Jul 11 • 13
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations Paper • 2507.13302 • Published Jul 17 • 4
AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research Paper • 2507.13300 • Published Jul 17 • 19
DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering Paper • 2507.11527 • Published Jul 15 • 32
Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers Paper • 2507.10787 • Published Jul 14 • 12
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents Paper • 2508.13186 • Published Aug 14 • 18
AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions Paper • 2508.16402 • Published Aug 22 • 14
MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers Paper • 2508.14704 • Published Aug 20 • 43
NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model Paper • 2508.14444 • Published Aug 20 • 38
T2I-ReasonBench: Benchmarking Reasoning-Informed Text-to-Image Generation Paper • 2508.17472 • Published Aug 24 • 26
ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks Paper • 2508.15804 • Published Aug 14 • 15
MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers Paper • 2508.20453 • Published Aug 28 • 63
DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks Paper • 2509.01396 • Published Sep 1 • 57
On Robustness and Reliability of Benchmark-Based Evaluation of LLMs Paper • 2509.04013 • Published Sep 4 • 4
SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge Paper • 2509.07968 • Published Sep 9 • 14
LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering Paper • 2509.09614 • Published Sep 11 • 7
ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark Paper • 2501.01290 • Published Jan 2
MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use Paper • 2509.24002 • Published Sep 28 • 173
OceanGym: A Benchmark Environment for Underwater Embodied Agents Paper • 2509.26536 • Published Sep 30 • 34
PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs Paper • 2510.09507 • Published Oct 10 • 10
PICABench: How Far Are We from Physically Realistic Image Editing? Paper • 2510.17681 • Published Oct 20 • 62
LiveTradeBench: Seeking Real-World Alpha with Large Language Models Paper • 2511.03628 • Published Nov 5 • 11
Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks Paper • 2511.15065 • Published Nov 2025 • 74
M3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark Paper • 2511.17729 • Published Nov 2025 • 16
Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward Paper • 2511.20561 • Published Nov 2025 • 31
RefineBench: Evaluating Refinement Capability of Language Models via Checklists Paper • 2511.22173 • Published Nov 2025 • 12