Synthesizing Agentic Data for Web Agents with Progressive Difficulty Enhancement Mechanisms Paper • 2510.13913 • Published Oct 15 • 3
Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math Paper • 2510.13744 • Published Oct 15 • 5
SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents Paper • 2509.06283 • Published Sep 8 • 17
Concentration of Measure for Distributions Generated via Diffusion Models Paper • 2501.07741 • Published Jan 13
LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation Paper • 2501.05414 • Published Jan 9 • 2
ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models Paper • 2505.13444 • Published May 19 • 17
CodeUpdateArena: Benchmarking Knowledge Editing on API Updates Paper • 2407.06249 • Published Jul 8, 2024
FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows" Paper • 2410.03727 • Published Sep 30, 2024 • 2
Automating Human Tutor-Style Programming Feedback: Leveraging GPT-4 Tutor Model for Hint Generation and GPT-3.5 Student Model for Hint Validation Paper • 2310.03780 • Published Oct 5, 2023
TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models Paper • 2502.06608 • Published Feb 10 • 39
PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models Paper • 2502.01584 • Published Feb 3 • 9
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation Paper • 2411.07975 • Published Nov 12, 2024 • 31
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation Paper • 2410.13848 • Published Oct 17, 2024 • 35
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics Paper • 2102.01672 • Published Feb 2, 2021