Better Process Supervision with Bi-directional Rewarding Signals Paper • 2503.04618 • Published Mar 6
AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning Paper • 2509.08755 • Published Sep 10 • 56
BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping Paper • 2510.18927 • Published Oct 21 • 83
Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction Paper • 2512.04987 • Published 2 days ago • 60
Pre-Trained Policy Discriminators are General Reward Models Paper • 2507.05197 • Published Jul 7 • 39
BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset Paper • 2507.03483 • Published Jul 4 • 23
DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting Paper • 2503.00784 • Published Mar 2 • 13
Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning Paper • 2402.05808 • Published Feb 8, 2024
CritiQ: Mining Data Quality Criteria from Human Preferences Paper • 2502.19279 • Published Feb 26 • 10
AgentGym: Evolving Large Language Model-based Agents across Diverse Environments Paper • 2406.04151 • Published Jun 6, 2024 • 24
Code Needs Comments: Enhancing Code LLMs with Comment Augmentation Paper • 2402.13013 • Published Feb 20, 2024 • 1
CoLLiE: Collaborative Training of Large Language Models in an Efficient Way Paper • 2312.00407 • Published Dec 1, 2023 • 3