PRL: Process Reward Learning Improves LLMs' Reasoning Ability and Broadens the Reasoning Boundary Paper • 2601.10201 • Published 4 days ago • 7
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization Paper • 2601.05242 • Published 11 days ago • 194
GAR: Generative Adversarial Reinforcement Learning for Formal Theorem Proving Paper • 2510.11769 • Published Oct 13, 2025 • 25
Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models Paper • 2506.18945 • Published Jun 23, 2025 • 40
Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning Paper • 2506.13654 • Published Jun 16, 2025 • 43
ToMAP: Training Opponent-Aware LLM Persuaders with Theory of Mind Paper • 2505.22961 • Published May 29, 2025 • 8
Time-R1: Towards Comprehensive Temporal Reasoning in LLMs Paper • 2505.13508 • Published May 16, 2025 • 15
A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce Paper • 2504.11343 • Published Apr 15, 2025 • 19
Rethinking Diverse Human Preference Learning through Principal Component Analysis Paper • 2502.13131 • Published Feb 18, 2025 • 37
Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL Paper • 2505.02391 • Published May 5, 2025 • 25
MaxwellJryao/sft_loraMoE_wiki_hop_original_choose_best_object_affirmative_1-lora-sft_Qwen2-1.5B_lr-1e-3 Updated Sep 5, 2024
Post-training-Data-Flywheel/NousResearch-hermes-function-calling-v1 Viewer • Updated Aug 30, 2024 • 1.89k • 6
Post-training-Data-Flywheel/glaiveai-glaive-function-calling-v2 Viewer • Updated Aug 23, 2024 • 75.2k • 1 • 1
Post-training-Data-Flywheel/ise-uiuc-Magicoder-OSS-Instruct-75K Viewer • Updated Aug 23, 2024 • 75.2k • 4