AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers propose Proximal Supervised Fine-Tuning (PSFT), a new method that applies trust-region constraints from reinforcement learning to improve how foundation models adapt to new tasks. The technique maintains model capabilities while fine-tuning, outperforming standard supervised fine-tuning on out-of-domain generalization tasks.
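To make the trust-region idea concrete, here is a minimal sketch of one way such a constraint could be applied to supervised fine-tuning: a PPO-style clipped probability ratio with the advantage fixed to 1. The summary does not give the objective, so the function name `clipped_sft_objective`, the clipping range `eps`, and the exact form of the surrogate are assumptions for illustration, not the paper's stated method.

```python
import math

def clipped_sft_objective(new_logps, old_logps, eps=0.2):
    """Mean clipped surrogate over target tokens (hypothetical sketch).

    new_logps: log-probs of the target tokens under the model being trained.
    old_logps: log-probs of the same tokens under the frozen pre-update model.
    eps: assumed trust-region width, borrowed from PPO's usual default.
    """
    total = 0.0
    for new_lp, old_lp in zip(new_logps, old_logps):
        ratio = math.exp(new_lp - old_lp)                # pi_new / pi_old
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))  # keep ratio near 1
        total += min(ratio, clipped)                     # advantage fixed to 1
    return total / len(new_logps)

# A token whose probability doubled contributes only 1 + eps.
print(clipped_sft_objective([math.log(0.5)], [math.log(0.25)]))
```

Under this assumed form, a token's contribution stops growing once its ratio passes `1 + eps`, which is what would discourage the fine-tuned model from drifting far from its starting point.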
AI · Bullish · arXiv – CS AI · Mar 16 · 7/10
🧠Researchers developed a new reinforcement learning approach for training diffusion language models that uses entropy-guided step selection and stepwise advantages to overcome challenges with sequence-level likelihood calculations. The method achieves state-of-the-art results on coding and logical reasoning benchmarks while being more computationally efficient than existing approaches.
AI · Bullish · arXiv – CS AI · Mar 5 · 7/10
🧠Researchers developed HALyPO (Heterogeneous-Agent Lyapunov Policy Optimization), a new approach to improve stability in human-robot collaboration through multi-agent reinforcement learning. The method addresses the 'rationality gap' between human and robot learning by using Lyapunov stability conditions to prevent policy oscillations and divergence during training.
AI · Neutral · arXiv – CS AI · Mar 4 · 7/10 · 3
🧠New research provides theoretical analysis of reinforcement learning's impact on Large Language Model planning capabilities, revealing that RL improves generalization through exploration while supervised fine-tuning may create spurious solutions. The study shows Q-learning maintains output diversity better than policy gradient methods, with findings validated on real-world planning benchmarks.
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠Researchers develop a self-play reinforcement learning framework for Big 2, a four-player imperfect-information card game, demonstrating that PPO outperforms value-based methods under controlled conditions. The study reveals that entropy regularization and current-policy self-play improve agent performance, establishing Big 2 as a useful benchmark for testing deep RL in complex multi-agent environments with hidden information and variable action spaces.
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠Researchers introduce Recurrent Structural Policy Gradient (RSPG), an algorithmic advancement for solving Mean Field Games with partial observability by combining policy gradient methods with structural knowledge of system dynamics. The method achieves significantly faster convergence than model-free approaches while enabling history-aware behavior, accompanied by MFAX, a new JAX-based research framework for MFG implementations.
AI · Bullish · arXiv – CS AI · 3d ago · 6/10
🧠Researchers introduce ProRL, a reinforcement learning framework designed to improve proactive recommender systems that guide users toward target items through sequential recommendations. The approach addresses fundamental gradient estimation problems in policy learning by implementing stepwise reward centering and position-specific advantage estimation, demonstrating superior performance on real-world datasets.
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠Researchers identify critical failure modes in policy-gradient reinforcement learning methods when applied to long-horizon problems with cumulative damage, where short-term attractive actions lead to long-term negative outcomes. The study proposes a decomposition framework separating completion (reaching terminal horizon) from optimality (achieving dynamic-programming benchmarks) and validates predictions across two distinct domains: career planning and sports performance.
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠Researchers propose Credit-Assigned Policy Gradient (CA-PG), a new machine learning technique that solves the variance problem in training early-stage rankers for two-stage retrieval systems. By computing gradients with respect to individual item selection probability rather than entire candidate sets, CA-PG enables scalable end-to-end training of search and recommendation systems.
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠Researchers introduce SDPG, a visual reinforcement learning method that trains robotic control policies significantly faster and more efficiently on consumer GPUs. The approach reduces computational overhead through stochastic gradient estimation while maintaining superior performance, and includes new benchmarks for advancing visual robotics research.
🏢 Nvidia
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠Researchers propose PANDA, a novel bilevel optimization algorithm for reinforcement learning that handles competitive multi-agent scenarios modeled as zero-sum Markov games. The method achieves state-of-the-art convergence rates without requiring second-order derivatives, advancing RL applications in incentive design and competitive environments.
AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠Researchers provide theoretical foundations for Reinforcement Learning with Verifiable Rewards (RLVR), a technique for post-training large language models using binary feedback. The analysis introduces the 'Gradient Gap' concept to explain convergence dynamics and derives critical step-size thresholds that determine whether training succeeds or fails, with implications for practical implementations like length normalization.
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers introduce Slow-Fast Policy Optimization (SFPO), a new reinforcement learning framework that improves training stability and efficiency for large language model reasoning. SFPO outperforms existing methods like GRPO by up to 2.80 points on math benchmarks while requiring up to 4.93x fewer rollouts and 4.19x less training time.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 8
🧠Researchers have developed L-REINFORCE, a novel reinforcement learning algorithm that provides probabilistic stability guarantees for control systems using finite data samples. The approach bridges reinforcement learning and control theory by extending the classical REINFORCE algorithm with Lyapunov stability methods, demonstrating superior performance in CartPole simulations.
AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 4
🧠Researchers introduce a new reinforcement learning framework called Distributions-as-Actions (DA) that treats parameterized action distributions as actions, making all action spaces continuous regardless of original type. The approach includes a new policy gradient estimator (DA-PG) with lower variance and a practical actor-critic algorithm (DA-AC) that shows competitive performance across discrete, continuous, and hybrid control tasks.
AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 14
🧠Researchers propose Trust Region Masking (TRM) to address off-policy mismatch problems in Large Language Model reinforcement learning pipelines. The method provides the first non-vacuous monotonic improvement guarantees for long-horizon LLM-RL tasks by masking entire sequences that violate trust region constraints.
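As a rough illustration of sequence-level masking, the sketch below drops an entire sequence from the update whenever any of its tokens has a policy-to-behavior probability ratio outside an assumed band of `[1 - eps, 1 + eps]`. The summary does not specify the violation test, so the function names, the ratio-band criterion, and the `eps` value are all hypothetical.

```python
import math

def sequence_mask(policy_logps, behavior_logps, eps=0.2):
    """Return 1 if every token ratio stays within [1 - eps, 1 + eps], else 0.

    policy_logps / behavior_logps: per-token log-probs of one sampled sequence
    under the current policy and the policy that generated it (assumed inputs).
    """
    for p, b in zip(policy_logps, behavior_logps):
        ratio = math.exp(p - b)
        if ratio < 1.0 - eps or ratio > 1.0 + eps:
            return 0  # one violating token masks the whole sequence
    return 1

def masked_batch(batch, eps=0.2):
    """Compute one 0/1 mask per (policy_logps, behavior_logps) pair."""
    return [sequence_mask(p, b, eps) for p, b in batch]

batch = [
    ([-1.0, -2.0], [-1.0, -2.0]),  # on-policy: kept
    ([-1.0, -2.0], [-1.0, -3.0]),  # second token ratio is e: dropped
]
print(masked_batch(batch))
```

Masking per sequence rather than per token is the distinction the summary emphasizes; token-level clipping would instead keep the non-violating tokens of the second sequence.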
AI · Neutral · OpenAI News · Mar 20 · 3/10 · 5
🧠This appears to be a research paper on policy gradient methods in reinforcement learning, specifically focusing on variance reduction techniques using action-dependent factorized baselines. The article lacks content details, making it difficult to assess specific findings or implications.
AI · Neutral · Hugging Face Blog · Jun 30 · 1/10 · 3
🧠The article appears to be about implementing policy gradient algorithms using the PyTorch framework. However, the article body is empty, making it impossible to provide meaningful analysis of the content or its implications.