y0news
#exploration2 articles
2 articles
AIBullisharXiv โ€“ CS AI ยท 4h ago6
๐Ÿง 

Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance

Researchers propose SCOPE, a new framework for Reinforcement Learning from Verifiable Rewards (RLVR) that improves AI reasoning by salvaging partially correct solutions rather than discarding them entirely. The method achieves 46.6% accuracy on math reasoning tasks and 53.4% on out-of-distribution problems by using step-wise correction to maintain exploration diversity.

AINeutralarXiv โ€“ CS AI ยท 4h ago0
๐Ÿง 

Adaptive Correlation-Weighted Intrinsic Rewards for Reinforcement Learning

Researchers propose ACWI, a new reinforcement learning framework that dynamically balances intrinsic and extrinsic rewards through adaptive scaling coefficients. The system uses a lightweight Beta Network to optimize exploration in sparse reward environments, demonstrating improved sample efficiency and stability in MiniGrid experiments.