y0news

#reinforcement-learning News & Analysis

511 articles tagged with #reinforcement-learning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · 7h ago · 7/10

Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning

Researchers conducted a comprehensive empirical study of scaling laws for large language models during reinforcement learning post-training, using Qwen2.5 models ranging from 0.5B to 72B parameters. The study finds that larger models learn more efficiently, that performance can be predicted with power-law fits, and that data reuse is highly effective in data-constrained settings, providing practical guidelines for optimizing LLM reasoning capabilities.
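A power-law fit of the kind the study describes is linear in log-log space, so it can be sketched with a simple least-squares fit. The sizes and error rates below are made-up illustrative numbers, not results from the paper:

```python
import numpy as np

# Hypothetical (model size in billions of params, benchmark error rate) pairs;
# a power law error = a * size**(-b) becomes a line in log-log space.
sizes = np.array([0.5, 1.5, 7.0, 14.0, 32.0, 72.0])
errors = np.array([0.78, 0.62, 0.45, 0.38, 0.31, 0.26])

slope, intercept = np.polyfit(np.log(sizes), np.log(errors), 1)
a, b = np.exp(intercept), -slope  # error ~ a * size**(-b)

# Extrapolate to an unseen model size.
pred_err = a * 140.0 ** (-b)
print(f"exponent b = {b:.3f}, predicted error at 140B = {pred_err:.3f}")
```

With a fit like this, predicted error should keep shrinking as size grows, which is the practical use of such curves: choosing a model scale before training.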

AI · Bearish · arXiv – CS AI · 7h ago · 7/10

The Reasoning Trap: How Enhancing LLM Reasoning Amplifies Tool Hallucination

Researchers demonstrate that enhancing LLM reasoning capabilities through reinforcement learning paradoxically increases tool hallucination, where models incorrectly invoke non-existent or inappropriate tools. The study reveals a fundamental trade-off where stronger reasoning correlates with higher hallucination rates, suggesting current AI agent development approaches may inherently compromise reliability for capability.

🏢 OpenAI
AI · Bullish · arXiv – CS AI · 7h ago · 7/10

AscendKernelGen: A Systematic Study of LLM-Based Kernel Generation for Neural Processing Units

Researchers have developed AscendKernelGen, an LLM-based framework that dramatically improves code generation for neural processing units (NPUs) by combining domain-specific training data with reinforcement learning. The system achieves 95.5% compilation success on complex kernels, up from near-zero baseline performance, addressing a critical bottleneck in AI hardware optimization.

🏢 Hugging Face
AI · Bullish · arXiv – CS AI · 5d ago · 7/10

Does RLVR Extend Reasoning Boundaries? Investigating Capability Expansion in Vision-Language Models

Researchers introduce Ariadne, a framework demonstrating that Reinforcement Learning with Verifiable Rewards (RLVR) expands spatial reasoning capabilities in Vision-Language Models beyond their base distribution. Testing on synthetic mazes and real-world navigation benchmarks shows the technique enables models to solve previously unsolvable problems, suggesting genuine capability expansion rather than mere gains in sampling efficiency.

AI · Bullish · arXiv – CS AI · 5d ago · 7/10

Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following

Researchers propose a label-free self-supervised reinforcement learning framework that enables language models to follow complex multi-constraint instructions without external supervision. The approach derives reward signals directly from instructions and uses constraint decomposition strategies to address sparse reward challenges, demonstrating strong performance across both in-domain and out-of-domain instruction-following tasks.
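The constraint-decomposition idea in this summary can be made concrete with a toy reward function. This is an illustrative sketch, not the paper's implementation: the instruction's constraints (all hypothetical here) become independent checkers, and the reward is the fraction satisfied, which densifies an otherwise all-or-nothing signal:

```python
import re

def make_checkers():
    # Hypothetical constraints decomposed from one instruction.
    return {
        "max_50_words": lambda text: len(text.split()) <= 50,
        "mentions_python": lambda text: "python" in text.lower(),
        "no_digits": lambda text: re.search(r"\d", text) is None,
    }

def decomposed_reward(text, checkers):
    """Reward = fraction of constraints satisfied, in [0, 1]."""
    passed = [check(text) for check in checkers.values()]
    return sum(passed) / len(passed)

response = "Python is a popular language for scripting and data work."
print(decomposed_reward(response, make_checkers()))  # 1.0
```

A partially compliant response would score between 0 and 1 instead of failing outright, which is what makes the signal usable for RL.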

AI · Bullish · arXiv – CS AI · 5d ago · 7/10

CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception

Researchers introduce CropVLM, a reinforcement learning-based method that enables Vision-Language Models to dynamically focus on relevant image regions for improved fine-grained understanding tasks. The approach works with existing VLMs without modification and demonstrates significant performance gains on text recognition and document analysis without requiring human-labeled training data.

AI · Bullish · arXiv – CS AI · 6d ago · 7/10

Escaping the Context Bottleneck: Active Context Curation for LLM Agents via Reinforcement Learning

Researchers introduce ContextCurator, a reinforcement learning-based framework that decouples context management from task execution in LLM agents, addressing the context bottleneck problem. The approach pairs a lightweight specialized policy model with a frozen foundation model, achieving significant improvements in success rates and token efficiency across benchmark tasks.

🧠 GPT-4 · 🧠 Gemini
AI · Bearish · arXiv – CS AI · 6d ago · 7/10

Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards

Researchers identify systematic measurement flaws in reinforcement learning with verifiable rewards (RLVR) studies, revealing that widely reported performance gains are often inflated by budget mismatches, data contamination, and calibration drift rather than genuine capability improvements. The paper proposes rigorous evaluation standards to properly assess RLVR effectiveness in AI development.

AI · Bullish · arXiv – CS AI · 6d ago · 7/10

TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance

TimeRewarder is a new machine learning method that learns dense reward signals from passive videos to improve reinforcement learning in robotics. By modeling temporal distances between video frames, the approach achieves 90% success rates on Meta-World tasks using significantly fewer environment interactions than prior methods, while also leveraging human videos for scalable reward learning.
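One way to turn frame-wise temporal distances into a dense reward, roughly in the spirit of the summary above, is to reward the per-step *decrease* in the predicted distance to the goal frame. A minimal sketch, with made-up distance predictions standing in for a learned model:

```python
# Illustrative sketch: predicted_distances[t] is a model's estimate of the
# temporal distance (in frames) from step t to the goal frame. The per-step
# reward is the predicted progress, i.e. how much that distance shrinks.
def progress_rewards(predicted_distances):
    return [predicted_distances[t] - predicted_distances[t + 1]
            for t in range(len(predicted_distances) - 1)]

# A trajectory that mostly advances toward the goal, with one regression
# (the negative reward at step 2).
dists = [10.0, 8.5, 7.0, 7.5, 5.0, 2.0, 0.0]
print(progress_rewards(dists))  # [1.5, 1.5, -0.5, 2.5, 3.0, 2.0]
```

Because the reward is emitted every step rather than only at task completion, an RL agent gets far more learning signal per episode.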

AI · Bullish · arXiv – CS AI · 6d ago · 7/10

Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers

Researchers introduce RL^V, a reinforcement learning method that unifies LLM reasoners with generative verifiers to improve test-time compute scaling. The approach achieves over 20% accuracy gains on MATH benchmarks and enables 8-32x more efficient test-time scaling compared to existing RL methods by preserving and leveraging learned value functions.
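A common way to combine a reasoner with a verifier at test time, which the unified setup above can exploit, is verifier-weighted voting over sampled solutions. This sketch is a generic illustration of that scaling recipe, not the paper's exact procedure:

```python
from collections import defaultdict

def weighted_vote(samples):
    """samples: list of (answer, verifier_score) pairs; each candidate
    answer accumulates verifier scores instead of raw vote counts."""
    totals = defaultdict(float)
    for answer, score in samples:
        totals[answer] += score
    return max(totals, key=totals.get)

# Four hypothetical sampled solutions to the same math problem.
samples = [("42", 0.9), ("41", 0.2), ("42", 0.7), ("40", 0.95)]
print(weighted_vote(samples))  # "42" wins with 1.6 vs 0.95
```

Note that "40" has the single highest score but loses to the answer with the most total verifier mass, which is the point of weighting votes rather than taking a one-shot argmax.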

AI · Bullish · arXiv – CS AI · 6d ago · 7/10

Variance-Aware Prior-Based Tree Policies for Monte Carlo Tree Search

Researchers introduce Inverse-RPO, a methodology for deriving prior-based tree policies in Monte Carlo Tree Search from first principles, and apply it to create variance-aware UCT algorithms that outperform PUCT without additional computational overhead. This advances the theoretical foundation of MCTS used in reinforcement learning systems like AlphaZero.
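For background, the PUCT rule that the paper's variance-aware policies are compared against scores each child by its value estimate plus a prior-weighted exploration bonus. A minimal sketch of standard PUCT selection (node statistics here are hypothetical):

```python
import math

def puct_score(q, prior, parent_visits, child_visits, c=1.5):
    """AlphaZero-style PUCT: value + prior-weighted exploration bonus."""
    return q + c * prior * math.sqrt(parent_visits) / (1 + child_visits)

def select_child(children, parent_visits):
    return max(children,
               key=lambda ch: puct_score(ch["q"], ch["prior"],
                                         parent_visits, ch["visits"]))

children = [
    {"name": "a", "q": 0.5, "prior": 0.2, "visits": 10},
    {"name": "b", "q": 0.3, "prior": 0.7, "visits": 2},
]
# "b" is chosen: its lower value is outweighed by a strong prior
# and a small visit count.
print(select_child(children, parent_visits=12)["name"])
```

A variance-aware variant, as the summary describes, would modulate this bonus using uncertainty in the value estimates rather than visit counts alone.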

AI · Bullish · arXiv – CS AI · 6d ago · 7/10

GIANTS: Generative Insight Anticipation from Scientific Literature

Researchers introduce GIANTS, a framework for training language models to anticipate scientific breakthroughs by synthesizing insights from foundational papers. The team releases GiantsBench, a 17k-example benchmark across eight scientific domains, and GIANTS-4B, a 4B-parameter model that outperforms larger proprietary baselines by 34% while generalizing to unseen research areas.

AI · Bullish · arXiv – CS AI · 6d ago · 7/10

Instructing LLMs to Negotiate using Reinforcement Learning with Verifiable Rewards

Researchers demonstrate that Reinforcement Learning from Verifiable Rewards (RLVR) can train Large Language Models to negotiate effectively in incomplete-information games like price bargaining. A 30B parameter model trained with this method outperforms frontier models 10x its size and develops sophisticated persuasive strategies while generalizing to unseen negotiation scenarios.
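What makes bargaining suitable for RLVR is that the outcome is checkable directly from the transcript, so no learned judge is needed. A hedged sketch of one plausible verifiable reward (the normalization and role handling are hypothetical, not from the paper):

```python
def bargaining_reward(deal_price, seller_cost, buyer_budget, role):
    """Reward = the agent's share of the splittable surplus, in [0, 1]."""
    if deal_price is None:  # no deal reached
        return 0.0
    if role == "seller":
        surplus = deal_price - seller_cost
    else:  # buyer
        surplus = buyer_budget - deal_price
    total = buyer_budget - seller_cost  # maximum splittable surplus
    return max(0.0, surplus / total) if total > 0 else 0.0

# A seller with cost 50 closing at 80 against a buyer with budget 100
# captures 30 of the 50 units of surplus.
print(bargaining_reward(80.0, seller_cost=50.0, buyer_budget=100.0, role="seller"))
```

Since the reward is computed from game state alone, it cannot be gamed the way a learned reward model can, which is the core appeal of verifiable rewards.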

AI · Bullish · arXiv – CS AI · 6d ago · 7/10

Solving Physics Olympiad via Reinforcement Learning on Physics Simulators

Researchers demonstrate that physics simulators can generate synthetic training data for large language models, enabling them to learn physical reasoning without relying on scarce internet QA pairs. Models trained on simulated data show 5-10 percentage point improvements on International Physics Olympiad problems, suggesting simulators offer a scalable alternative for domain-specific AI training.
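The appeal of simulator-generated data is that answers are exact by construction. A toy sketch of the idea using a one-equation projectile "simulator" (the question template and ranges are invented for illustration):

```python
import random

def make_projectile_qa(rng):
    """Stamp out a QA pair whose answer is computed, not annotated."""
    v0 = rng.uniform(5.0, 30.0)  # launch speed in m/s, straight up
    g = 9.8
    t_peak = v0 / g  # time to apex from kinematics
    question = (f"A ball is thrown straight up at {v0:.1f} m/s. Ignoring air "
                f"resistance, after how many seconds does it reach its "
                f"highest point? (g = 9.8 m/s^2)")
    return {"question": question, "answer": round(t_peak, 2)}

rng = random.Random(0)
dataset = [make_projectile_qa(rng) for _ in range(3)]
for qa in dataset:
    print(qa["question"], "->", qa["answer"])
```

Scaling this pattern to full physics engines yields unlimited, contamination-free training pairs, which is what lets simulated data substitute for scarce internet QA.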

AI · Bullish · arXiv – CS AI · 6d ago · 7/10

Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning

Researchers propose Generative Actor-Critic (GenAC), a new approach to value modeling in large language model reinforcement learning that uses chain-of-thought reasoning instead of one-shot scalar predictions. The method addresses a longstanding challenge in credit assignment by improving value approximation and downstream RL performance compared to existing value-based and value-free baselines.

AI · Bullish · arXiv – CS AI · 6d ago · 7/10

From Topology to Trajectory: LLM-Driven World Models For Supply Chain Resilience

Researchers introduce ReflectiChain, an AI framework combining large language models with generative world models to improve semiconductor supply chain resilience against geopolitical disruptions. The system demonstrates 250% performance improvements over standard LLM approaches by integrating physical environmental constraints and autonomous policy learning, restoring operational capacity from 13.3% to 88.5% under extreme scenarios.

AI · Bullish · arXiv – CS AI · Apr 13 · 7/10

Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels

Researchers introduced Webscale-RL, a data pipeline that converts large-scale pre-training documents into 1.2 million diverse question-answer pairs for reinforcement learning training. The approach enables RL models to achieve pre-training-level performance with up to 100x fewer tokens, addressing a critical bottleneck in scaling RL data and potentially advancing more efficient language model development.

AI · Bullish · arXiv – CS AI · Apr 13 · 7/10

SkillFactory: Self-Distillation For Learning Cognitive Behaviors

SkillFactory is a novel fine-tuning method that enables language models to learn cognitive behaviors like verification and backtracking without requiring distillation from stronger models. The approach uses self-rearranged training samples during supervised fine-tuning to prime models for subsequent reinforcement learning, resulting in better generalization and robustness.

AI · Bullish · arXiv – CS AI · Apr 13 · 7/10

SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning

Researchers introduce SafeAdapt, a novel framework for updating reinforcement learning policies while maintaining provable safety guarantees across changing environments. The approach uses a 'Rashomon set' to identify safe parameter regions and projects policy updates onto this certified space, addressing the critical challenge of deploying RL agents in safety-critical applications where dynamics and objectives evolve over time.

AI · Bullish · arXiv – CS AI · Apr 13 · 7/10

Commanding Humanoid by Free-form Language: A Large Language Action Model with Unified Motion Vocabulary

Researchers introduce Humanoid-LLA, a Large Language Action Model enabling humanoid robots to execute complex physical tasks from natural language commands. The system combines a unified motion vocabulary, physics-aware controller, and reinforcement learning to achieve both language understanding and real-world robot control, demonstrating improved performance on Unitree G1 and Booster T1 humanoids.

AI · Bullish · arXiv – CS AI · Apr 13 · 7/10

The Two-Stage Decision-Sampling Hypothesis: Understanding the Emergence of Self-Reflection in RL-Trained LLMs

Researchers introduce the Two-Stage Decision-Sampling Hypothesis to explain how reinforcement learning enables self-reflection capabilities in large language models, demonstrating that RL's superior performance stems from improved decision-making rather than generation quality. The theory shows that reward gradients distribute asymmetrically across policy components, explaining why RL succeeds where supervised fine-tuning fails.

AI · Bullish · arXiv – CS AI · Apr 13 · 7/10

Listener-Rewarded Thinking in VLMs for Image Preferences

Researchers introduce a listener-augmented reinforcement learning framework for training vision-language models to better align with human visual preferences. By using an independent frozen model to evaluate and validate reasoning chains, the approach achieves 67.4% accuracy on ImageReward benchmarks and demonstrates significant improvements in out-of-distribution generalization.
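The listener idea can be sketched as a reward blend: the task reward is mixed with an independent frozen model's confidence that the reasoning chain supports the chosen preference. This is an illustrative sketch only; the blend weight and interface are hypothetical:

```python
def listener_augmented_reward(correct, listener_confidence, weight=0.5):
    """Blend the verifiable task reward with a frozen listener's
    confidence (in [0, 1]) that the reasoning supports the answer."""
    task_reward = 1.0 if correct else 0.0
    return (1 - weight) * task_reward + weight * listener_confidence

# A correct answer backed by convincing reasoning scores highest;
# convincing reasoning softens the penalty for a wrong answer.
print(listener_augmented_reward(correct=True, listener_confidence=0.8))
print(listener_augmented_reward(correct=False, listener_confidence=0.8))
```

Rewarding reasoning that persuades an independent model, not just correct final answers, is what drives the out-of-distribution gains the summary mentions.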

🏢 Hugging Face
AI · Bullish · arXiv – CS AI · Apr 13 · 7/10

Advantage-Guided Diffusion for Model-Based Reinforcement Learning

Researchers propose Advantage-Guided Diffusion (AGD-MBRL), a novel approach that improves model-based reinforcement learning by using advantage estimates to guide diffusion models during trajectory generation. The method addresses the short-horizon myopia problem in existing diffusion-based world models and demonstrates 2x performance improvements over current baselines on MuJoCo control tasks.

AI · Bullish · arXiv – CS AI · Apr 13 · 7/10

TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training

TensorHub introduces Reference-Oriented Storage (ROS), a novel weight transfer system that enables efficient reinforcement learning training across distributed GPU clusters without physically copying model weights. The production-deployed system achieves significant performance improvements, reducing GPU stall time by up to 6.7x for rollout operations and improving cross-datacenter transfers by 19x.

Page 1 of 21 · Next →