
#reinforcement-learning News & Analysis

511 articles tagged with #reinforcement-learning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 19
🧠

SocialNav: Training Human-Inspired Foundation Model for Socially-Aware Embodied Navigation

Researchers developed SocialNav, a foundation model for socially-aware robot navigation that uses a hierarchical architecture to understand social norms and generate compliant movement paths. Trained on 7 million samples, the model improved success rates by 38% and social compliance by 46% compared to existing methods.

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 16
🧠

SMAC: Score-Matched Actor-Critics for Robust Offline-to-Online Transfer

Researchers developed Score Matched Actor-Critic (SMAC), an offline reinforcement learning method designed to hand off to online RL without the performance drop that usually accompanies the switch. SMAC transferred successfully in all 6 D4RL tasks tested and reduced regret by 34-58% in 4 of the 6 environments compared to the best baselines.
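
The summary does not spell out SMAC's objective, so the sketch below illustrates the general offline-to-online handoff problem with a TD3+BC-style actor loss whose imitation weight is annealed away during online fine-tuning; SMAC's score-matching regularizer would take the place of the behavior-cloning term.

```python
import numpy as np

def actor_loss(q_values, policy_actions, data_actions, step, anneal_steps=50_000):
    """TD3+BC-style actor objective with an annealed imitation term.

    Offline, the behavior-cloning weight keeps the policy near the data
    distribution; as online fine-tuning proceeds, the weight decays toward
    zero, giving a smooth handoff instead of an abrupt objective switch.
    (Illustrative stand-in only, not SMAC's actual loss.)
    """
    bc_weight = max(0.0, 1.0 - step / anneal_steps)
    q_term = -np.mean(q_values)                               # maximize Q
    bc_term = np.mean((policy_actions - data_actions) ** 2)   # stay near data
    return q_term + bc_weight * bc_term
```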

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 21
🧠

DeepEyesV2: Toward Agentic Multimodal Model

DeepEyesV2 is an agentic multimodal AI model that combines text and image comprehension with external tools such as code execution and web search. The research introduces a two-stage training pipeline and the RealX-Bench evaluation framework, demonstrating improved real-world reasoning through adaptive tool invocation.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 12
🧠

See, Act, Adapt: Active Perception for Unsupervised Cross-Domain Visual Adaptation via Personalized VLM-Guided Agent

Researchers introduce Sea² (See, Act, Adapt), a novel approach that improves AI perception models in new environments by using an intelligent pose-control agent rather than retraining the models themselves. The method keeps perception modules frozen and uses a vision-language model as a controller, achieving significant performance improvements of 13-27% across visual tasks without requiring additional training data.

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 15
🧠

Learning to Generate Secure Code via Token-Level Rewards

Researchers developed Vul2Safe, a framework that uses large language models to generate secure code, addressing security vulnerabilities through self-reflection and token-level reinforcement learning. The approach introduces the PrimeVul+ dataset and the SRCode training framework to optimize security patterns in code generation more precisely.
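
Token-level rewards are the key mechanism here: rather than scoring a whole generated program with a single number, each token receives its own credit. A minimal sketch, assuming a hypothetical analyzer that flags vulnerable token spans (the function name, span format, and reward values are illustrative, not from the paper):

```python
def token_level_rewards(tokens, flagged_spans, r_ok=0.1, r_vuln=-1.0):
    """Per-token rewards for RL on generated code.

    Tokens inside spans flagged by a (hypothetical) security analyzer are
    penalized; the rest get a small positive reward. This localizes the
    training signal to the vulnerable pattern instead of the whole program.
    """
    rewards = [r_ok] * len(tokens)
    for start, end in flagged_spans:          # half-open [start, end) spans
        for i in range(start, min(end, len(tokens))):
            rewards[i] = r_vuln
    return rewards
```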

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 15
🧠

Portfolio Reinforcement Learning with Scenario-Context Rollout

Researchers developed a new portfolio reinforcement learning method, macro-conditioned scenario-context rollout (SCR), that addresses market regime shifts and distribution changes. The approach generates plausible return scenarios under stress events, improving Sharpe ratio by up to 76% and reducing maximum drawdown by 53%.
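
For context, the two reported metrics are standard. The snippet below computes both from a synthetic daily return series; only the formulas are standard, the data is made up:

```python
import numpy as np

def sharpe_ratio(returns, periods_per_year=252):
    """Annualized Sharpe ratio (risk-free rate assumed zero here)."""
    return np.sqrt(periods_per_year) * returns.mean() / returns.std(ddof=1)

def max_drawdown(returns):
    """Worst peak-to-trough decline of the cumulative-wealth curve."""
    wealth = np.cumprod(1.0 + returns)
    peaks = np.maximum.accumulate(wealth)
    return ((wealth - peaks) / peaks).min()   # e.g. -0.25 = 25% drawdown

daily = np.random.default_rng(0).normal(5e-4, 0.01, size=252)  # synthetic
print(sharpe_ratio(daily), max_drawdown(daily))
```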

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 14
🧠

Actor-Critic for Continuous Action Chunks: A Reinforcement Learning Framework for Long-Horizon Robotic Manipulation with Sparse Reward

Researchers introduced AC3 (Actor-Critic for Continuous Chunks), a new reinforcement learning framework that addresses challenges in long-horizon robotic manipulation tasks with sparse rewards. The system uses continuous action chunks with stabilization mechanisms and achieved superior performance on 25 benchmark tasks using minimal demonstrations.
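
The core idea of action chunking is simple to sketch: the policy is queried once for a chunk of H actions that are executed open-loop, shortening the effective decision horizon the critic must bridge under sparse rewards. The classic gym-style 4-tuple `step` interface below is an assumption, and AC3's stabilization mechanisms are not shown:

```python
def rollout_chunk(env, policy, obs, horizon=8):
    """Execute one chunk of `horizon` actions from a single policy query."""
    chunk = policy(obs)                # array of shape (horizon, action_dim)
    chunk_return, done = 0.0, False
    for action in chunk[:horizon]:
        obs, reward, done, _ = env.step(action)   # classic gym API assumed
        chunk_return += reward
        if done:
            break
    return obs, chunk_return, done
```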

AI · Neutral · arXiv – CS AI · Mar 2 · 7/10 · 22
🧠

Adversarial Fine-tuning in Offline-to-Online Reinforcement Learning for Robust Robot Control

Researchers developed an offline-to-online reinforcement learning framework that improves robot control robustness through adversarial fine-tuning. The method trains policies on clean datasets, then applies action perturbations during fine-tuning to build resilience against actuator faults and environmental uncertainties.
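
A minimal sketch of the perturbation step, with uniform bounded noise standing in for whatever perturbation scheme the authors actually use (`eps` and `fault_prob` are illustrative knobs, not values from the paper):

```python
import numpy as np

def perturb_action(action, eps=0.1, fault_prob=0.2, rng=None):
    """Inject actuator-style faults into actions during online fine-tuning."""
    rng = rng or np.random.default_rng()
    a = np.asarray(action, dtype=float).copy()
    if rng.random() < fault_prob:                    # simulated fault event
        a = a + rng.uniform(-eps, eps, size=a.shape)
    return np.clip(a, -1.0, 1.0)                     # respect action bounds
```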

AI · Neutral · arXiv – CS AI · Mar 2 · 7/10 · 13
🧠

Learning to maintain safety through expert demonstrations in settings with unknown constraints: A Q-learning perspective

Researchers propose SafeQIL, a new Q-learning algorithm that learns safe policies from expert demonstrations in constrained environments where safety constraints are unknown. The approach balances maximizing task rewards while maintaining safety by learning from demonstrated trajectories that successfully complete tasks without violating hidden constraints.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 15
🧠

OM2P: Offline Multi-Agent Mean-Flow Policy

Researchers propose OM2P, a new offline multi-agent reinforcement learning algorithm that achieves efficient one-step action sampling using mean-flow models. The approach delivers up to 3.8x reduction in GPU memory usage and 10.8x speed-up in training time compared to existing diffusion and flow-based models.
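
The efficiency claim comes from sampling in a single network call: a mean-flow model predicts the average velocity over the whole path from noise to action, so no iterative denoising loop is needed. A sketch, with `mean_flow` as a hypothetical stand-in for the learned network:

```python
import numpy as np

def one_step_action(mean_flow, noise):
    """Single-call sampling: a = z + u(z, t=0, r=1), where u is the learned
    average velocity from noise (t=0) to data (r=1). Diffusion policies
    typically need tens of network calls; mean flow needs one."""
    return noise + mean_flow(noise, t=0.0, r=1.0)

# illustration with a dummy "network" in place of OM2P's trained model
dummy_net = lambda z, t, r: -0.5 * z
action = one_step_action(dummy_net, np.random.default_rng(0).normal(size=4))
```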

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 14
🧠

Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance

Researchers propose SCOPE, a new framework for Reinforcement Learning with Verifiable Rewards (RLVR) that improves AI reasoning by salvaging partially correct solutions rather than discarding them entirely. By using step-wise correction to maintain exploration diversity, the method reaches 46.6% accuracy on math reasoning tasks and 53.4% on out-of-distribution problems.
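
The salvage step is the part worth sketching: keep the verified prefix of a failed rollout and resume exploration from there instead of throwing the whole trace away. `verify_step` below is a hypothetical per-step checker, not SCOPE's actual verifier:

```python
def salvage_prefix(steps, verify_step):
    """Return the longest verified prefix of a failed reasoning trace.

    `steps` is the list of reasoning steps from a rollout; generation then
    resumes from the salvaged prefix rather than from scratch.
    """
    good = []
    for step in steps:
        if not verify_step(good, step):   # first wrong step ends the prefix
            break
        good.append(step)
    return good
```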

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 15
🧠

Real-Time Aligned Reward Model beyond Semantics

Researchers introduce R2M (Real-Time Aligned Reward Model), a new framework for Reinforcement Learning from Human Feedback (RLHF) that addresses reward overoptimization in large language models. The system uses real-time policy feedback to better align reward models with evolving policy distributions during training.

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 16
🧠

Automating the Refinement of Reinforcement Learning Specifications

Researchers introduce AutoSpec, a framework that automatically refines reinforcement learning specifications to help AI agents learn complex tasks more effectively. The system improves coarse-grained logical specifications through exploration-guided strategies while maintaining specification soundness, demonstrating promising improvements in solving complex control tasks.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 16
🧠

Does Your Reasoning Model Implicitly Know When to Stop Thinking?

Researchers introduce SAGE (Self-Aware Guided Efficient Reasoning), a novel sampling paradigm that improves AI reasoning efficiency by helping large reasoning models know when to stop thinking. The approach addresses the problem of redundant, lengthy reasoning chains that don't improve accuracy while reducing computational costs and response times.
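
The stop-when-ready idea can be sketched as a generation loop with a self-assessment probe; `next_step` and `confident` below are hypothetical callables, not SAGE's actual sampling procedure:

```python
def generate_until_confident(next_step, confident, max_steps=64):
    """Emit reasoning steps until the model signals it already knows the
    answer, avoiding redundant chains that add cost but not accuracy."""
    trace = []
    for _ in range(max_steps):
        trace.append(next_step(trace))
        if confident(trace):             # implicit "I can stop now" signal
            break
    return trace
```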

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 13
🧠

CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

Researchers developed CUDA Agent, a reinforcement learning system for GPU kernel optimization whose generated kernels run twice as fast as torch.compile on benchmark tests. The system uses large-scale agentic RL with automated verification and profiling to improve CUDA kernel generation, addressing a critical bottleneck in deep learning performance.
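
A common shape for such a reward, shown here as an illustrative sketch rather than the paper's actual function: gate on the automated verification result, then reward measured speedup over a baseline:

```python
def kernel_reward(candidate_ms, baseline_ms, passed_verification):
    """Reward a generated CUDA kernel by profiled speedup over a baseline.

    Incorrect kernels get zero so the policy cannot trade correctness for
    speed; values above 1.0 mean the candidate beats the baseline.
    """
    if not passed_verification:
        return 0.0
    return baseline_ms / candidate_ms
```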

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 20
🧠

Stop Unnecessary Reflection: Training LRMs for Efficient Reasoning with Adaptive Reflection and Length Coordinated Penalty

Researchers developed ARLCP, a reinforcement learning framework that reduces unnecessary reflection in Large Reasoning Models, achieving 53% shorter responses while improving accuracy by 5.8% on smaller models. The method addresses computational inefficiencies in AI reasoning by dynamically balancing efficiency and accuracy through adaptive penalties.
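
A static version of a length-coordinated penalty is easy to sketch; ARLCP's actual penalty is adaptive, so the fixed `alpha` and `budget` below are hypothetical simplifications:

```python
def length_penalized_reward(correct, n_tokens, budget=512, alpha=0.5):
    """Task reward minus a penalty that grows once the response exceeds a
    token budget, discouraging reflection that adds length but not accuracy."""
    base = 1.0 if correct else 0.0
    overshoot = max(0, n_tokens - budget) / budget
    return base - alpha * overshoot
```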

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 22
🧠

RUMAD: Reinforcement-Unifying Multi-Agent Debate

Researchers introduce RUMAD, a reinforcement learning framework that optimizes multi-agent AI debate systems by dynamically controlling communication topology. The system achieves over 80% reduction in computational costs while improving reasoning accuracy across benchmark tests, with strong generalization capabilities across different task domains.

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 11
🧠

Foundation World Models for Agents that Learn, Verify, and Adapt Reliably Beyond Static Environments

Researchers propose a new framework for foundation world models that enables autonomous agents to learn, verify, and adapt reliably in dynamic environments. The approach combines reinforcement learning with formal verification and adaptive abstraction to create agents that can synthesize verifiable programs and maintain correctness while adapting to novel conditions.

AI · Neutral · arXiv – CS AI · Mar 2 · 7/10 · 15
🧠

What Makes a Reward Model a Good Teacher? An Optimization Perspective

Research reveals that reward model accuracy alone doesn't determine effectiveness in RLHF systems. The study proves that low reward variance can create flat optimization landscapes, making even perfectly accurate reward models inefficient teachers that underperform less accurate models with higher variance.
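
A toy numeric illustration (not from the paper) of the variance point: two reward models that rank four candidate responses identically, i.e. are equally "accurate", can induce very different gradient signal:

```python
import numpy as np

# Same ranking over four responses, different reward variance.
r_low  = np.array([0.50, 0.51, 0.52, 0.53])   # nearly flat rewards
r_high = np.array([0.10, 0.40, 0.70, 1.00])   # well-spread rewards
probs  = np.full(4, 0.25)                     # uniform starting policy

def advantage_scale(r, p):
    """Mean |advantage|, a rough proxy for policy-gradient magnitude."""
    return np.mean(np.abs(r - np.dot(p, r)))

print(advantage_scale(r_low, probs))    # 0.01 -> near-flat landscape
print(advantage_scale(r_high, probs))   # 0.30 -> 30x stronger signal
```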

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 13
🧠

RF-Agent: Automated Reward Function Design via Language Agent Tree Search

Researchers introduce RF-Agent, a framework that uses Large Language Models as agents to automatically design reward functions for control tasks through Monte Carlo Tree Search. The method improves upon existing approaches by better utilizing historical feedback and enhancing search efficiency across 17 diverse low-level control tasks.
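
The tree-search backbone is standard MCTS; the summary doesn't give RF-Agent's exact selection rule, so the classic UCT formula serves as an illustration of how candidate reward functions would be chosen for expansion:

```python
import math

def uct_score(total_value, visits, parent_visits, c=1.4):
    """Upper Confidence bound for Trees: exploit high-scoring reward-function
    candidates while still exploring rarely-tried ones."""
    if visits == 0:
        return float("inf")            # always try unvisited children first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)
```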

AI · Neutral · arXiv – CS AI · Feb 27 · 5/10 · 8
🧠

Soft Sequence Policy Optimization

Researchers introduce Soft Sequence Policy Optimization (SSPO), a new reinforcement learning method for training Large Language Models that improves upon existing policy optimization approaches. The technique uses soft gating functions and sequence-level importance sampling to enhance training stability and performance in mathematical reasoning tasks.
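
A sketch of the two named ingredients, with heavy caveats: whether SSPO length-normalizes its sequence ratio and what shape its gate takes are assumptions here, not details from the summary:

```python
import numpy as np

def sequence_ratio(logp_new, logp_old):
    """Sequence-level importance ratio from per-token log-probs, using the
    length-normalized (geometric-mean) form as an assumed convention."""
    return float(np.exp(np.mean(logp_new) - np.mean(logp_old)))

def soft_gate(ratio, width=0.2):
    """Smooth weight that fades out ratios far from 1.0, a soft stand-in
    for PPO's hard clip (illustrative shape, not SSPO's actual gate)."""
    return float(np.exp(-((ratio - 1.0) / width) ** 2))
```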

AI · Neutral · arXiv – CS AI · Feb 27 · 5/10 · 4
🧠

QSIM: Mitigating Overestimation in Multi-Agent Reinforcement Learning via Action Similarity Weighted Q-Learning

Researchers propose QSIM, a new framework that addresses systematic Q-value overestimation in multi-agent reinforcement learning by using action similarity weighted Q-learning instead of traditional greedy approaches. The method demonstrates improved performance and stability across various value decomposition algorithms through similarity-weighted target calculations.
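
The target computation can be sketched as replacing the greedy max, the source of the overestimation bias, with a similarity-weighted average; the softmax-over-similarity weighting below is an illustrative stand-in for QSIM's exact weights:

```python
import numpy as np

def similarity_weighted_target(q_next, sim, tau=1.0):
    """Weighted average of next-state Q-values in place of max_a Q(s', a).

    `sim[a]` is assumed to measure how similar action a is to the other
    candidate actions; softening the max tempers value overestimation.
    """
    w = np.exp(sim / tau)
    w /= w.sum()
    return float(np.dot(w, q_next))
```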

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 5
🧠

NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning

Researchers introduced NoRD (No Reasoning for Driving), a Vision-Language-Action model for autonomous driving that achieves competitive performance with 60% less training data and no reasoning annotations. The model incorporates the Dr. GRPO algorithm to overcome difficulty-bias issues in reinforcement learning, demonstrating successful results on the Waymo and NAVSIM benchmarks.
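
Dr. GRPO's fix for difficulty bias is concrete enough to sketch: vanilla GRPO divides each group's advantages by the group's reward standard deviation, which over-weights prompts that are very easy or very hard, and Dr. GRPO drops that normalization. The sketch reflects this general understanding of Dr. GRPO, not code from the NoRD paper:

```python
import numpy as np

def group_advantages(rewards, dr_grpo=True):
    """Group-relative advantages for one prompt's sampled responses."""
    r = np.asarray(rewards, dtype=float)
    adv = r - r.mean()                 # centered group-relative reward
    if not dr_grpo:
        adv = adv / (r.std() + 1e-8)   # vanilla GRPO's biased std scaling
    return adv
```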

AI · Neutral · arXiv – CS AI · Feb 27 · 5/10 · 7
🧠

Same Words, Different Judgments: Modality Effects on Preference Alignment

Researchers conducted a cross-modal study comparing human preference annotations between text and audio formats for AI alignment. The study found that while audio preferences are as reliable as text, different modalities lead to different judgment patterns, with synthetic ratings showing promise as replacements for human annotations.

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 6
🧠

LLM4Cov: Execution-Aware Agentic Learning for High-coverage Testbench Generation

Researchers have developed LLM4Cov, an offline learning framework that lets AI agents generate high-coverage hardware verification testbenches without expensive online reinforcement learning. A compact 4B-parameter model achieved a 69.2% coverage pass rate, outperforming larger models and demonstrating efficient learning from execution feedback in hardware verification tasks.