511 articles tagged with #reinforcement-learning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AIBullisharXiv β CS AI Β· Mar 27/1019
π§ Researchers developed SocialNav, a foundation model for socially-aware robot navigation that uses a hierarchical architecture to understand social norms and generate compliant movement paths. The model was trained on 7 million samples and achieved 38% better success rates and 46% improved social compliance compared to existing methods.
AIBullisharXiv β CS AI Β· Mar 27/1016
π§ Researchers developed Score Matched Actor-Critic (SMAC), a new offline reinforcement learning method that enables smooth transition to online RL algorithms without performance drops. SMAC achieved successful transfer in all 6 D4RL tasks tested and reduced regret by 34-58% in 4 of 6 environments compared to best baselines.
AIBullisharXiv β CS AI Β· Mar 27/1021
π§ DeepEyesV2 is a new agentic multimodal AI model that combines text and image comprehension with external tool integration like code execution and web search. The research introduces a two-stage training pipeline and RealX-Bench evaluation framework, demonstrating improved real-world reasoning capabilities through adaptive tool invocation.
AIBullisharXiv β CS AI Β· Mar 26/1012
π§ Researchers introduce SeaΒ² (See, Act, Adapt), a novel approach that improves AI perception models in new environments by using an intelligent pose-control agent rather than retraining the models themselves. The method keeps perception modules frozen and uses a vision-language model as a controller, achieving significant performance improvements of 13-27% across visual tasks without requiring additional training data.
AIBullisharXiv β CS AI Β· Mar 27/1015
π§ Researchers have developed Vul2Safe, a new framework for generating secure code using large language models, which addresses security vulnerabilities through self-reflection and token-level reinforcement learning. The approach introduces the PrimeVul+ dataset and SRCode training framework to provide more precise optimization of security patterns in code generation.
AIBullisharXiv β CS AI Β· Mar 27/1015
π§ Researchers developed a new portfolio reinforcement learning method called macro-conditioned scenario-context rollout (SCR) that addresses market regime shifts and distribution changes. The approach generates plausible return scenarios under stress events and improves portfolio performance by up to 76% in Sharpe ratio and reduces maximum drawdown by 53%.
AIBullisharXiv β CS AI Β· Mar 26/1014
π§ Researchers introduced AC3 (Actor-Critic for Continuous Chunks), a new reinforcement learning framework that addresses challenges in long-horizon robotic manipulation tasks with sparse rewards. The system uses continuous action chunks with stabilization mechanisms and achieved superior performance on 25 benchmark tasks using minimal demonstrations.
AINeutralarXiv β CS AI Β· Mar 27/1022
π§ Researchers developed an offline-to-online reinforcement learning framework that improves robot control robustness through adversarial fine-tuning. The method trains policies on clean datasets then applies action perturbations during fine-tuning to build resilience against actuator faults and environmental uncertainties.
AINeutralarXiv β CS AI Β· Mar 27/1013
π§ Researchers propose SafeQIL, a new Q-learning algorithm that learns safe policies from expert demonstrations in constrained environments where safety constraints are unknown. The approach balances maximizing task rewards while maintaining safety by learning from demonstrated trajectories that successfully complete tasks without violating hidden constraints.
AIBullisharXiv β CS AI Β· Mar 26/1015
π§ Researchers propose OM2P, a new offline multi-agent reinforcement learning algorithm that achieves efficient one-step action sampling using mean-flow models. The approach delivers up to 3.8x reduction in GPU memory usage and 10.8x speed-up in training time compared to existing diffusion and flow-based models.
AIBullisharXiv β CS AI Β· Mar 26/1014
π§ Researchers propose SCOPE, a new framework for Reinforcement Learning from Verifiable Rewards (RLVR) that improves AI reasoning by salvaging partially correct solutions rather than discarding them entirely. The method achieves 46.6% accuracy on math reasoning tasks and 53.4% on out-of-distribution problems by using step-wise correction to maintain exploration diversity.
AIBullisharXiv β CS AI Β· Mar 27/1015
π§ Researchers introduce R2M (Real-Time Aligned Reward Model), a new framework for Reinforcement Learning from Human Feedback (RLHF) that addresses reward overoptimization in large language models. The system uses real-time policy feedback to better align reward models with evolving policy distributions during training.
AIBullisharXiv β CS AI Β· Mar 27/1016
π§ Researchers introduce AutoSpec, a framework that automatically refines reinforcement learning specifications to help AI agents learn complex tasks more effectively. The system improves coarse-grained logical specifications through exploration-guided strategies while maintaining specification soundness, demonstrating promising improvements in solving complex control tasks.
AIBullisharXiv β CS AI Β· Mar 26/1016
π§ Researchers introduce SAGE (Self-Aware Guided Efficient Reasoning), a novel sampling paradigm that improves AI reasoning efficiency by helping large reasoning models know when to stop thinking. The approach addresses the problem of redundant, lengthy reasoning chains that don't improve accuracy while reducing computational costs and response times.
AIBullisharXiv β CS AI Β· Mar 27/1013
π§ Researchers developed CUDA Agent, a reinforcement learning system that significantly outperforms existing methods for GPU kernel optimization, achieving 100% faster performance than torch.compile on benchmark tests. The system uses large-scale agentic RL with automated verification and profiling to improve CUDA kernel generation, addressing a critical bottleneck in deep learning performance.
AIBullisharXiv β CS AI Β· Mar 26/1020
π§ Researchers developed ARLCP, a reinforcement learning framework that reduces unnecessary reflection in Large Reasoning Models, achieving 53% shorter responses while improving accuracy by 5.8% on smaller models. The method addresses computational inefficiencies in AI reasoning by dynamically balancing efficiency and accuracy through adaptive penalties.
AIBullisharXiv β CS AI Β· Mar 26/1022
π§ Researchers introduce RUMAD, a reinforcement learning framework that optimizes multi-agent AI debate systems by dynamically controlling communication topology. The system achieves over 80% reduction in computational costs while improving reasoning accuracy across benchmark tests, with strong generalization capabilities across different task domains.
AIBullisharXiv β CS AI Β· Mar 27/1011
π§ Researchers propose a new framework for foundation world models that enables autonomous agents to learn, verify, and adapt reliably in dynamic environments. The approach combines reinforcement learning with formal verification and adaptive abstraction to create agents that can synthesize verifiable programs and maintain correctness while adapting to novel conditions.
AINeutralarXiv β CS AI Β· Mar 27/1015
π§ Research reveals that reward model accuracy alone doesn't determine effectiveness in RLHF systems. The study proves that low reward variance can create flat optimization landscapes, making even perfectly accurate reward models inefficient teachers that underperform less accurate models with higher variance.
AIBullisharXiv β CS AI Β· Mar 26/1013
π§ Researchers introduce RF-Agent, a framework that uses Large Language Models as agents to automatically design reward functions for control tasks through Monte Carlo Tree Search. The method improves upon existing approaches by better utilizing historical feedback and enhancing search efficiency across 17 diverse low-level control tasks.
AINeutralarXiv β CS AI Β· Feb 275/108
π§ Researchers introduce Soft Sequence Policy Optimization (SSPO), a new reinforcement learning method for training Large Language Models that improves upon existing policy optimization approaches. The technique uses soft gating functions and sequence-level importance sampling to enhance training stability and performance in mathematical reasoning tasks.
AINeutralarXiv β CS AI Β· Feb 275/104
π§ Researchers propose QSIM, a new framework that addresses systematic Q-value overestimation in multi-agent reinforcement learning by using action similarity weighted Q-learning instead of traditional greedy approaches. The method demonstrates improved performance and stability across various value decomposition algorithms through similarity-weighted target calculations.
$NEAR
AIBullisharXiv β CS AI Β· Feb 276/105
π§ Researchers introduced NoRD (No Reasoning for Driving), a Vision-Language-Action model for autonomous driving that achieves competitive performance using 60% less training data and no reasoning annotations. The model incorporates Dr. GRPO algorithm to overcome difficulty bias issues in reinforcement learning, demonstrating successful results on Waymo and NAVSIM benchmarks.
AINeutralarXiv β CS AI Β· Feb 275/107
π§ Researchers conducted a cross-modal study comparing human preference annotations between text and audio formats for AI alignment. The study found that while audio preferences are as reliable as text, different modalities lead to different judgment patterns, with synthetic ratings showing promise as replacements for human annotations.
$NEAR
AIBullisharXiv β CS AI Β· Feb 276/106
π§ Researchers have developed LLM4Cov, an offline learning framework that enables AI agents to generate high-coverage hardware verification testbenches without expensive online reinforcement learning. A compact 4B-parameter model achieved 69.2% coverage pass rate, outperforming larger models by demonstrating efficient learning from execution feedback in hardware verification tasks.