y0news
#reinforcement-learning · 35 articles
AI · Bullish · arXiv – CS AI · 9h ago · 8

Real-Time Aligned Reward Model beyond Semantics

Researchers introduce R2M (Real-Time Aligned Reward Model), a new framework for Reinforcement Learning from Human Feedback (RLHF) that addresses reward overoptimization in large language models. The system uses real-time policy feedback to better align reward models with evolving policy distributions during training.

AI · Neutral · arXiv – CS AI · 9h ago · 8

What Makes a Reward Model a Good Teacher? An Optimization Perspective

Research reveals that reward model accuracy alone doesn't determine effectiveness in RLHF systems. The study proves that low reward variance can create flat optimization landscapes, making even perfectly accurate reward models inefficient teachers that underperform less accurate models with higher variance.

AI · Bullish · arXiv – CS AI · 9h ago · 5

SMAC: Score-Matched Actor-Critics for Robust Offline-to-Online Transfer

Researchers developed Score-Matched Actor-Critic (SMAC), a new offline reinforcement learning method that enables a smooth transition to online RL algorithms without performance drops. SMAC transferred successfully in all 6 D4RL tasks tested and reduced regret by 34-58% in 4 of the 6 environments compared to the best baselines.

AI · Bullish · arXiv – CS AI · 9h ago · 5

OM2P: Offline Multi-Agent Mean-Flow Policy

Researchers propose OM2P, a new offline multi-agent reinforcement learning algorithm that achieves efficient one-step action sampling using mean-flow models. The approach delivers up to a 3.8x reduction in GPU memory usage and a 10.8x speed-up in training time compared to existing diffusion- and flow-based models.

AI · Bullish · arXiv – CS AI · 9h ago · 4

Learning to Generate Secure Code via Token-Level Rewards

Researchers have developed Vul2Safe, a new framework for generating secure code using large language models, which addresses security vulnerabilities through self-reflection and token-level reinforcement learning. The approach introduces the PrimeVul+ dataset and SRCode training framework to provide more precise optimization of security patterns in code generation.

AI · Bullish · arXiv – CS AI · 9h ago · 8

Stop Unnecessary Reflection: Training LRMs for Efficient Reasoning with Adaptive Reflection and Length Coordinated Penalty

Researchers developed ARLCP, a reinforcement learning framework that reduces unnecessary reflection in Large Reasoning Models, achieving 53% shorter responses while improving accuracy by 5.8% on smaller models. The method addresses computational inefficiencies in AI reasoning by dynamically balancing efficiency and accuracy through adaptive penalties.

AI · Bullish · arXiv – CS AI · 9h ago · 6

Actor-Critic for Continuous Action Chunks: A Reinforcement Learning Framework for Long-Horizon Robotic Manipulation with Sparse Reward

Researchers introduced AC3 (Actor-Critic for Continuous Chunks), a new reinforcement learning framework that addresses challenges in long-horizon robotic manipulation tasks with sparse rewards. The system uses continuous action chunks with stabilization mechanisms and achieved superior performance on 25 benchmark tasks using minimal demonstrations.

AI · Bullish · arXiv – CS AI · 9h ago · 6

Embodiment-Aware Generalist Specialist Distillation for Unified Humanoid Whole-Body Control

Researchers introduce EAGLE, a reinforcement learning framework that creates unified control policies for multiple different humanoid robots without per-robot tuning. The system uses iterative generalist-specialist distillation to enable a single AI controller to manage diverse humanoid embodiments and support complex behaviors beyond basic walking.

AI · Bullish · arXiv – CS AI · 9h ago · 4

Portfolio Reinforcement Learning with Scenario-Context Rollout

Researchers developed a new portfolio reinforcement learning method called macro-conditioned scenario-context rollout (SCR) that addresses market regime shifts and distribution changes. The approach generates plausible return scenarios under stress events, improving the Sharpe ratio by up to 76% and reducing maximum drawdown by 53%.

AI · Bullish · arXiv – CS AI · 9h ago · 5

Trust Region Masking for Long-Horizon LLM Reinforcement Learning

Researchers propose Trust Region Masking (TRM) to address off-policy mismatch problems in Large Language Model reinforcement learning pipelines. The method provides the first non-vacuous monotonic improvement guarantees for long-horizon LLM-RL tasks by masking entire sequences that violate trust region constraints.

AI · Bullish · arXiv – CS AI · 9h ago · 7

Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance

Researchers propose SCOPE, a new framework for Reinforcement Learning from Verifiable Rewards (RLVR) that improves AI reasoning by salvaging partially correct solutions rather than discarding them entirely. The method achieves 46.6% accuracy on math reasoning tasks and 53.4% on out-of-distribution problems by using step-wise correction to maintain exploration diversity.

AI · Bullish · arXiv – CS AI · 9h ago · 6

Automating the Refinement of Reinforcement Learning Specifications

Researchers introduce AutoSpec, a framework that automatically refines reinforcement learning specifications to help AI agents learn complex tasks more effectively. The system improves coarse-grained logical specifications through exploration-guided strategies while maintaining specification soundness, demonstrating promising improvements in solving complex control tasks.

AI · Neutral · arXiv – CS AI · 9h ago · 3

Learning to maintain safety through expert demonstrations in settings with unknown constraints: A Q-learning perspective

Researchers propose SafeQIL, a new Q-learning algorithm that learns safe policies from expert demonstrations in constrained environments where the safety constraints are unknown. The approach balances maximizing task reward against maintaining safety by learning from demonstrated trajectories that complete tasks without violating the hidden constraints.

AI · Bullish · arXiv – CS AI · 9h ago · 6

Does Your Reasoning Model Implicitly Know When to Stop Thinking?

Researchers introduce SAGE (Self-Aware Guided Efficient Reasoning), a novel sampling paradigm that improves AI reasoning efficiency by helping large reasoning models know when to stop thinking. The approach targets redundant, lengthy reasoning chains that don't improve accuracy, reducing computational costs and response times.

AI · Bullish · arXiv – CS AI · 9h ago · 4

Foundation World Models for Agents that Learn, Verify, and Adapt Reliably Beyond Static Environments

Researchers propose a new framework for foundation world models that enables autonomous agents to learn, verify, and adapt reliably in dynamic environments. The approach combines reinforcement learning with formal verification and adaptive abstraction to create agents that can synthesize verifiable programs and maintain correctness while adapting to novel conditions.

AI · Bullish · arXiv – CS AI · 9h ago · 2

See, Act, Adapt: Active Perception for Unsupervised Cross-Domain Visual Adaptation via Personalized VLM-Guided Agent

Researchers introduce Sea² (See, Act, Adapt), a novel approach that improves AI perception models in new environments by using an intelligent pose-control agent rather than retraining the models themselves. The method keeps perception modules frozen and uses a vision-language model as the controller, achieving performance improvements of 13-27% across visual tasks without requiring additional training data.

AI · Bullish · arXiv – CS AI · 9h ago · 5

CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

Researchers developed CUDA Agent, a reinforcement learning system that significantly outperforms existing methods for GPU kernel optimization, achieving 100% faster performance than torch.compile on benchmark tests. The system uses large-scale agentic RL with automated verification and profiling to improve CUDA kernel generation, addressing a critical bottleneck in deep learning performance.

AI · Bullish · arXiv – CS AI · 9h ago · 4

RF-Agent: Automated Reward Function Design via Language Agent Tree Search

Researchers introduce RF-Agent, a framework that uses Large Language Models as agents to automatically design reward functions for control tasks through Monte Carlo Tree Search. The method improves upon existing approaches by better utilizing historical feedback and enhancing search efficiency across 17 diverse low-level control tasks.

AI · Bullish · arXiv – CS AI · 9h ago · 13

DeepEyesV2: Toward Agentic Multimodal Model

DeepEyesV2 is a new agentic multimodal AI model that combines text and image comprehension with external tools such as code execution and web search. The research introduces a two-stage training pipeline and the RealX-Bench evaluation framework, demonstrating improved real-world reasoning capabilities through adaptive tool invocation.

AI · Bullish · arXiv – CS AI · 9h ago · 5

RUMAD: Reinforcement-Unifying Multi-Agent Debate

Researchers introduce RUMAD, a reinforcement learning framework that optimizes multi-agent AI debate systems by dynamically controlling communication topology. The system achieves over 80% reduction in computational costs while improving reasoning accuracy across benchmark tests, with strong generalization capabilities across different task domains.

AI · Bullish · arXiv – CS AI · 9h ago · 7

SocialNav: Training Human-Inspired Foundation Model for Socially-Aware Embodied Navigation

Researchers developed SocialNav, a foundation model for socially-aware robot navigation that uses a hierarchical architecture to understand social norms and generate compliant movement paths. The model was trained on 7 million samples and achieved a 38% higher success rate and 46% better social compliance than existing methods.

AI · Neutral · arXiv – CS AI · 9h ago · 1

Heterogeneous Multi-Agent Reinforcement Learning with Attention for Cooperative and Scalable Feature Transformation

Researchers propose a new multi-agent reinforcement learning framework that uses three cooperative agents with attention mechanisms to automate feature transformation for machine learning models. The approach addresses key limitations in existing automated feature engineering methods, including dynamic feature expansion instability and insufficient agent cooperation.

AI · Neutral · arXiv – CS AI · 9h ago · 1

Offline-to-Online Multi-Agent Reinforcement Learning with Offline Value Function Memory and Sequential Exploration

Researchers propose OVMSE, a new framework for Offline-to-Online Multi-Agent Reinforcement Learning that addresses key challenges in transitioning from offline training to online fine-tuning. The framework introduces Offline Value Function Memory and Sequential Exploration strategies to improve sample efficiency and performance in multi-agent environments.
