y0news

#deep-rl News & Analysis

11 articles tagged with #deep-rl. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10
🧠

Self-Paced Curriculum Reinforcement Learning for Autonomous Superbike Racing in Simulation

Researchers have developed a self-paced curriculum reinforcement learning framework for training autonomous agents to race superbikes in a physics-accurate simulator, combining the Soft Actor-Critic (SAC) algorithm with dynamic task progression. The approach achieves better training efficiency and performance than traditional RL methods and establishes a new baseline for two-wheeled autonomous racing, where balance and lean dynamics make control considerably harder than for four-wheeled vehicles.
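As a rough illustration of the self-paced idea, the sketch below advances to a harder task level only once the agent's recent success rate clears a threshold. The class name, the threshold, the window size, and the level names are all hypothetical; the paper's actual progression rule and its SAC training loop are not reproduced here.

```python
from collections import deque


class SelfPacedCurriculum:
    """Hypothetical scheduler: moves to the next task level once the
    recent success rate reaches a threshold (not the paper's exact rule)."""

    def __init__(self, levels, threshold=0.8, window=20):
        self.levels = list(levels)          # ordered from easiest to hardest
        self.threshold = threshold          # success rate required to advance
        self.window = window                # number of recent episodes considered
        self.index = 0
        self.recent = deque(maxlen=window)  # rolling record of episode outcomes

    @property
    def current(self):
        return self.levels[self.index]

    def report(self, success):
        """Record one episode outcome and return the level to train on next."""
        self.recent.append(bool(success))
        window_full = len(self.recent) == self.window
        rate = sum(self.recent) / len(self.recent)
        if window_full and rate >= self.threshold and self.index < len(self.levels) - 1:
            self.index += 1
            self.recent.clear()             # re-measure performance on the new level
        return self.current
```

Clearing the window after each promotion means the agent must demonstrate competence again on the harder level before it is promoted further.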

AI · Bullish · arXiv – CS AI · Jun 5 · 6/10
🧠

Reflex: Reinforcement Learning with Reflection Symmetry Exploitation in State-Based Continuous Control

Researchers introduce Reflex, a reinforcement learning framework that exploits reflection symmetry in state-based continuous control tasks to improve sample efficiency. The method integrates with both on-policy (PPO) and off-policy (SAC) algorithms and outperforms baseline approaches on standard benchmarks.

🏢 OpenAI · 🏢 Google
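One common way to exploit a reflection symmetry is to add mirrored copies of each transition to the training batch. The sketch below assumes the reflection acts as an elementwise sign flip on states and actions and that rewards are unchanged by it; the function and argument names are hypothetical, and this is not necessarily how Reflex itself integrates symmetry into PPO or SAC.

```python
import numpy as np


def reflect_transitions(states, actions, rewards, next_states, state_sign, action_sign):
    """Return the batch with mirrored copies of every transition appended.

    Assumes (hypothetically) that the reflection is s -> state_sign * s and
    a -> action_sign * a, and that the reward is invariant under it.
    """
    mirrored_states = states * state_sign        # broadcast sign flip over the batch
    mirrored_actions = actions * action_sign
    mirrored_next = next_states * state_sign
    return (
        np.concatenate([states, mirrored_states]),
        np.concatenate([actions, mirrored_actions]),
        np.concatenate([rewards, rewards]),      # reward unchanged by mirroring
        np.concatenate([next_states, mirrored_next]),
    )
```

Doubling the batch this way gives the learner symmetric experience it never had to collect, which is the intuition behind the sample-efficiency gain.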
AI · Neutral · arXiv – CS AI · Jun 4 · 6/10
🧠

Trace-Mediated Peak Bias: Bridging Temporal Credit Assignment and Cognitive Heuristics in Deep Reinforcement Learning

Researchers identify Trace-Mediated Peak Bias (TMPB), a systematic failure in deep reinforcement learning where agents irrationally prioritize high-magnitude reward spikes over trajectories with greater cumulative returns. This phenomenon mirrors the human Peak-End Rule cognitive bias and reveals how mathematical constraints in credit assignment systems naturally produce human-like value distortions, with adaptive optimizers offering a potential solution.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10
🧠

Scenario Generation for Risk-Aware Reinforcement Learning with Probably Approximately Safe Guarantees

Researchers propose a method for giving reinforcement learning agents probabilistic safety guarantees, using variational autoencoders and dual optimization to construct probabilistic barrier certificates that separate safe from unsafe regions of behavior. The approach tightens these safety bounds by targeting unexplored regions of the state space during training, supporting the deployment of RL systems with verifiable, probably approximately safe guarantees.
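For intuition on how sampled scenarios can back a "probably approximately safe" claim, here is the textbook zero-failure bound, not the paper's certificate: if N independent scenarios all turn out safe, then with confidence at least 1 - β the true probability of an unsafe outcome is at most ε whenever (1 - ε)^N ≤ β. If the violation probability actually exceeded ε, observing N safe samples in a row would have probability below (1 - ε)^N ≤ β. The helper name is hypothetical.

```python
import math


def scenarios_needed(epsilon, beta):
    """Smallest N such that (1 - epsilon) ** N <= beta.

    With N i.i.d. scenarios and zero observed violations, the violation
    probability is at most epsilon with confidence at least 1 - beta.
    """
    if not (0.0 < epsilon < 1.0 and 0.0 < beta < 1.0):
        raise ValueError("epsilon and beta must lie in (0, 1)")
    return math.ceil(math.log(beta) / math.log(1.0 - epsilon))
```

For example, certifying a 1% violation rate at 99.9% confidence takes 688 safe scenarios under this bound; targeting unexplored regions, as the summary describes, is one way to make each scenario more informative than plain i.i.d. sampling.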

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10
🧠

Simplicial Embeddings Improve Sample Efficiency in Actor-Critic Agents

Researchers propose simplicial embeddings, a lightweight geometric technique that constrains neural network representations to discrete, sparse structures, improving sample efficiency in reinforcement learning agents. When integrated into popular actor-critic algorithms like PPO and FastTD3, the method enhances performance and learning speed across diverse control tasks without sacrificing computational speed.
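Simplicial embeddings are commonly implemented by splitting a feature vector into groups and applying a softmax within each group, so that every group lies on a probability simplex. The NumPy sketch below shows that transformation in isolation; the group count, the temperature, and where the layer sits in an actor-critic network are assumptions, not details taken from the paper.

```python
import numpy as np


def simplicial_embedding(z, num_groups, temperature=1.0):
    """Split each row of z into num_groups chunks and softmax each chunk,
    so every chunk is non-negative and sums to 1 (a point on a simplex)."""
    batch, dim = z.shape
    if dim % num_groups != 0:
        raise ValueError("feature dimension must be divisible by num_groups")
    groups = z.reshape(batch, num_groups, dim // num_groups) / temperature
    groups = groups - groups.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(groups)
    simplex = exp / exp.sum(axis=-1, keepdims=True)       # per-group softmax
    return simplex.reshape(batch, dim)
```

Lowering `temperature` pushes each group toward a one-hot vector, which is where the sparse, near-discrete structure comes from; the operation adds only a reshape and a softmax, consistent with the negligible runtime cost the summary mentions.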

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10
🧠

Unifying Model-Free Efficiency and Model-Based Representations via Latent Dynamics

Researchers introduce Unified Latent Dynamics (ULD), a reinforcement learning algorithm that combines the efficiency of model-free methods with the representational advantages of model-based approaches, without the overhead of planning. The method achieves competitive performance across 80 diverse environments, including continuous control, visual tasks, and Atari games, with minimal hyperparameter tuning.

🏢 Google
AI · Neutral · arXiv – CS AI · May 28 · 6/10
🧠

Diffusion-Augmented Markov Decision Processes for Maximum Entropy Reinforcement Learning

Researchers have developed Diffusion-Augmented Markov Decision Processes (DA-MDPs), a framework that integrates diffusion models into maximum entropy reinforcement learning to sample from the trajectory distribution of the optimal policy. The approach is evaluated with three RL algorithms (PPO, WPO, REPPO) and shows competitive or superior performance on continuous-control tasks, with particular strength in modeling multimodal action distributions.

AI · Neutral · arXiv – CS AI · May 27 · 6/10
🧠

Not All Transitions Matter: Evidence from PPO

Researchers propose a simple technique for stabilizing PPO training: randomly dropping 25% of the transitions collected during rollouts. Dropping transitions removes gradient redundancy caused by causally dependent sequences of states, improving training consistency across multiple environments without any other changes to the algorithm.
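A minimal sketch of the dropping step, assuming advantages and returns have already been computed on the full rollout (so GAE still sees contiguous trajectories) and that the drop happens just before the PPO minibatch updates. The function name and the dictionary-of-arrays batch layout are hypothetical; the paper may apply the drop at a different point in the pipeline.

```python
import numpy as np


def drop_transitions(batch, drop_rate=0.25, rng=None):
    """Return a new batch with a random drop_rate fraction of transitions removed.

    batch maps field names (e.g. obs, actions, advantages) to arrays whose
    first axis indexes transitions; all fields are subsampled consistently.
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must lie in [0, 1)")
    rng = np.random.default_rng() if rng is None else rng
    n = len(next(iter(batch.values())))
    n_keep = n - int(drop_rate * n)
    # Sample indices without replacement, then sort to preserve rollout order.
    keep = np.sort(rng.choice(n, size=n_keep, replace=False))
    return {key: value[keep] for key, value in batch.items()}
```

Using the same index set for every field keeps observations, actions, and advantages aligned after the drop.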

AI · Bullish · arXiv – CS AI · May 11 · 6/10
🧠

Revisiting Adam for Streaming Reinforcement Learning

Researchers challenge the conventional wisdom that deep reinforcement learning requires replay buffers by demonstrating that classical update methods like C51 perform competitively in streaming online settings when paired with proper optimization techniques. The study identifies two critical properties—bounded objective derivatives and variance-adjusted weight updates—as essential for stable learning, leading to a new algorithm called Adaptive Q(λ) that substantially outperforms existing streaming approaches.

AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠

AdaGamma: State-Dependent Discounting for Temporal Adaptation in Reinforcement Learning

AdaGamma introduces a state-dependent discount factor for deep reinforcement learning that learns to adjust discounting dynamically across states, addressing the instability of prior approaches through a return-consistency regularization objective. The method yields empirical improvements when integrated into popular algorithms such as SAC and PPO, and its gains were validated in a real-world logistics deployment.
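To make state-dependent discounting concrete, the sketch below computes discounted returns where each step carries its own discount γ_t instead of a single global γ. How AdaGamma actually learns γ_t, and its return-consistency regularizer, are not reproduced: here the discounts are simply passed in as an array, and applying the current state's discount (rather than the next state's) is an assumption.

```python
import numpy as np


def returns_with_state_discount(rewards, gammas, dones, bootstrap_value=0.0):
    """Compute G_t = r_t + gamma_t * (1 - done_t) * G_{t+1} backwards in time,
    where gammas[t] is the (hypothetically learned) discount for state t."""
    returns = np.zeros(len(rewards))
    g = bootstrap_value                  # value estimate beyond the last step
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gammas[t] * (1.0 - dones[t]) * g
        returns[t] = g
    return returns
```

With every entry of `gammas` equal, this reduces to the ordinary discounted return, so a state-dependent discount strictly generalizes the fixed-γ case.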

AI · Neutral · arXiv – CS AI · May 7 · 6/10
🧠

Extending Differential Temporal Difference Methods for Episodic Problems

Researchers propose a generalization of differential temporal difference (TD) methods that extends their applicability from infinite-horizon to episodic reinforcement learning problems. By addressing how reward centering affects policy optimization in episodic settings, the work maintains theoretical guarantees while empirically demonstrating improved sample efficiency across multiple algorithms and environments.
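For background, the sketch below is standard tabular differential TD(0) with reward centering for a continuing task: the TD error subtracts a running average-reward estimate r̄, which is itself updated from that TD error. The episodic generalization proposed in the paper is not shown, and the function name and step sizes are illustrative.

```python
import numpy as np


def differential_td0(transitions, num_states, alpha=0.1, eta=1.0):
    """Tabular differential TD(0) with reward centering.

    transitions: iterable of (state, reward, next_state) from a continuing task.
    Returns the differential value table and the average-reward estimate.
    """
    v = np.zeros(num_states)
    r_bar = 0.0                                   # running average-reward estimate
    for s, r, s_next in transitions:
        delta = r - r_bar + v[s_next] - v[s]      # centered TD error
        v[s] += alpha * delta
        r_bar += eta * alpha * delta              # average reward updated from the same TD error
    return v, r_bar
```

On a two-state cycle that alternates rewards of 1 and 0, r̄ converges to the true average reward of 0.5 and the value difference between the two states converges to 0.5.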