#offline-rl News & Analysis

24 articles tagged with #offline-rl. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

🧠

Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

Researchers propose QGF (Q-Guided Flow), a reinforcement learning algorithm that optimizes policies entirely at test time using value gradients to guide pre-trained flow models, avoiding the training instability issues of traditional actor-critic approaches while maintaining competitive performance on offline RL benchmarks.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

🧠

Yes, Q-learning Helps Offline In-Context RL

Researchers demonstrate that integrating reinforcement learning objectives into offline in-context RL frameworks significantly outperforms supervised learning approaches like Algorithm Distillation, achieving ~30% performance improvements across diverse environments and doubling performance in complex settings. The findings validate that aligning ICRL training with RL reward-maximization goals, particularly through conservative value learning, produces more effective agents.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 6

🧠

Decision MetaMamba: Enhancing Selective SSM in Offline RL with Heterogeneous Sequence Mixing

Researchers propose Decision MetaMamba (DMM), a new AI model architecture that improves offline reinforcement learning by addressing information loss issues in Mamba-based models. The solution uses a dense layer-based sequence mixer and modified positional structure to achieve state-of-the-art performance with fewer parameters.

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

🧠

Offline Multi-agent Continual Cooperation via Skill Partition and Reuse

Researchers introduce COMAD, a framework for multi-agent reinforcement learning systems to continually discover and reuse coordination skills from offline data without catastrophic forgetting. The approach uses skill partitioning and density-based reusability estimation to enable agents to efficiently transfer knowledge across sequential tasks in open environments.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

🧠

Improving Generalization and Data Efficiency with Diffusion in Offline Multi-agent RL

Researchers introduce DOM2, a diffusion-based offline multi-agent reinforcement learning algorithm that significantly improves policy expressiveness and generalization. The method achieves 20x better data efficiency and superior performance across standard benchmarks while maintaining robustness to environment shifts.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

🧠

Generalizing Beyond Suboptimality: Offline Reinforcement Learning Learns Effective Scheduling through Random Solutions

Researchers introduce CDQAC, an offline reinforcement learning algorithm that learns effective job scheduling policies from static, suboptimal datasets rather than requiring extensive online training interactions. The work shows that scheduling performance depends primarily on state-action coverage rather than trajectory quality, so the algorithm can learn effectively even from simple random heuristics while requiring only 1-5% of the original dataset size.

AI · Bullish · arXiv – CS AI · Jun 10 · 6/10

🧠

Fast and Highly Expressive Policy Learning for Offline Reinforcement Learning via Bootstrapped Flow Q-Learning

Researchers introduce Bootstrapped Flow Q-Learning (BFQ), a new offline reinforcement learning method that achieves single-step action generation without multi-step denoising, improving computational efficiency and performance over existing diffusion-based approaches. The framework eliminates auxiliary networks and distillation procedures while maintaining high expressiveness, demonstrated through D4RL benchmark evaluations.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

🧠

On the Complexity of Offline Reinforcement Learning with $Q^\star$-Approximation and Partial Coverage

Researchers present a theoretical framework for offline reinforcement learning that answers a fundamental open question negatively: Q*-realizability and Bellman completeness alone are insufficient for sample-efficient learning under partial coverage. The work introduces a decision-estimation framework that improves sample complexity bounds for practical algorithms like Conservative Q-Learning and extends theoretical understanding to previously unexplored settings.

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

🧠

Drift Q-Learning

Researchers propose DriftQL, a new offline reinforcement learning method that combines drift-based behavioral regularization with critic-driven policy improvement to outperform diffusion and flow-based policies. The approach achieves single forward-pass inference while maintaining robustness under degraded data quality, advancing state-of-the-art performance on standard benchmarks.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

🧠

Behavior-Invariant Task Representation Learning with Transformer-based World Models for Offline Meta-Reinforcement Learning

Researchers propose a novel offline meta-reinforcement learning framework combining information-theoretic task representation learning with Transformer-based world models to address distribution shifts in sparse-reward environments. The approach extracts behavior-invariant task representations and applies conservative value penalties to prevent model exploitation, demonstrating improved generalization over existing methods.

AI · Bullish · arXiv – CS AI · May 29 · 6/10

🧠

BORA: Bridging Offline Reinforcement Learning and Online Residual Adaptation for Real-World Dexterous VLA Models

Researchers introduce BORA, an offline-to-online reinforcement learning framework that enables Vision-Language-Action (VLA) models to perform complex dexterous robotic manipulation tasks more reliably in real-world settings. The method combines offline critic training with lightweight online adaptation, achieving a 33% improvement in success rates over traditional imitation learning approaches.

AI · Bullish · arXiv – CS AI · May 28 · 6/10

🧠

Efficient Post-training of LLMs for Code Generation With Offline Reinforcement Learning

Researchers demonstrate that offline reinforcement learning can effectively improve code-generating LLMs by leveraging existing datasets, eliminating the computational overhead of online RL while delivering comparable or superior performance, particularly for smaller models and complex coding tasks.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

🧠

Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning

DOSER introduces a diffusion-model-based framework for offline reinforcement learning that improves out-of-distribution (OOD) action detection beyond traditional penalization methods. The approach uses single-step denoising reconstruction error to identify risky actions while selectively encouraging beneficial exploration, with theoretical guarantees of convergence and empirical superiority on suboptimal datasets.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

🧠

Path-Coupled Bellman Flows for Distributional Reinforcement Learning

Researchers propose Path-Coupled Bellman Flows (PCBF), a novel distributional reinforcement learning method that addresses limitations in existing flow-based approaches by using source-consistent paths and shared noise coupling to improve training stability and return distribution fidelity. The approach demonstrates competitive performance on benchmark tasks while maintaining computational efficiency through variance-reduction techniques.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

🧠

Learning Visual Feature-Based World Models via Residual Latent Action

Researchers introduce Residual Latent Action (RLA), a new latent action representation learned from DINO visual features, enabling more efficient and accurate world models that predict future visual features rather than raw pixels. RLA-WM outperforms existing feature-based and video-diffusion approaches while being orders of magnitude faster, with applications in robot learning from offline video demonstrations.

AI · Neutral · arXiv – CS AI · May 1 · 6/10

🧠

EXPO: Stable Reinforcement Learning with Expressive Policies

Researchers introduce EXPO, a reinforcement learning algorithm that trains expressive policies (like diffusion models) more efficiently by avoiding direct value optimization. The method uses a lightweight Gaussian policy to edit actions from a base policy, achieving 2-3x improvements in sample efficiency for both offline-to-online and fine-tuning scenarios.

AI · Neutral · arXiv – CS AI · Mar 5 · 5/10

🧠

IPD: Boosting Sequential Policy with Imaginary Planning Distillation in Offline Reinforcement Learning

Researchers propose Imaginary Planning Distillation (IPD), a novel framework that enhances offline reinforcement learning by incorporating planning into sequential policy models. IPD uses world models and Model Predictive Control to generate optimal rollouts, training Transformer-based policies that significantly outperform existing methods on D4RL benchmarks.

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 16

🧠

SMAC: Score-Matched Actor-Critics for Robust Offline-to-Online Transfer

Researchers developed Score Matched Actor-Critic (SMAC), a new offline reinforcement learning method that enables a smooth transition to online RL algorithms without performance drops. SMAC achieved successful transfer in all 6 D4RL tasks tested and reduced regret by 34-58% in 4 of 6 environments compared to the best baselines.

AI · Neutral · arXiv – CS AI · Mar 17 · 4/10

🧠

Chunk-Guided Q-Learning

Researchers introduce Chunk-Guided Q-Learning (CGQ), a new offline reinforcement learning algorithm that combines single-step and multi-step temporal difference learning approaches. The method achieves better performance on long-horizon tasks by reducing error accumulation while maintaining fine-grained value propagation, with theoretical guarantees and empirical validation on OGBench tasks.

AI · Neutral · arXiv – CS AI · Mar 5 · 4/10

🧠

Selecting Offline Reinforcement Learning Algorithms for Stochastic Network Control

Researchers evaluate offline reinforcement learning algorithms for wireless network control, finding that Conservative Q-Learning produces more robust policies under stochastic conditions than sequence-based methods. The study provides practical guidance for AI-driven network management in O-RAN and 6G systems where online exploration is unsafe.

AI · Neutral · arXiv – CS AI · Mar 4 · 4/10 · 3

🧠

Improving Diffusion Planners by Self-Supervised Action Gating with Energies

Researchers propose SAGE (Self-supervised Action Gating with Energies), a new method to improve diffusion planners in offline reinforcement learning by filtering out dynamically inconsistent trajectories. The approach uses a latent consistency signal to re-rank candidate actions at inference time, improving performance across locomotion, navigation, and manipulation tasks without requiring environment rollouts or policy retraining.

AI · Neutral · arXiv – CS AI · Mar 3 · 4/10 · 2

🧠

Return Augmented Decision Transformer for Off-Dynamics Reinforcement Learning

Researchers introduce the Return Augmented (REAG) method for Decision Transformer frameworks to improve offline reinforcement learning when training data comes from different dynamics than the target domain. The method aligns return distributions between source and target domains, with theoretical analysis showing it achieves optimal performance levels despite dynamics shifts.

AI · Neutral · arXiv – CS AI · Mar 2 · 4/10 · 6

🧠

Pessimistic Auxiliary Policy for Offline Reinforcement Learning

Researchers developed a new pessimistic auxiliary policy for offline reinforcement learning that reduces error accumulation by sampling more reliable actions. The approach maximizes the lower confidence bound of Q-functions to avoid high-value actions with potentially high errors during training.

AI · Neutral · arXiv – CS AI · Mar 2 · 4/10 · 5

🧠

Beyond State-Wise Mirror Descent: Offline Policy Optimization with Parameteric Policies

Researchers present theoretical advances in offline reinforcement learning that extend beyond current limitations to work with parameterized policies over large or continuous action spaces. The work connects mirror descent to natural policy gradient methods and reveals a surprising unification between offline RL and imitation learning.