Position: Deployed Reinforcement Learning should be Continual
A position paper argues that deployed reinforcement learning systems should adopt continual learning rather than the traditional train-then-fix approach. The authors identify four sources of non-stationarity in deployed environments that require agents to continuously adapt and learn, challenging the current industry paradigm where agents remain static until performance degradation necessitates retraining.
The reinforcement learning community faces a fundamental mismatch between how systems are deployed and how they should operate in dynamic real-world environments. Current practice trains agents in controlled settings, then deploys them unchanged until degraded performance metrics trigger expensive retraining cycles. This position paper challenges that approach, arguing that any deployment in which the agent cannot achieve optimal performance, yet still receives evaluative feedback, is inherently a continual learning problem.
The core insight is that real-world deployments encounter persistent non-stationarity beyond what training captures. The four identified sources of non-stationarity (environmental shifts, user behavior changes, system degradation, and goal evolution) mean deployed agents inevitably face distribution shifts they never encountered during training. Traditional approaches treat this as a failure state requiring intervention rather than as an expected operational condition.
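To make the contrast concrete, here is a minimal sketch, not taken from the paper, of a two-armed bandit whose best arm changes partway through deployment. The function name `run_bandit`, the reward means, and all hyperparameters are illustrative assumptions. The frozen agent stops updating at the shift (train-then-fix), while the continual agent keeps learning from the evaluative feedback it receives.

```python
import random


def run_bandit(continual, steps=2000, shift_at=1000, alpha=0.1, epsilon=0.1, seed=0):
    """Return the average reward collected after the shift (hypothetical toy setup)."""
    rng = random.Random(seed)
    means = [1.0, 0.0]  # arm 0 is the better arm before the shift
    q = [0.0, 0.0]      # action-value estimates
    post_shift_total = 0.0
    for t in range(steps):
        if t == shift_at:
            means = [0.0, 1.0]  # non-stationarity: the better arm changes
        # A frozen agent learns only before the shift; a continual agent never stops.
        learning = continual or t < shift_at
        if learning and rng.random() < epsilon:
            arm = rng.randrange(2)           # explore
        else:
            arm = 0 if q[0] >= q[1] else 1   # act greedily
        reward = rng.gauss(means[arm], 0.1)
        if learning:
            # A constant step size keeps the estimate responsive to recent rewards.
            q[arm] += alpha * (reward - q[arm])
        if t >= shift_at:
            post_shift_total += reward
    return post_shift_total / (steps - shift_at)


frozen = run_bandit(continual=False)
adaptive = run_bandit(continual=True)
print(f"frozen: {frozen:.2f}, continual: {adaptive:.2f}")
```

In this toy setting the frozen agent keeps pulling the arm that used to be best, so its post-shift reward stays near zero, while the continual agent recovers after a short adjustment period. Real deployments are far messier, but the same mechanism is what the argument relies on.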
For practitioners building AI systems, this framing suggests potential operational and economic advantages. Continual learning architectures could reduce reliance on expensive retraining pipelines, enable faster adaptation to market conditions or user preferences, and make systems more robust. Organizations currently managing complex train-then-fix cycles could benefit from systems that adapt gracefully to changing conditions in near real time.
The paper's emphasis on analyzing successful real-world continual RL implementations provides evidence that mature deployment strategies already incorporate adaptive learning. Moving forward, the community should develop better techniques for stability, performance monitoring, and controlled learning during deployment—transforming what many treat as a problem into a fundamental design principle.
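One way to picture performance monitoring during deployment is a rolling-window check on observed reward. The sketch below is an assumption for illustration only: the class `RewardMonitor`, its window size, and its tolerance are hypothetical and are not a technique described in the paper.

```python
from collections import deque


class RewardMonitor:
    """Rolling-window reward monitor (hypothetical helper, not from the paper)."""

    def __init__(self, window=50, tolerance=0.2):
        self.rewards = deque(maxlen=window)  # keeps only the most recent rewards
        self.baseline = None
        self.tolerance = tolerance

    def record(self, reward):
        self.rewards.append(reward)

    def rolling_mean(self):
        return sum(self.rewards) / len(self.rewards) if self.rewards else 0.0

    def set_baseline(self):
        # Snapshot current performance, e.g. right after a validated update.
        self.baseline = self.rolling_mean()

    def degraded(self):
        # Flag degradation only once the window is full, to avoid noisy early alarms.
        window_full = len(self.rewards) == self.rewards.maxlen
        return (
            self.baseline is not None
            and window_full
            and self.rolling_mean() < self.baseline - self.tolerance
        )
```

A deployment loop could call `record` after every interaction and use `degraded()` as a signal to slow down learning, roll back to a saved policy, or alert an operator. Which response is appropriate depends on the system.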
- Current train-then-fix deployment paradigms are fundamentally misaligned with real-world non-stationary environments where agents should continually adapt.
- Four major sources of non-stationarity after deployment (environment changes, user behavior shifts, system degradation, and goal evolution) necessitate never-ending learning capabilities.
- Continual learning approaches reduce operational costs associated with expensive retraining cycles and enable faster adaptation to changing conditions.
- Successful real-world RL deployments already incorporate adaptive learning mechanisms, suggesting industry recognition of the train-then-fix paradigm's limitations.
- Moving forward requires developing techniques for safe, stable, and monitored learning during production deployment rather than treating adaptation as a failure state.