Researchers introduce Trust Region Q-Adjoint Matching (TRQAM), a reinforcement learning algorithm that stabilizes off-policy fine-tuning of pretrained flow policies by adaptively controlling deviation through trust-region constraints. The method achieves a 68% success rate on offline RL tasks, compared to 46% for previous approaches.
TRQAM addresses a critical bottleneck in modern reinforcement learning: the instability that emerges when pretrained models are optimized through off-policy learning. The underlying problem stems from critic-guided improvement methods, where small errors in value estimation compound through multi-step sampling and can cause catastrophic model collapse. The research builds on prior work, Q-Adjoint Matching (QAM), which reformulated the problem as stochastic optimal control, and it identifies and solves a fundamental vulnerability in that approach.
The innovation centers on trust-region constraints that dynamically limit how far the learned policy can deviate from its pretrained baseline. By deriving a closed-form relationship between the trust-region parameter and path-space KL divergence, the authors enable precise control over policy drift. This mathematical insight transforms a previously heuristic parameter into a measurable, controllable quantity.
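To make the trust-region idea concrete, here is a minimal toy sketch, not the TRQAM algorithm itself. It assumes a discrete three-action policy rather than a flow policy, uses ordinary KL divergence in place of the path-space KL discussed above, and every name in it (`tilted_policy`, `trust_region_update`, `epsilon`, `lam`) is hypothetical. It illustrates a generic projected dual descent: a multiplier `lam` is raised when the policy drifts past its KL budget, lowered otherwise, and clipped so it stays positive.

```python
import math


def tilted_policy(prior, q_values, lam):
    """Policy proportional to prior * exp(q / lam).

    This maximizes E_pi[q] - lam * KL(pi || prior); prior entries must be > 0.
    """
    logits = [math.log(p) + q / lam for p, q in zip(prior, q_values)]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions, skipping zero-mass entries of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def trust_region_update(prior, q_values, epsilon, lam=1.0, step=0.5,
                        lam_min=1e-3, iters=500):
    """Toy projected dual descent for: max E_pi[q] s.t. KL(pi || prior) <= epsilon."""
    for _ in range(iters):
        policy = tilted_policy(prior, q_values, lam)
        violation = kl_divergence(policy, prior) - epsilon
        # The dual gradient w.r.t. lam is (epsilon - KL), so a descent step
        # adds step * violation; the max() projects back onto lam >= lam_min.
        lam = max(lam_min, lam + step * violation)
    policy = tilted_policy(prior, q_values, lam)
    return policy, lam, kl_divergence(policy, prior)


# Illustrative numbers only: a "pretrained" prior and made-up critic values.
prior = [0.7, 0.2, 0.1]
q_values = [0.0, 1.0, 2.0]
policy, lam, kl = trust_region_update(prior, q_values, epsilon=0.1)
print("policy:", policy)
print("lam:", lam, "KL to prior:", kl)
```

In this toy setting, a tighter `epsilon` pushes the loop toward a larger `lam`, which keeps the updated policy closer to the prior; a looser budget lets the critic values pull it further away.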
The empirical validation is substantial: across 50 OGBench tasks, TRQAM achieves 68% success in purely offline RL settings and shows marked improvements in offline-to-online scenarios, where agents transition from batch data to live interaction. This 22-percentage-point improvement over the previous state of the art (46%) represents meaningful progress in a challenging domain.
For practitioners developing AI systems that leverage pretrained models, this addresses a real operational concern: how to safely and effectively adapt foundation models to new objectives without degradation. The theoretical grounding combined with empirical validation suggests the approach could influence how reinforcement learning practitioners design model-adaptation pipelines, particularly in domains requiring offline learning from fixed datasets.
- TRQAM achieves 68% success rate in offline RL, substantially outperforming the previous best baseline of 46%
- The method derives a closed-form relationship between trust-region parameters and policy deviation, enabling precise control
- Trust-region constraints prevent critic-guided collapse by adaptively limiting deviation from pretrained policies
- The algorithm demonstrates consistent improvements across 50 OGBench tasks in both offline and offline-to-online settings
- Mathematical formulation as projected dual descent provides theoretical guarantees for stable off-policy fine-tuning