Behavior-Induced Mirror-Prox Temporal-Difference Learning for Faster Off-Policy Prediction
Researchers propose STHTD-MP, a new machine learning algorithm that improves off-policy prediction by using behavior-policy information to optimize the geometry of gradient temporal-difference methods. The method demonstrates faster convergence than existing approaches like GTD2-MP under certain conditions, with theoretical guarantees and empirical validation on standard benchmarks.
This paper addresses a fundamental challenge in reinforcement learning: improving the convergence speed of off-policy prediction algorithms. Gradient temporal-difference methods have long been valued for their stability with linear function approximation, but their practical performance hinges on the geometric properties of the auxiliary-variable metric used in their formulation. The proposed STHTD-MP algorithm represents an incremental but meaningful advance by leveraging behavior-policy transition information to inform the update geometry, rather than relying solely on feature covariance metrics as prior Mirror-Prox methods do.
The research builds on established work in hybrid temporal-difference methods, which have suggested that behavior-policy information can guide learning more effectively. By incorporating this insight into the Mirror-Prox framework, the authors create a unified approach that maintains computational efficiency while potentially accelerating convergence. The theoretical contribution is substantial: the paper provides rigorous convergence analysis under standard assumptions, derives ergodic gap bounds, and offers exact comparisons with GTD2-MP based on spectral properties of the error matrix.
The empirical validation on canonical benchmarks—two-state problems, Random Walk, and Boyan Chain—confirms that STHTD-MP achieves smaller contraction factors than GTD2-MP when the behavior-induced metric improves saddle-point geometry. Importantly, the authors identify Baird's counterexample as a boundary case where the method's assumptions break down, demonstrating scientific rigor and boundary awareness. For the reinforcement learning and AI optimization community, this work provides a technically sound alternative with clear conditions for performance improvements, enabling practitioners to better tune their algorithm selection based on problem structure rather than relying on generic defaults.
- →STHTD-MP replaces feature covariance metrics with behavior-policy Bellman matrix information to improve convergence geometry in off-policy prediction.
- →Theoretical analysis guarantees convergence under standard assumptions and provides exact mean-operator comparisons showing potential speedups over GTD2-MP.
- →Empirical validation on benchmark problems demonstrates faster convergence when behavior-induced metrics improve saddle-point geometry.
- →The method maintains single learning rate simplicity while applying Mirror-Prox prediction-correction, reducing hyperparameter tuning complexity.
- →Baird's counterexample identified as a boundary case reveals specific problem conditions where the strict theoretical assumptions may fail.