AI · Neutral · arXiv – CS AI · 14h ago · 5/10
Behavior-Induced Mirror-Prox Temporal-Difference Learning for Faster Off-Policy Prediction
Researchers propose STHTD-MP, a mirror-prox temporal-difference learning algorithm that improves off-policy prediction by using behavior-policy information to shape the geometry of gradient temporal-difference methods. Under certain conditions, the method converges faster than existing approaches such as GTD2-MP, a result supported by theoretical guarantees and empirical validation on standard benchmarks.