AINeutralarXiv – CS AI · 15h ago6/10
🧠
Completion vs Optimality: Policy Gradient in Long-Horizon Cumulative-Damage Problems
Researchers identify critical failure modes in policy-gradient reinforcement learning methods when applied to long-horizon problems with cumulative damage, where short-term attractive actions lead to long-term negative outcomes. The study proposes a decomposition framework separating completion (reaching terminal horizon) from optimality (achieving dynamic-programming benchmarks) and validates predictions across two distinct domains: career planning and sports performance.