AI · Neutral · arXiv – CS AI · 6h ago · 6/10
TD-Grokking: Learning from Zero-Reward Problems by Training-Time Decomposition
Researchers introduce TD-Grokking, a training-time decomposition framework that lets large language models learn from zero-reward problems by recursively breaking tasks they cannot yet solve into verifiable subproblems. It targets a key limitation of reinforcement learning with verifiable rewards (RLVR): models typically fail to improve on hard problems where every attempt fails, because a uniformly zero reward gives them nothing to learn from.
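To see why uniform failure stalls learning, here is a minimal sketch. It assumes a group-mean baseline, as used in some RLVR setups; the summary does not say which estimator the paper uses. The function name `group_advantages` and the reward values are hypothetical, chosen only for illustration.

```python
from statistics import mean


def group_advantages(rewards):
    """Mean-centered advantage for each sampled attempt on one problem.

    Assumption: a simple group-mean baseline, not necessarily the
    estimator used in the TD-Grokking paper.
    """
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# Some attempts succeed: advantages differ, so there is a signal to follow.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])

# Every attempt fails: all advantages are zero, so no update direction exists.
all_fail = group_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(all_fail)
```

Under this assumption, a problem on which every attempt scores zero contributes no gradient at all. That is the gap that decomposing the problem into verifiable subproblems is meant to close, since some of the subproblems may yield non-uniform rewards.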