TD-Grokking: Learning from Zero-Reward Problems by Training-Time Decomposition
Researchers introduce TD-Grokking, a training-time decomposition framework that enables large language models to learn from zero-reward problems by recursively breaking down unsolvable tasks into verifiable subproblems. This addresses a critical limitation in reinforcement learning with verifiable rewards (RLVR), where models typically fail to improve on challenging problems that produce uniform failure outcomes.
TD-Grokking tackles a fundamental challenge in model training: a policy cannot improve on problems where every attempted solution fails. In standard reinforcement learning with verifiable rewards, the update depends on differences in reward between sampled solutions. When all samples for a problem score zero, group-relative methods such as GRPO assign every sample zero advantage, so the problem contributes no gradient. This bottleneck has constrained progress on complex reasoning tasks despite advances in large language models.
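To make the vanishing signal concrete, here is a minimal sketch of group-relative advantage normalization in the style commonly used with GRPO, where each reward is centered on the group mean and scaled by the group standard deviation. The function name `group_relative_advantages` and the `eps` constant are illustrative assumptions, not taken from the TD-Grokking code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    `eps` is an assumed small constant that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every sampled solution fails: all advantages are zero, so there is no update.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# One sample succeeds: advantages differ in sign, so the policy gets a signal.
mixed = group_relative_advantages([0.0, 1.0, 0.0, 0.0])

print(all_fail)
print(mixed)
```

With uniform failures every entry of `all_fail` is zero, while in `mixed` the one success gets a positive advantage and the failures get negative ones.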
The framework's innovation lies in recursive problem decomposition. Rather than treating intractable problems as monolithic challenges, TD-Grokking breaks them into hierarchical trees of self-contained subproblems. The leaves of these trees are verifiable and solvable, generating non-zero reward signals that propagate through the training process. In this way, a problem that yields no training signal as a whole still contributes signal through its solvable parts.
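The tree structure described above can be sketched as a small data structure. This is an illustration under stated assumptions, not the authors' implementation: the `Subproblem` class, the `leaves` and `leaf_rewards` helpers, the hand-written decomposition, and the toy solver are all hypothetical. The summary does not say how trees are generated or how leaf rewards propagate during training.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Subproblem:
    """One node in a hypothetical decomposition tree."""
    statement: str
    # Assumed verifier: returns True when a candidate answer is correct.
    verifier: Optional[Callable[[str], bool]] = None
    children: List["Subproblem"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def leaves(node: Subproblem) -> List[Subproblem]:
    """Collect leaf subproblems in depth-first, left-to-right order."""
    if node.is_leaf():
        return [node]
    found: List[Subproblem] = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def leaf_rewards(root: Subproblem,
                 solve: Callable[[str], str]) -> List[Tuple[str, float]]:
    """Score each verifiable leaf with a binary reward (1.0 or 0.0)."""
    scored = []
    for leaf in leaves(root):
        if leaf.verifier is not None:
            reward = 1.0 if leaf.verifier(solve(leaf.statement)) else 0.0
            scored.append((leaf.statement, reward))
    return scored


# Hand-written toy decomposition; a real system would generate this tree.
root = Subproblem(
    statement="Compute (3 + 4) * (10 - 4)",
    children=[
        Subproblem("Compute 3 + 4", verifier=lambda a: a.strip() == "7"),
        Subproblem("Compute 10 - 4", verifier=lambda a: a.strip() == "6"),
    ],
)

# Toy stand-in for a model: it gets one leaf right and one wrong.
toy_answers = {"Compute 3 + 4": "7", "Compute 10 - 4": "5"}
rewards = leaf_rewards(root, lambda s: toy_answers.get(s, ""))
print(rewards)
```

Even if the root problem is never solved, the leaves produce a mix of successes and failures here, which is the kind of non-uniform reward that a policy-gradient update can use.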
Empirical results on mathematical and medical reasoning tasks demonstrate consistent improvements over baseline approaches, including vanilla GRPO. The method's effectiveness suggests that problem structure can be systematically exploited during training to overcome apparent learning impasses. This has implications for training more capable AI systems on genuinely difficult reasoning tasks where traditional reward signals fail.
For the AI research community, this work signals progress toward more robust training paradigms for complex reasoning. The approach may enable models to tackle previously intractable problem classes, particularly in domains requiring multi-step logical inference. However, the practical applicability depends on the feasibility of meaningful decomposition across diverse problem types.
- TD-Grokking decomposes zero-reward problems into hierarchical trees of verifiable subproblems to generate usable training signals
- The framework outperforms vanilla GRPO and existing baseline methods on mathematical and medical reasoning benchmarks
- Recursive decomposition addresses a critical bottleneck where models cannot improve on tasks producing uniform failure outcomes
- The approach enables reinforcement learning to function effectively on complex reasoning tasks previously considered intractable
- Code and datasets are publicly available, supporting reproducibility and broader adoption in the research community