Emergent Slow Thinking in LLMs as Inverse Tree Freezing
Researchers present a statistical-physics framework explaining how large language models develop multi-step reasoning through reinforcement learning with verifiable rewards (RLVR), modeling the process as inverse tree freezing in a concept network. They propose Annealed-RLVR, a timing-optimized training method that outperforms standard RLVR by applying supervised fine-tuning (SFT) at peak frustration rather than after convergence, preventing policy collapse.
This research bridges machine learning theory and statistical physics to explain an emergent phenomenon in LLM training. The authors model autoregressive generation as compressing an exponentially large prediction space into a Markov network, in which reasoning emerges as a random walk through a directed acyclic graph. The key insight is that RLVR training progresses through distinct phases (nucleation, growth, and freezing) governed by path merging and competitive frustration among incompatible reasoning chains.
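To make this picture concrete, the sketch below simulates reasoning chains as weighted random walks over a toy concept DAG and counts how often distinct chains pass through each node. Everything here (the graph, the node names, the weights, and the merge-counting heuristic) is an illustrative assumption of ours, not the authors' model or code.

```python
import random
from collections import defaultdict

# Toy concept network: nodes are concepts, directed edges are learned transitions.
# The graph, names, and weights are illustrative, not taken from the paper.
edges = {
    "premise":    [("lemma_a", 0.6), ("lemma_b", 0.4)],
    "lemma_a":    [("bridge", 1.0)],
    "lemma_b":    [("bridge", 0.7), ("dead_end", 0.3)],
    "bridge":     [("conclusion", 1.0)],  # distinct chains merge here: a "bridging node"
    "dead_end":   [],
    "conclusion": [],
}

def random_walk(start: str, rng: random.Random) -> list[str]:
    """Sample one reasoning chain: a weighted random walk until a sink is reached."""
    path = [start]
    while edges[path[-1]]:
        nodes, weights = zip(*edges[path[-1]])
        path.append(rng.choices(nodes, weights=weights, k=1)[0])
    return path

# Count how many of 1,000 sampled chains pass through each node. Nodes shared
# by most chains play the structural role of bridging nodes where paths merge.
rng = random.Random(0)
visits = defaultdict(int)
for _ in range(1000):
    for node in set(random_walk("premise", rng)):
        visits[node] += 1

for node, count in sorted(visits.items(), key=lambda kv: -kv[1]):
    print(f"{node:>10}: on {count} of 1000 chains")
```

In this toy graph the bridge node sits on nearly every sampled chain, which is exactly the structural role the framework assigns to bridging nodes where competing paths merge.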
The work extends beyond theoretical characterization by identifying a critical vulnerability in standard RLVR: supervised fine-tuning applied after tree freezing triggers catastrophic forgetting through structural rupture at bridging nodes. This finding has practical implications for LLM training protocols. The proposed Annealed-RLVR intervention strategically applies SFT at the point of maximum frustration, exploiting the system's structural dynamics to stabilize multi-step reasoning without triggering forgetting.
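The timing rule itself can be phrased as a simple scheduler. The following is a minimal sketch under the assumption that a scalar frustration proxy (e.g., disagreement among sampled reasoning chains) is observable during training; `run_rlvr_step`, `run_sft_step`, and the patience-based peak detector are hypothetical stand-ins, not the paper's procedure.

```python
from typing import Callable

def annealed_rlvr(
    run_rlvr_step: Callable[[], float],  # one RLVR update; returns a frustration proxy
    run_sft_step: Callable[[], None],    # one supervised fine-tuning update
    max_steps: int = 10_000,
    patience: int = 200,                 # steps allowed without a new frustration peak
) -> None:
    """Apply SFT at peak frustration, before the concept tree freezes.

    Hypothetical sketch: the scheduler watches a scalar frustration proxy and
    switches to SFT once the proxy has clearly stopped rising, instead of
    deferring SFT until after RLVR convergence (which, per the paper's
    analysis, ruptures bridging nodes).
    """
    best, since_best = float("-inf"), 0
    for step in range(max_steps):
        frustration = run_rlvr_step()
        if frustration > best:
            best, since_best = frustration, 0
        else:
            since_best += 1
        # No new peak for `patience` steps: treat frustration as having peaked
        # and intervene now rather than after freezing completes.
        if since_best >= patience:
            print(f"step {step}: frustration peaked at {best:.3f}; applying SFT")
            run_sft_step()
            return
```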
The empirical validation on a 1.5-billion-parameter model and subsequent benchmarking demonstrate that timing is the active ingredient, not the intervention itself. This is significant because it suggests LLM training dynamics are highly structured and potentially predictable. The method shows its largest improvements at high sampling budgets, where standard RLVR typically collapses, addressing a critical scaling limitation.
For the AI development community, this research provides actionable insights into training stability and reasoning robustness. Understanding the structural phase transitions in reasoning emergence could inform more efficient training protocols and better guardrails against policy degradation. The framework also raises questions about how these dynamics scale to larger models and whether similar patterns emerge in other learning paradigms.
- RLVR reasoning emergence follows predictable phase transitions modeled as inverse tree freezing in a concept network structure.
- Supervised fine-tuning applied after standard RLVR convergence causes catastrophic forgetting due to structural rupture at critical bridging nodes.
- Annealed-RLVR applies fine-tuning during peak frustration, before tree freezing, preventing policy collapse and improving scaling.
- Timing of intervention is the critical factor: identical fine-tuning triggers forgetting if applied after structural freezing completes.
- Reasoning chain lengthening emerges as a geometric necessity of sparse topology rather than a learned preference (see the sketch below).
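The last point lends itself to a quick sanity check on toy graphs: as a random DAG gets sparser, the shortest premise-to-conclusion path must lengthen, because shortcut edges disappear. The simulation below is our illustration of that geometric effect under assumed random-graph parameters, not the paper's experiment.

```python
import random
from collections import deque

def shortest_chain(n: int, p: float, rng: random.Random) -> int | None:
    """Shortest path 0 -> n-1 in a random DAG where edge (i, j), i < j, exists w.p. p."""
    adj = [[j for j in range(i + 1, n) if rng.random() < p] for i in range(n)]
    dist = {0: 0}
    queue = deque([0])
    while queue:  # breadth-first search over the sampled DAG
        i = queue.popleft()
        for j in adj[i]:
            if j not in dist:
                dist[j] = dist[i] + 1
                queue.append(j)
    return dist.get(n - 1)  # None if no premise-to-conclusion path exists

# Sparser topology forces longer chains: average the shortest chain length
# over many random DAGs at each edge density.
rng = random.Random(0)
for p in (0.5, 0.1, 0.05, 0.02):
    lengths = [l for _ in range(500) if (l := shortest_chain(60, p, rng)) is not None]
    print(f"p={p:<5} mean shortest chain = {sum(lengths) / len(lengths):.2f} steps")
```

No preference for longer chains is coded anywhere here; the chains stretch simply because sparse graphs lack shortcuts, which is the geometric-necessity argument in miniature.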