The Algorithm Is Not the Behavior: Learned Priors Override Look-Ahead in a Chess-Playing Neural Network
Researchers found that Leela Chess Zero, a top neural chess engine, internally computes correct solutions to chess puzzles but systematically overrides them in its final outputs. The override is attributed to learned safety priors rather than to a failure of the underlying algorithm, which reveals a gap between internal algorithmic capability and external behavior in neural networks.
The research exposes a fundamental disconnect in how neural networks operate: the presence of a learned algorithm does not guarantee that its result is used in practice. Leela Chess Zero has sophisticated internal look-ahead mechanisms that correctly identify immediate checkmates and optimal moves in intermediate layers, yet the model's final output frequently selects a suboptimal move. This phenomenon, termed 'forgotten puzzles', challenges the assumption that mechanistic interpretability studies can directly predict model behavior.
This work builds on growing evidence that neural networks learn multiple competing objectives simultaneously. Previous mechanistic analyses confirmed that the look-ahead algorithm functions correctly: future moves are represented, causally important, and linearly decodable. This study shows that higher-level learned priors nonetheless override the algorithm's output. Specifically, late layers increasingly prioritize conservative, safe play over aggressive winning moves, suggesting the model has learned a meta-objective beyond strict move optimization.
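The pattern described above, where an intermediate layer ranks the correct move first but the final output does not, can be made concrete with a small sketch. This is an illustration under stated assumptions rather than the study's method: the per-layer move scores, the helper names (`top_move`, `first_solving_layer`, `is_forgotten`), the toy numbers, and the exact definition of a "forgotten" puzzle used here are all hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical per-layer readout: one dict of move -> score per layer,
# ordered from the earliest layer to the final output.
LayerScores = List[Dict[str, float]]


def top_move(scores: Dict[str, float]) -> str:
    """Return the highest-scoring move in one layer's readout."""
    return max(scores, key=lambda move: scores[move])


def first_solving_layer(layers: LayerScores, correct: str) -> Optional[int]:
    """Index of the first layer whose top move is the correct one, if any."""
    for index, scores in enumerate(layers):
        if top_move(scores) == correct:
            return index
    return None


def is_forgotten(layers: LayerScores, correct: str) -> bool:
    """Assumed definition: solved at some intermediate layer, wrong at the output."""
    *intermediate, final = layers
    solved_inside = any(top_move(s) == correct for s in intermediate)
    return solved_inside and top_move(final) != correct


# Made-up toy scores, purely for illustration.
toy_layers: LayerScores = [
    {"Qh7#": 0.2, "Kg1": 0.5},  # early layer: mate not yet ranked first
    {"Qh7#": 0.9, "Kg1": 0.4},  # middle layer: mate ranked first
    {"Qh7#": 0.3, "Kg1": 0.8},  # final output: quiet move preferred
]

print(first_solving_layer(toy_layers, "Qh7#"))
print(is_forgotten(toy_layers, "Qh7#"))
```

In a real analysis the per-layer scores would come from decoding intermediate activations, for example with a linear probe. That step is omitted here because the source does not describe it in detail.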
The implications extend beyond chess. If safety priors can override internally computed solutions in game-playing agents, similar dynamics may affect language models, vision systems, and other neural networks for which mechanistic transparency is claimed. This complicates AI safety and alignment efforts, because observing algorithmic structure in intermediate layers can give false confidence about actual model behavior. The finding that steering against learned preferences recovers 61.7% of forgotten puzzles shows that the override is not inevitable but a learned, modifiable behavior.
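Steering of the kind mentioned above is often implemented by shifting a hidden activation along a chosen direction before it reaches the output layer. The sketch below shows that general idea on a two-dimensional toy example. It is not the study's procedure: the linear readout, the "safety" direction, the steering strength, and all numbers are assumptions made for illustration.

```python
import math
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def steer(hidden: Vector, direction: Vector, strength: float) -> Vector:
    """Shift `hidden` against the unit-normalized `direction` by `strength`."""
    norm = math.sqrt(dot(direction, direction))
    unit = [x / norm for x in direction]
    return [h - strength * u for h, u in zip(hidden, unit)]


def readout(hidden: Vector, weights: Dict[str, Vector]) -> Dict[str, float]:
    """Toy linear output head: one weight vector per candidate move."""
    return {move: dot(w, hidden) for move, w in weights.items()}


def best(scores: Dict[str, float]) -> str:
    """Move with the highest score."""
    return max(scores, key=lambda move: scores[move])


# Hypothetical setup: dimension 0 carries a "mate found" signal,
# dimension 1 carries a learned preference for quiet play.
toy_weights = {"Qh7#": [2.0, -1.0], "Kg1": [0.5, 0.5]}
toy_hidden = [1.0, 2.0]
safety_direction = [0.0, 1.0]

before = best(readout(toy_hidden, toy_weights))
after = best(readout(steer(toy_hidden, safety_direction, 2.0), toy_weights))
print(before, after)
```

In this toy setup the quiet move wins before steering and the mating move wins afterwards, which mirrors the claim that the correct answer is present internally and is being outweighed rather than missing.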
Future research must distinguish between algorithmic capability and behavioral output, potentially requiring new interpretability methods that measure behavioral alignment rather than computational structure alone. This work suggests that understanding neural networks requires analyzing the full inference chain, including how competing objectives interact in final decision layers.
- Neural networks can internally compute correct solutions while deliberately overriding them based on learned behavioral priors
- Mechanistic interpretability of intermediate layers does not guarantee understanding of final model outputs
- Safety and conservative preferences learned during training can override algorithmic optimization in game-playing agents
- Causal intervention shows override behavior is learned and modifiable, not inevitable
- AI alignment and safety research must account for competing objectives that emerge in late network layers