Path Channels and Plan Extension Kernels: a Mechanistic Description of Planning in a Sokoban RNN
Researchers reverse-engineered a Sokoban-playing RNN trained with model-free reinforcement learning and discovered that the network encodes planning strategies through specialized neural channels that represent directional movements and learned transition models. The findings demonstrate that neural networks can develop interpretable planning algorithms without explicit supervision, with path channels and extension kernels working together to implement bidirectional search and backtracking.
This research advances the mechanistic interpretability of neural networks by showing how a reinforcement learning agent develops a structured planning algorithm in its hidden representations. By systematically analyzing a convolutional RNN trained on Sokoban, researchers identified discrete 'path channels' that encode directional push actions, and they showed that the network implements planning through convolutional kernels that propagate information bidirectionally from goals and boxes. The discovery that negative values encode obstacles and trigger backtracking shows that the network learned a search-like algorithm entirely through gradient descent, without any architectural component designed explicitly for planning.
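The bidirectional propagation described above can be illustrated with a toy NumPy sketch. It is not the network's actual kernels or the researchers' code: the function names `extend_plan` and `plans_meet`, the single "push right" channel, the binary activations, and the one-square shift per update are all assumptions made for illustration. One frontier grows forward from the box, another grows backward from the goal, and a candidate plan exists once the two overlap.

```python
import numpy as np


def extend_plan(box_seed, goal_seed, blocked, steps):
    """Toy bidirectional extension of a single 'push right' path channel.

    All names and the update rule are hypothetical, for illustration only.
    box_seed, goal_seed: 2-D arrays, 1 at the box / goal square, 0 elsewhere.
    blocked: 2-D bool array, True where a wall occupies the square.
    Returns the forward and backward activation maps after `steps` updates.
    """
    fwd = box_seed.astype(float)
    bwd = goal_seed.astype(float)
    for _ in range(steps):
        # Forward extension: each active square activates its right neighbour.
        shifted = np.zeros_like(fwd)
        shifted[:, 1:] = fwd[:, :-1]
        fwd = np.maximum(fwd, shifted)
        fwd[blocked] = 0.0  # walls can never be part of the plan

        # Backward extension: each active square activates its left neighbour.
        shifted = np.zeros_like(bwd)
        shifted[:, :-1] = bwd[:, 1:]
        bwd = np.maximum(bwd, shifted)
        bwd[blocked] = 0.0
    return fwd, bwd


def plans_meet(fwd, bwd):
    """True if the forward and backward frontiers share at least one square."""
    return bool(np.any((fwd > 0) & (bwd > 0)))
```

On a one-row board of seven squares with the box at column 1 and the goal at column 5, two updates are enough for the frontiers to meet at column 3. Placing a wall at column 3 keeps them apart, which is the situation the backtracking mechanism has to handle.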
The work builds on growing efforts to understand deep learning systems mechanistically. As AI systems are increasingly deployed in critical applications, understanding how they arrive at decisions becomes essential for safety, debugging, and improvement. This research demonstrates that reverse-engineering can reveal learned algorithms that match human-comprehensible concepts like planning and backtracking.
The implications extend beyond Sokoban. If similar interpretable structures emerge in larger, more complex neural networks, this methodology could help us understand planning in language models, decision-making systems, and other domains where transparency is valuable. The findings also suggest that model-free reinforcement learning can discover structured planning algorithms when given sufficient capacity and training signal, without planning-specific inductive biases.
Future work should examine whether these mechanistic insights transfer to other RL agents and whether identifying such structures enables better model design, faster training, or improved generalization. The research opens pathways for extracting actionable knowledge from trained networks rather than treating them as black boxes.
- The Sokoban RNN stores plans as activations in specialized 'path channels' that represent directional movements and box-pushing actions.
- Convolutional kernels between channels encode learned transition models and implement bidirectional planning from both goals and boxes.
- Negative obstacle values trigger backtracking by propagating backwards through path channels, allowing the network to prune failed plans.
- Model-free reinforcement learning discovers interpretable, human-comprehensible planning algorithms without explicit architectural supervision.
- Mechanistic reverse-engineering provides a framework for understanding decision-making in neural networks rather than treating them as black boxes.
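
The backtracking takeaway can be made concrete with a toy sketch under clearly stated assumptions: `prune_blocked_plan`, the 1-D "push right" channel, and the rule that a negative square overwrites the planned square just before it are hypothetical simplifications, not the network's measured weights.

```python
import numpy as np


def prune_blocked_plan(path, obstacle, steps):
    """Toy backtracking on a 1-D 'push right' path channel.

    Hypothetical illustration only, not the trained network's update rule.
    path: 1-D sequence, 1 where a rightward push is planned, 0 elsewhere.
    obstacle: index of a square discovered to be blocked.
    Returns the activations after `steps` updates.
    """
    act = np.asarray(path, dtype=float).copy()
    act[obstacle] = -1.0  # the obstacle is encoded as a negative value
    for _ in range(steps):
        # Each square looks at its right neighbour, the next square on the
        # plan, and keeps only that neighbour's negative part.
        upstream = np.zeros_like(act)
        upstream[:-1] = np.minimum(act[1:], 0.0)
        # A planned square is cancelled once the square after it is negative.
        act = np.where((upstream < 0) & (act > 0), upstream, act)
    return act
```

With a plan on squares 1 to 3 of a five-square row and an obstacle at square 4, each update cancels one more planned square, working backwards from the obstacle. Squares that were never part of the plan stay at zero, so the negative signal stops at the start of the plan.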