When and How Long? The Readout-Mediator Angle in Temporal Reasoning
Researchers demonstrate that linear probes can successfully decode information from neural networks while remaining completely disconnected from how models actually process that information. Using calendar-date reasoning tasks, they show that probes identifying day-of-year information are orthogonal to the causal mechanisms models use for duration reasoning, revealing a fundamental flaw in probe-based interpretability methods.
This research exposes a critical vulnerability in a widely-used interpretability technique: linear probes cannot be trusted to identify causal computation paths within neural networks. The authors discovered that a probe successfully recovering day-of-year information from model activations had zero relevance to actual model computation, while a subspace found through Distributed Alignment Search (DAS) was essential for correct answers. The readout-mediator angle between these subspaces matched random expectations, proving the probe operated in a direction the model had effectively abandoned. Through circuit analysis, researchers revealed the true computational path: attention heads use learned offsets to route month-grained context, and subsequent MLPs convert absolute dates into durations—all downstream of where the probe searched. This dissociation held consistently across model scales from 1.5 billion to 9 billion parameters and multiple architectures, with preliminary evidence suggesting the phenomenon generalizes beyond temporal reasoning. The implications are severe for AI safety applications. Organizations proposing to deploy probes as runtime safety monitors face a critical problem: probes can report high confidence on meaningless directions while the model silently executes computations elsewhere. This creates false confidence in safety measures that provide no actual oversight. The research highlights how interpretability methods requiring only linear operations can fundamentally mislead researchers about neural network behavior, suggesting that understanding AI systems requires deeper mechanistic investigation than probe-based approaches provide.
- →Linear probes can decode information perfectly while being completely orthogonal to actual model computation pathways
- →The readout-mediator angle metric reveals when probes extract spurious correlations rather than causal features
- →Sparse autoencoders and distributed alignment search identify causally-relevant subspaces that linear probes systematically miss
- →Probe-based safety monitoring creates dangerous false confidence by reporting high accuracy on abandoned computational directions
- →The phenomenon replicates across multiple model scales and families, suggesting orthogonality is a general interpretability failure mode