Detection vs. Execution: Single-Bucket Probes Miss Half the Mamba-2 State Sink
Researchers demonstrate that single-bucket probes in Mamba-2 language models identify representational signatures but fail to capture complete computational circuits, missing up to half of the execution layer. The study shows that probe-based mechanistic interpretability can conflate detection mechanisms with execution mechanisms. The stakes are concrete: ablating the identified head groups collapses retrieval accuracy on downstream tasks.
This research exposes a fundamental limitation in how AI researchers use probes to understand neural network internals. Mechanistic interpretability has become increasingly important as models grow more complex and less transparent, but this work demonstrates that the standard approach of identifying representational signatures through probes creates a false equivalence with functional execution. In Mamba-2, the state sink phenomenon—where certain heads disproportionately activate on boundary tokens—appears to involve two distinct head populations with markedly different causal roles despite sharing similar representational patterns.
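To make "disproportionately activate on boundary tokens" concrete, the following minimal sketch scores each head by the ratio of its mean activation on boundary positions to its mean activation elsewhere. The function name `sink_scores`, the `(n_heads, n_positions)` array layout, and the ratio itself are assumptions made for illustration; the study's actual metric is not given here.

```python
import numpy as np


def sink_scores(activations, boundary_mask, eps=1e-8):
    """Ratio of mean activation on boundary tokens to mean activation elsewhere.

    activations:   (n_heads, n_positions) non-negative activation magnitudes
                   (hypothetical layout, chosen for this sketch).
    boundary_mask: (n_positions,) booleans, True where the token is a boundary
                   token such as BOS.
    """
    activations = np.asarray(activations, dtype=float)
    boundary_mask = np.asarray(boundary_mask, dtype=bool)
    on_boundary = activations[:, boundary_mask].mean(axis=1)
    off_boundary = activations[:, ~boundary_mask].mean(axis=1)
    # eps guards against division by zero for heads that are silent off-boundary.
    return on_boundary / (off_boundary + eps)


# Toy input: head 0 spikes on the first (boundary) position, head 1 is flat.
toy_activations = [[10.0, 1.0, 1.0, 1.0],
                   [1.0, 1.0, 1.0, 1.0]]
toy_mask = [True, False, False, False]
toy_scores = sink_scores(toy_activations, toy_mask)
```

A score well above 1 flags a head as sink-like under this assumed definition. A score of this kind is a detection signal only: it says nothing about whether the head is causally necessary.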
The distinction between BOS-specialist heads (5% of heads, strong causal effects) and dual heads (27-35% of heads, weak causal effects) illustrates this gap precisely. While both populations exhibit strong representational similarity, only specialists drive actual model behavior during inference. This finding challenges the assumption that representational similarity alone indicates functional equivalence, a cornerstone of current interpretability methodology.
The practical consequences are substantial. Ablation experiments show that removing specialist heads breaks performance on challenging retrieval tasks across multiple Mamba variants, reducing accuracy from 1.00 to 0.00 at meaningful context lengths. Conversely, removing the larger detection-layer complement preserves baseline performance, indicating that the execution circuit cannot be read off representational patterns alone. The distinction matters for model safety and steering: knowing which heads actually control behavior, and which merely correlate with the computation, changes how researchers should approach mechanistic control and interpretability work.
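The gap between representational similarity and causal effect can be illustrated with a toy NumPy sketch. This is not the study's code and does not involve Mamba-2. The head counts, the gains, the binary task, and the helper names (`cosine`, `accuracy_with_ablation`, `run_toy_ablation`) are all assumptions chosen so that two head classes look alike under cosine similarity while only one of them drives the output.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def accuracy_with_ablation(head_outputs, labels, ablated=()):
    """Zero the listed head columns, sum the rest, and score a sign readout."""
    kept = head_outputs.copy()
    ablated = list(ablated)
    if ablated:
        kept[:, ablated] = 0.0
    preds = (kept.sum(axis=1) > 0).astype(int)
    return float((preds == labels).mean())


def run_toy_ablation(seed=0, n_heads=20, n_examples=200, dim=16):
    rng = np.random.default_rng(seed)
    specialist = [0]            # hypothetical: 1 of 20 heads (5%)
    dual = [1, 2, 3, 4, 5, 6]   # hypothetical: 6 of 20 heads (30%)

    # Representational signatures: both classes share one direction plus
    # small noise, so a similarity probe cannot tell them apart.
    shared = rng.normal(size=dim)
    signatures = rng.normal(size=(n_heads, dim))
    for h in specialist + dual:
        signatures[h] = shared + 0.05 * rng.normal(size=dim)

    # Causal contribution to a binary toy task: strong for the specialist,
    # weak for the dual heads (gains are illustrative assumptions).
    labels = rng.integers(0, 2, size=n_examples)
    signal = 2.0 * labels - 1.0
    head_outputs = 0.1 * rng.normal(size=(n_examples, n_heads))
    head_outputs[:, specialist] += 5.0 * signal[:, None]
    head_outputs[:, dual] += 0.02 * signal[:, None]

    # Class-conditional ablation: remove one head class at a time.
    return {
        "cosine_specialist_vs_dual": cosine(signatures[0], signatures[1]),
        "acc_baseline": accuracy_with_ablation(head_outputs, labels),
        "acc_without_dual": accuracy_with_ablation(head_outputs, labels, dual),
        "acc_without_specialist": accuracy_with_ablation(
            head_outputs, labels, specialist
        ),
    }


results = run_toy_ablation()
```

In this toy setup the specialist and dual signatures are nearly parallel, yet only removing the specialist degrades the readout. Because the toy task is binary, accuracy falls toward chance (0.5) rather than to the 0.00 reported for the retrieval tasks, so the numbers are not comparable to the study's results.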
- Single-bucket probes miss up to half of the execution layer in Mamba-2, confusing detection layers with actual computational mechanisms
- BOS-specialist heads (5% of total) carry disproportionate causal weight despite being vastly outnumbered by dual heads (27-35% of total) with similar representational signatures
- Representational similarity does not guarantee functional equivalence: ablation studies reveal stark differences in causal effects between representationally similar head populations
- Removing specialist heads collapses long-context retrieval accuracy from 1.00 to 0.00, while removing the detection-layer complement preserves baseline accuracy
- Distinguishing execution circuits from detection circuits requires class-conditional ablation, not cosine similarity alone