The Shape of Wisdom: Decision Trajectories in Language Models
Researchers analyzed how language models make decisions by tracing answer scores across neural network layers in 9,000 MMLU trajectories, finding that correct answers are often unstable and that attention mechanisms better preserve correctness than MLP layers. The study reveals decision-making is a distributed process rather than a final-layer phenomenon, with implications for understanding model reliability and interpretability.
This research addresses a gap in neural network interpretability: how do language models actually arrive at decisions? Rather than treating the output layer as the sole decision point, the study traces decision trajectories through intermediate layers and finds that correctness and confidence are decoupled properties. The largest category of responses, unstable-correct answers, suggests that models often reach right answers through fragile reasoning paths that are vulnerable to perturbation.
The methodology goes beyond standard interpretability by measuring three distinct metrics: answer margin, margin change, and proximity to decision flips. This granular approach lets researchers separate answers that remain settled throughout computation from those that wobble precariously toward incorrect alternatives. The finding that attention mechanisms preserve correctness better than MLP layers challenges assumptions about layer-wise contributions to accurate reasoning.
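The three metrics named above can be illustrated with a small sketch. The summary does not give the study's exact definitions, so everything here is an assumption: the margin is taken as the correct option's score minus the best rival's score at each layer, margin change as the layer-to-layer difference, and "near a flip" as an absolute margin below a hypothetical threshold `eps`. The function names and the late-layer stability rule are likewise illustrative, not the authors' method.

```python
import numpy as np

def margin_trajectory(scores, correct):
    """Per-layer margin: correct option's score minus the best rival's score.

    scores: array-like of shape (layers, options); correct: index of the right option.
    (Assumed definition; the study's exact formula is not given in this summary.)
    """
    scores = np.asarray(scores, dtype=float)
    rivals = np.delete(scores, correct, axis=1)
    return scores[:, correct] - rivals.max(axis=1)

def margin_change(margins):
    """Layer-to-layer change in margin."""
    return np.diff(margins)

def label(margins, eps=0.5):
    """Hypothetical rule: an answer is 'stable' if, over the later half of the
    layers, the margin keeps the final sign and never comes within eps of zero."""
    outcome = "correct" if margins[-1] > 0 else "incorrect"
    late = margins[len(margins) // 2:]
    same_sign = np.all(np.sign(late) == np.sign(margins[-1]))
    far_from_flip = np.all(np.abs(late) >= eps)
    return ("stable-" if same_sign and far_from_flip else "unstable-") + outcome

# Made-up scores for one question: 4 layers, 4 answer options, option 0 is correct.
scores = [[0.1, 0.2, 0.0, 0.1],
          [0.5, 0.4, 0.1, 0.0],
          [1.5, 0.3, 0.2, 0.1],
          [2.5, 0.5, 0.3, 0.2]]
margins = margin_trajectory(scores, correct=0)
print(margins, margin_change(margins), label(margins))
```

Under this rule a trajectory can end with a positive margin and still be labeled unstable-correct if it hovers near zero in the later layers, which is the kind of case the study singles out as fragile.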
For the AI development community, these insights matter substantially. Practitioners building production systems need assurance that a model's answers are not correct by accident, that is, that the reasoning is robust rather than contingent. The span deletion experiments show that removing answer-supporting text hurts margins while removing distractors helps them, which provides a concrete lever for separating what the model has actually learned from what it merely correlates with its inputs.
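A minimal sketch of the span deletion idea follows, assuming a margin function is available for any input. Here `toy_margin` is a hypothetical stand-in that counts hand-picked supporting and distracting words; in the study the margin comes from the model's own answer scores, and the span selection procedure is not described in this summary.

```python
def delete_span(tokens, start, end):
    """Return the tokens with positions [start, end) removed."""
    return list(tokens[:start]) + list(tokens[end:])

def span_deletion_effect(tokens, span, margin_fn):
    """Change in margin caused by deleting one span (negative means the deletion hurt)."""
    start, end = span
    return margin_fn(delete_span(tokens, start, end)) - margin_fn(tokens)

def toy_margin(tokens):
    """Hypothetical stand-in for a model margin: supporting words add, distractors subtract."""
    support, distract = {"paris"}, {"lyon"}
    return sum(t in support for t in tokens) - sum(t in distract for t in tokens)

tokens = "the capital is paris not lyon".split()
support_effect = span_deletion_effect(tokens, (3, 4), toy_margin)     # deletes "paris"
distractor_effect = span_deletion_effect(tokens, (5, 6), toy_margin)  # deletes "lyon"
print(support_effect, distractor_effect)
```

With a real model, `margin_fn` would rerun the prompt and recompute the answer margin; the sign of the effect then indicates whether the deleted span was supporting the answer or distracting from it.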
Moving forward, this work opens avenues for improving training procedures that might enforce stable decision trajectories rather than merely optimizing final outputs. Understanding failure modes in intermediate layers could enable better detection of hallucination-prone responses before they reach users, directly improving deployed system reliability.
- Language models make decisions through distributed processes across layers, not just at the output layer
- Correct answers are frequently unstable, meaning models reach right conclusions through fragile reasoning paths
- Attention mechanisms better preserve answer correctness than MLP layers in stable-correct cases
- Span deletion experiments show removing answer-supporting text hurts margins while removing distractors improves them
- Decision trajectory analysis provides a reproducible method to identify fragile versus settled model predictions