Dense Supervision Is Not Enough: The Readout Blind Spot in Looped Language Models
Researchers identify a supervision blind spot in looped language models: dense per-loop cross-entropy loss fails to control the scale of the hidden state carried through recurrent transitions. The study shows that scale-invariant readout mechanisms such as RMSNorm hide radial scaling from the loss, allowing hidden-state norms to grow into the thousands or tens of thousands. It proposes architectural solutions, including scale-visible readouts and explicit normalization, that improve efficiency and perplexity at matched inference depths.
This research addresses an architectural problem in recurrent language models that has practical implications for efficient inference. Looped transformers process information through iterative cycles, feeding hidden states back into the computation while also generating a prediction at each loop. The supervision challenge arises because standard per-loop cross-entropy loss constrains only the variables that the readout exposes, leaving a blind spot for internal state variables that evolve through the recurrent transitions.
The core issue stems from widely adopted normalization schemes such as RMSNorm and LayerNorm, which remove radial scale before the readout, so the per-loop loss never sees it. Meanwhile, pre-norm residual connections keep accumulating and amplifying this unsupervised scale across recurrent cycles. In the tested models, which range from 44M to 129M parameters, this mechanism drives hidden-state norms into the thousands to tens of thousands despite dense per-loop supervision.
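The mechanism can be illustrated with a small NumPy toy. This is a hedged sketch, not the paper's setup: the sizes, the random matrix `W`, and the recurrence (a block whose output is perfectly aligned with the state, the worst case for radial growth) are all assumptions chosen for illustration. It shows that a loss computed through an RMSNorm readout stays essentially constant while the norm of the carried state keeps growing.

```python
import numpy as np

def rmsnorm(h, eps=1e-6):
    """Rescale h to unit root-mean-square (learned gain omitted for brevity)."""
    return h / np.sqrt(np.mean(h * h) + eps)

def readout_loss(h, W, target):
    """Cross-entropy of a softmax readout applied to the normalized state."""
    logits = rmsnorm(h) @ W
    logits = logits - logits.max()              # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[target])

rng = np.random.default_rng(0)
d_model, vocab = 16, 32                         # toy sizes (assumed)
W = rng.normal(size=(d_model, vocab))           # toy unembedding matrix
h = rng.normal(size=d_model)                    # toy initial hidden state
target = 3                                      # arbitrary target token id

# Toy recurrence: the block output is assumed to point along the state itself.
# Real looped blocks are more complex; this only isolates the radial direction.
losses, norms = [], []
state = h.copy()
for _ in range(1000):
    state = state + rmsnorm(state)              # pre-norm residual update
    losses.append(readout_loss(state, W, target))
    norms.append(float(np.linalg.norm(state)))

print(f"norm after 1 loop:        {norms[0]:.1f}")
print(f"norm after 1000 loops:    {norms[-1]:.1f}")
print(f"loss spread across loops: {max(losses) - min(losses):.2e}")
```

Because the readout divides out the scale, every loop in this toy receives almost exactly the same loss, so the loss gradient carries no signal that could push the norm back down.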
This finding carries practical significance for deploying language models with variable-depth early exits, a key efficiency technique. While dense supervision successfully trains exit points for immediate prediction, it fails to constrain the recurrent scale dynamics that affect subsequent computation. The researchers propose two complementary solutions: making scale explicitly visible to loss functions through modified readout designs, or architecturally removing scale from the recurrent loop entirely.
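The two directions can be sketched in a few lines of NumPy. The summary does not specify the paper's actual designs, so everything here is a hypothetical stand-in: `visible_loss` uses an unnormalized linear readout as one possible scale-visible readout, and `run_loops` with `renormalize=True` rescales the carried state after every loop as one possible way to remove scale from the recurrence. The `block` function is likewise a placeholder for the looped transformer block.

```python
import numpy as np

def rmsnorm(h, eps=1e-6):
    """Rescale h to unit root-mean-square (learned gain omitted)."""
    return h / np.sqrt(np.mean(h * h) + eps)

def block(x, A):
    """Placeholder for the looped block: a fixed nonlinear map (assumption)."""
    return np.tanh(A @ x)

def run_loops(h, A, n_loops, renormalize):
    """Run a pre-norm residual recurrence and record the state norm per loop."""
    norms = []
    for _ in range(n_loops):
        h = h + block(rmsnorm(h), A)            # pre-norm residual update
        if renormalize:
            h = rmsnorm(h)                      # hypothetical fix: carry no scale forward
        norms.append(float(np.linalg.norm(h)))
    return norms

def visible_loss(h, W, target):
    """Hypothetical scale-visible readout: project the raw state, no normalization."""
    logits = h @ W
    logits = logits - logits.max()              # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[target])

rng = np.random.default_rng(0)
d_model, vocab = 16, 32                         # toy sizes (assumed)
A = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
W = rng.normal(size=(d_model, vocab))
h0 = rng.normal(size=d_model)

free_norms = run_loops(h0, A, n_loops=200, renormalize=False)
fixed_norms = run_loops(h0, A, n_loops=200, renormalize=True)
print("final norm, scale carried:  ", round(free_norms[-1], 2))
print("final norm, scale removed:  ", round(fixed_norms[-1], 2))
print("visible loss at scale 1:    ", visible_loss(h0, W, target=3))
print("visible loss at scale 1000: ", visible_loss(1000.0 * h0, W, target=3))
```

With renormalization, the carried state is pinned to a norm of about `sqrt(d_model)` by construction. With the unnormalized readout, rescaling the state changes the loss, so the gradient can act on the radial direction.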
Their experiments show that scale-controlled variants achieve lower perplexity than standard approaches at matched inference depths. The work connects supervision design to recurrent architecture design and offers concrete design principles for practitioners building efficient looped models, with consequences for inference efficiency and model calibration in production systems.
- Dense cross-entropy loss provides incomplete supervision in looped models, controlling only readout-exposed variables while leaving recurrent scale uncontrolled
- Scale-invariant normalization schemes like RMSNorm hide radial information from loss functions, allowing hidden-state norms to grow uncontrollably through recurrent cycles
- Making scale visible to loss functions or removing it from recurrence entirely resolves the supervision blind spot and improves perplexity efficiency
- Uncontrolled scale particularly affects variable-depth early exits, where dense supervision trains exits but fails to constrain recurrent dynamics
- Proposed architectural fixes include scale-visible readouts and explicit norm penalties, achieving measurably better performance at matched inference depths