Recoverable but Not Stationary: Local Linear Structures in Weights and Activations
Researchers demonstrate that linear structures in neural networks exist locally rather than globally, with task-specific directions that evolve during training rather than remaining stationary. Their findings on transformer models and LoRA adapters suggest that parameter adjustment techniques like task vectors work through dynamic geometric patterns that partially align across weight and activation spaces.
This research addresses a fundamental assumption underlying modern neural network control techniques: that learned behaviors occupy fixed linear directions in weight space. The authors systematically challenge this premise through experiments on both synthetic transformers and real models like DistilGPT-2 and GPT-2, revealing that while low-rank task-gradient structures exist, they are fundamentally non-stationary.
The work builds on research in mechanistic interpretability and controllable AI, where linear operations, from task vectors to activation steering, have been observed to modify model behavior effectively. Previous research suggested these linear directions were stable, task-specific planes in high-dimensional space. This study instead reports that useful bases drift significantly within 100 training steps, and that only a basis built from the initial trajectory prefix captures a meaningful share of the recovery displacement (77%).
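To make the "share of displacement captured by a basis" idea concrete, here is a minimal sketch. It assumes the share is measured as the fraction of the displacement's squared norm that lies in the span of the early update vectors; the summary does not say how the authors compute their 77% figure, so this definition and the helper names `orthonormalize` and `captured_fraction` are illustrative assumptions, not the paper's method. Plain Python is used to keep the sketch dependency-free.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors, tol=1e-12):
    """Modified Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors already (numerically) in the span
            basis.append([wi / norm for wi in w])
    return basis


def captured_fraction(displacement, early_updates):
    """Fraction of squared displacement norm inside span(early_updates).

    Assumed definition for illustration only; `displacement` must be nonzero.
    """
    basis = orthonormalize(early_updates)
    projected_sq = sum(dot(displacement, b) ** 2 for b in basis)
    return projected_sq / dot(displacement, displacement)


# Toy example: two early updates span the x-y plane of a 3-D parameter space.
early = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
total = [3.0, 4.0, 12.0]
print(captured_fraction(total, early))
```

In a real setting the vectors would be flattened adapter updates, and a basis refit at a later training step could be compared against the early one to quantify drift.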
The implications extend to practical AI development and safety. If linear structures are locally recoverable but globally unstable, model editing and steering techniques likely require repeated recalibration rather than one-time calibration, which affects how practitioners design adaptation methods and safety interventions. The Gaussian local-linear theorem the authors develop explains why random search succeeds even in extreme dimensionality, providing theoretical grounding for empirical observations in parameter space exploration.
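The summary does not state the theorem itself, so the following is only a rough intuition and not the authors' result. It relies on a standard fact: if the loss is locally linear, an isotropic Gaussian perturbation changes it by a zero-mean Gaussian amount, so about half of all random perturbations reduce the loss regardless of dimension. The function name `improvement_rate` and all parameter values are hypothetical.

```python
import random


def improvement_rate(dim, trials=2000, sigma=0.01, seed=0):
    """Estimate how often a Gaussian perturbation lowers a linear loss.

    Assumes loss(theta + eps) - loss(theta) = grad . eps exactly
    (a purely first-order, locally linear model).
    """
    rng = random.Random(seed)
    # Hypothetical local gradient; any nonzero vector behaves the same way.
    grad = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    improved = 0
    for _ in range(trials):
        # First-order loss change for one perturbation eps ~ N(0, sigma^2 I).
        delta = sum(g * rng.gauss(0.0, sigma) for g in grad)
        if delta < 0:
            improved += 1
    return improved / trials


for d in (10, 200):
    print(d, improvement_rate(d))
```

The size of each improvement still depends on the step scale and on how far the linear approximation holds, which this sketch does not model.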
For the AI research community, these findings redirect focus from discovering universal geometric properties toward understanding local geometric evolution. The demonstrated 0.58 cosine similarity between single gradient steps and activation steering vectors indicates coupling between parameter and activation spaces, suggesting integrated approaches to model control may be more effective than separate parameter-space and activation-space techniques.
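For reference, cosine similarity between two direction vectors can be computed as below. This is a generic sketch of the metric only; how the authors extract the activation shift of a gradient step and the steering vector is not described in this summary, and the example vectors are made up.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two nonzero, equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up stand-ins for an activation shift and a steering vector.
shift = [1.0, 0.0]
steer = [1.0, 1.0]
print(cosine_similarity(shift, steer))
```

A value of 1 means the two directions coincide, 0 means they are orthogonal, and -1 means they point in opposite directions.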
- Linear task structures in neural networks are locally recoverable but non-stationary, drifting substantially during training rather than remaining fixed.
- Trajectory-prefix bases from initial recovery updates capture 77% of LoRA adaptation displacement, suggesting early training dynamics are crucial.
- Parameter perturbations and activation steering show strong coupling, with single gradient steps producing activation shifts comparable to dedicated steering vectors.
- Random parameter search effectiveness in high dimensions is theoretically justified by a Gaussian local-linear theorem applicable to neural network geometry.
- Practical model editing and steering techniques require local recalibration rather than relying on static learned directions.