Double Preconditioning (DoPr): Optimization for Test-Time Performance, not Validation Loss
Researchers introduce Double Preconditioning (DoPr), an optimization technique that combines gradient-wise and activation-wise preconditioning to improve how neural networks perform once deployed. The method targets test-time feedback, the setting in which an autoregressive model is rolled out on its own earlier predictions and one-step errors accumulate, and it improves task performance without requiring a corresponding improvement in validation loss.
Double Preconditioning challenges a common assumption in machine learning: that minimizing validation loss necessarily translates to better real-world performance. In autoregressive tasks like language generation and robot control, networks are trained on one-step predictions but deployed by feeding each prediction back in as the next input. Small one-step errors then compound over the rollout, in the worst case exponentially, creating a persistent train-test mismatch known as test-time feedback.
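The compounding effect can be illustrated with a toy example. The sketch below is not taken from the DoPr work: it assumes a hypothetical scalar system `x_next = a * x` and a learned model whose coefficient `a_hat` is off by 2%. All function names and constants are made up for illustration. Teacher-forced (one-step) error stays small at every step, while the error of a rollout that consumes its own predictions keeps growing.

```python
def true_step(x, a=1.0):
    """Ground-truth one-step dynamics of the toy system (assumed, illustrative)."""
    return a * x


def model_step(x, a_hat=1.02):
    """Learned one-step model with a small coefficient error (a_hat - a)."""
    return a_hat * x


def teacher_forced_errors(x0, steps):
    """One-step errors when the model always receives the true state."""
    errors, x = [], x0
    for _ in range(steps):
        errors.append(abs(model_step(x) - true_step(x)))
        x = true_step(x)  # the next input is the true state
    return errors


def rollout_errors(x0, steps):
    """Errors when the model is fed its own previous prediction."""
    errors, x_true, x_model = [], x0, x0
    for _ in range(steps):
        x_true = true_step(x_true)
        x_model = model_step(x_model)  # the next input is the model's own output
        errors.append(abs(x_model - x_true))
    return errors
```

Comparing `teacher_forced_errors(1.0, 50)` with `rollout_errors(1.0, 50)` shows the gap: a validation loss computed on one-step predictions only sees the first list, while deployment experiences the second. The exponential growth here follows from the toy model's slightly unstable coefficient and should not be read as a general rate.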
The method combines two optimization strategies. Gradient-wise preconditioning, as used by optimizers like Adam and Muon, transforms the raw gradient before the update is applied; Adam, for example, rescales the step per parameter using running gradient statistics. Activation-wise preconditioning, derived from KFAC methods, applies the same principle using statistics of the network's activations. The combination specifically targets the error accumulation inherent to sequential prediction tasks.
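To make the two ingredients concrete, here is a minimal NumPy sketch for a single linear layer. It is an assumption-laden illustration, not the DoPr algorithm: the summary above does not specify how the two preconditioners are composed, so the ordering used here (a KFAC-style input-activation factor first, then an Adam-style per-parameter rescaling) is a guess, and every name (`activation_precondition`, `adam_direction`, `double_preconditioned_step`, `demo`) and hyperparameter is hypothetical.

```python
import numpy as np


def activation_precondition(grad, acts, damping=1e-3):
    """Right-multiply grad (out, in) by the inverse damped activation second moment."""
    n, d_in = acts.shape
    cov = acts.T @ acts / n + damping * np.eye(d_in)
    # cov is symmetric, so solve(cov, grad.T).T equals grad @ inv(cov).
    return np.linalg.solve(cov, grad.T).T


def init_state(shape):
    """Adam-style optimizer state: step count and first/second moment estimates."""
    return {"t": 0, "m": np.zeros(shape), "v": np.zeros(shape)}


def adam_direction(grad, state, beta1=0.9, beta2=0.999, eps=1e-8):
    """Per-parameter rescaling of the gradient with bias-corrected moments."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return m_hat / (np.sqrt(v_hat) + eps)


def double_preconditioned_step(weights, acts, out_grads, state, lr=1e-2):
    """One update for y = acts @ weights.T (assumed composition, not the paper's)."""
    n = acts.shape[0]
    grad = out_grads.T @ acts / n                # plain gradient, shape (out, in)
    grad = activation_precondition(grad, acts)   # activation-wise step
    return weights - lr * adam_direction(grad, state)  # gradient-wise step


def demo(steps=200, seed=0):
    """Fit a linear map on badly scaled inputs; returns (initial_loss, final_loss)."""
    rng = np.random.default_rng(seed)
    acts = rng.normal(size=(256, 3)) * np.array([1.0, 0.1, 10.0])
    w_true = rng.normal(size=(2, 3))
    targets = acts @ w_true.T
    weights = np.zeros((2, 3))
    state = init_state(weights.shape)

    def loss(w):
        return float(np.mean((acts @ w.T - targets) ** 2))

    initial = loss(weights)
    for _ in range(steps):
        out_grads = acts @ weights.T - targets   # dL/dy for squared error
        weights = double_preconditioned_step(weights, acts, out_grads, state)
    return initial, loss(weights)
```

Calling `demo()` returns the loss before and after training on the toy regression problem. The sketch only uses the input-activation factor of a KFAC-style approximation and recomputes it from the current batch; a practical implementation would more likely keep running averages and handle multi-layer networks, which is outside what the summary describes.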
The significance extends beyond academic novelty. For practitioners deploying large language models, diffusion models, and robotic systems, DoPr offers a practical intervention requiring minimal implementation effort. The decoupling of validation loss improvements from downstream performance gains raises critical questions about model evaluation methodology. Organizations may discover their current metrics inadequately reflect production performance, necessitating revised evaluation protocols.
The work is most relevant to the training infrastructure where optimizers are chosen and configured. As models grow larger and deployment contexts become more complex, the choice of optimization algorithm increasingly shapes practical performance. The finding that conventional metrics can mislead practitioners suggests the field should re-examine its evaluation standards for autoregressive systems.
- DoPr combines gradient-wise and activation-wise preconditioning to reduce error accumulation in sequential prediction tasks.
- Test-time feedback, the train-test mismatch that arises when a model trained on one-step predictions is rolled out sequentially, persists in autoregressive models regardless of training loss.
- Downstream task performance improvements with DoPr occur independently of validation loss improvements, challenging traditional evaluation assumptions.
- The technique applies across multiple domains, including language modeling, generative modeling, and robot policy learning.
- Results suggest the ML community needs revised evaluation frameworks designed specifically for test-time feedback scenarios.