Researchers demonstrate that weight decay during language model pretraining significantly improves model plasticity, the ability to adapt to downstream tasks through fine-tuning. The study reports a counterintuitive result: higher weight decay produces weaker base models but stronger performance after task-specific training, which challenges conventional approaches to hyperparameter optimization.
This research addresses a fundamental gap in how large language models are optimized. Traditionally, pretraining hyperparameters are tuned on validation loss alone, on the assumption that better base model performance translates to better downstream results. The arXiv study challenges this assumption by introducing plasticity as a distinct evaluation criterion, showing that weight decay's regularization effects produce models better equipped for transfer learning even though they initially appear suboptimal.
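The difference between the two selection criteria can be made concrete with a small sketch. Everything here is an assumption for illustration: the `PretrainingRun` record, the two helper functions, and above all the numbers, which are made-up placeholders and not results from the study.

```python
from dataclasses import dataclass


@dataclass
class PretrainingRun:
    """Hypothetical record of one pretraining run (not from the study)."""
    weight_decay: float
    validation_loss: float   # pretraining validation loss, lower is better
    finetuned_score: float   # downstream score after fine-tuning, higher is better


def pick_by_validation_loss(runs):
    """Conventional criterion: choose the run with the lowest validation loss."""
    return min(runs, key=lambda run: run.validation_loss)


def pick_by_plasticity(runs):
    """Plasticity criterion: choose the run that scores best after fine-tuning."""
    return max(runs, key=lambda run: run.finetuned_score)


# Made-up placeholder values, chosen only so the two criteria disagree.
runs = [
    PretrainingRun(weight_decay=0.1, validation_loss=2.50, finetuned_score=0.70),
    PretrainingRun(weight_decay=0.5, validation_loss=2.55, finetuned_score=0.74),
]

print(pick_by_validation_loss(runs).weight_decay)
print(pick_by_plasticity(runs).weight_decay)
```

With these placeholder values the two criteria select different runs, which is the kind of disagreement the study describes.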
The findings emerge from systematic experiments on weight decay's mechanistic effects: it promotes more linearly separable learned representations, regularizes attention patterns so that they are more structured, and reduces overfitting on pretraining data. Together, these properties improve the model's capacity to learn new tasks efficiently rather than rigidly encoding pretraining-specific patterns.
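For readers unfamiliar with the mechanism itself, weight decay shrinks every weight toward zero at each optimizer step. The following is a minimal sketch, assuming the decoupled form in which the decay is scaled by the learning rate (the convention used by AdamW-style optimizers), applied here to plain SGD for simplicity. The summary does not say which optimizer or decay variant the study used, and the function name is hypothetical.

```python
def sgd_step_with_weight_decay(weights, grads, lr, weight_decay):
    """One SGD step with decoupled weight decay (illustrative sketch).

    Each weight is shrunk by the factor (1 - lr * weight_decay) and then
    moved along the negative gradient. A larger weight_decay pulls the
    weights toward zero more strongly at every step.
    """
    shrink = 1.0 - lr * weight_decay
    return [w * shrink - lr * g for w, g in zip(weights, grads)]


# With zero gradients, the only change comes from the decay term.
weights = [1.0, -2.0, 0.5]
zero_grads = [0.0, 0.0, 0.0]
decayed = sgd_step_with_weight_decay(weights, zero_grads, lr=0.1, weight_decay=0.1)
print(decayed)
```

With zero gradients each weight is simply multiplied by 0.99, so its magnitude decreases while its sign is preserved.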
For the AI development community, this work has immediate practical implications. Teams training foundation models may need to recalibrate weight decay settings away from configurations optimized purely for validation loss, potentially unlocking significant performance gains in downstream applications. Because weight decay is set at pretraining time, the recommendation applies to new pretraining runs rather than to existing checkpoints. It particularly benefits organizations deploying models across diverse tasks, where adaptation capability matters more than raw pretraining metrics.
Looking forward, researchers should investigate whether other hyperparameters and model architectures show similar plasticity effects. The work suggests that optimization research should systematically evaluate both pretraining performance and transfer learning capability, which could reveal comparable counterintuitive trade-offs in other areas of deep learning.
- Weight decay during pretraining improves model plasticity, enabling better downstream task performance despite weaker base model validation loss
- Current hyperparameter optimization for language models overlooks adaptability by focusing exclusively on validation loss metrics
- Weight decay mechanistically promotes linear feature separability, regularizes attention patterns, and reduces training data overfitting
- Higher weight decay can produce paradoxical results where worse pretraining performance leads to superior fine-tuning outcomes
- Plasticity should be considered alongside validation loss when optimizing foundation model pretraining strategies