y0news

Critical Windows of Complexity Control: When Transformers Decide to Reason or Memorize

arXiv – CS AI | Sarwan Ali
🤖 AI Summary

Researchers identify a critical training window in which Transformer models commit to either memorization or reasoning, finding that applying weight decay during a specific 25% phase of training matches full-training regularization on compositional tasks. The window's boundaries are sharp: shifting its onset by just 100 optimization steps swings accuracy from chance performance to robust reasoning.

Analysis

This research addresses a fundamental question in deep learning: how do neural networks choose between memorizing training data and learning generalizable reasoning patterns? The findings reveal that complexity control—achieved through weight decay regularization—operates as a dynamic phenomenon rather than a static parameter choice, with profound implications for model training strategies.

The work builds on recent progress showing that Transformer compositional generalization depends critically on initialization scale and regularization choices. However, previous analyses treated these decisions as uniform throughout training. This study demonstrates that the memorization-versus-reasoning outcome crystallizes within a narrow, identifiable window typically occurring mid-training. Remarkably, applying weight decay only during a 25% training window achieves comparable out-of-distribution accuracy (0.93) to full-training regularization (0.91), suggesting massive inefficiency in conventional approaches.
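The window-limited regularization described above can be sketched as a simple weight-decay schedule. This is a minimal illustration, not the paper's implementation: the onset fraction, base decay coefficient, and total step count here are placeholder values, with only the 25% window length taken from the reported result.

```python
# Hypothetical sketch of window-limited weight decay scheduling.
# Only WINDOW_LEN_FRAC (the 25% window) comes from the paper;
# the other constants are illustrative placeholders.

TOTAL_STEPS = 10_000
WINDOW_START_FRAC = 0.40   # assumed mid-training onset, as a fraction of training
WINDOW_LEN_FRAC = 0.25     # the 25% window reported in the paper
BASE_WEIGHT_DECAY = 0.1    # illustrative coefficient

def weight_decay_at(step: int) -> float:
    """Return the weight decay coefficient for a given optimization step.

    Weight decay is nonzero only inside the critical window; outside it,
    training proceeds unregularized.
    """
    start = int(WINDOW_START_FRAC * TOTAL_STEPS)
    end = start + int(WINDOW_LEN_FRAC * TOTAL_STEPS)
    return BASE_WEIGHT_DECAY if start <= step < end else 0.0
```

In a real training loop this value would be pushed into the optimizer each step (for example, by updating `param_group["weight_decay"]` on a PyTorch `AdamW` instance); the paper's sensitivity result suggests the choice of `WINDOW_START_FRAC` would need careful, task-specific tuning.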

The sharpness of the critical window creates both opportunities and challenges. Shifting the window onset by merely 100 steps produces out-of-distribution accuracy swings from 0.15 (chance) to 0.61 (reasoning regime), indicating an almost discontinuous phase transition in model behavior. This sensitivity suggests neural networks undergo qualitative reorganization at specific training stages. Counterintuitively, smaller initialization scales, typically recommended for stability, actually shrink the basin of attraction for reasoning solutions, contradicting prevailing best practices.

The phenomenon's task-specificity—absent in modular arithmetic grokking tasks—indicates these principles operate selectively rather than universally. For practitioners, this suggests optimal training requires understanding task-specific critical windows rather than applying uniform regularization schedules. Future work should identify whether window positions can be predicted from task properties, potentially enabling automated curriculum design and computational efficiency improvements.

Key Takeaways
  • Weight decay applied only during a 25% mid-training window matches full-training weight decay effectiveness, revealing massive regularization inefficiency
  • The critical window for memorization-versus-reasoning decisions exhibits sharp boundaries where 100-step timing shifts cause out-of-distribution accuracy to swing from chance to reasoning performance
  • Smaller initialization scales contradict conventional wisdom by shrinking the reasoning solution basin rather than improving model robustness
  • Critical windows are task-specific, appearing in compositional generalization but not in modular arithmetic grokking, suggesting phenomenon selectivity
  • Optimal training requires understanding task-specific critical windows rather than applying uniform regularization, enabling computational efficiency improvements