AI · Neutral · arXiv – CS AI · 8h ago · 7/10
Universal One-third Time Scaling in Learning Peaked Distributions
Researchers show that the slow power-law convergence observed during large language model training stems from the softmax and cross-entropy operations when the model learns peaked distributions. The resulting time-scaling exponent of 1/3 is universal and reflects an intrinsic optimization bottleneck, which could help explain neural scaling laws and guide more efficient training methods.