AINeutralarXiv – CS AI · 9h ago6/10
🧠
Tuning the Implicit Regularizer of Masked Diffusion Language Models: Enhancing Generalization via Insights from $k$-Parity
Researchers demonstrate that Masked Diffusion Language Models fundamentally alter neural network learning dynamics on the k-parity problem, eliminating the typical grokking phenomenon and enabling faster generalization. By decomposing the MD objective into signal and noise regimes, they optimize mask probability distribution, achieving up to 8.8% performance improvements on 50M-parameter models and 5.8% gains on 8B-parameter models.
🏢 Perplexity