AI · Neutral · arXiv – CS AI · 18h ago · 6/10
Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them
Researchers find that data mixture optimization for AI model pre-training fails to transfer across scales because of 'repetition mismatch': when high-quality datasets are small, the number of times they are repeated changes as the training budget grows, which invalidates conclusions drawn from small-scale experiments. A subsampling procedure that controls for the target repetition rates enables accurate mixture prediction using only 1/16 of the tokens, whereas traditional methods require 44-94% of the full budget.
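The summary does not spell out the subsampling procedure, so the following is only a minimal sketch of the arithmetic behind repetition mismatch. It assumes a dataset's repetition count is its mixture weight times the token budget divided by the dataset size, and that the small run matches the target run by shrinking the dataset in proportion to the budget. The function names and all numbers are hypothetical, chosen for illustration rather than taken from the paper.

```python
def repetitions(weight, budget_tokens, dataset_tokens):
    """Passes over a dataset, given its mixture weight and the token budget."""
    return weight * budget_tokens / dataset_tokens


def subsample_size(dataset_tokens, small_budget, target_budget):
    """Tokens to keep so the small run repeats the data as often as the target run."""
    return dataset_tokens * small_budget / target_budget


# Hypothetical numbers, for illustration only (not from the paper).
dataset_tokens = 10e9              # small high-quality dataset
weight = 0.25                      # its share of the training mixture
target_budget = 160e9              # full-scale training budget
small_budget = target_budget / 16  # proxy run at 1/16 of the tokens

# Repetition at full scale.
target_reps = repetitions(weight, target_budget, dataset_tokens)

# Naive proxy: same dataset, smaller budget, so far fewer repetitions.
naive_reps = repetitions(weight, small_budget, dataset_tokens)

# Subsampled proxy: shrink the dataset so repetitions match the target.
kept_tokens = subsample_size(dataset_tokens, small_budget, target_budget)
matched_reps = repetitions(weight, small_budget, kept_tokens)

print(f"target: {target_reps:.2f} passes")
print(f"naive proxy: {naive_reps:.2f} passes")
print(f"subsampled proxy: {matched_reps:.2f} passes over {kept_tokens:.3g} tokens")
```

Under these assumptions the naive proxy sees the small dataset far less often than the full-scale run would, while the subsampled proxy reproduces the target repetition count at a fraction of the budget.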