
Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them

arXiv – CS AI | Kevin Zhou, Lisa Alazraki, Kris Cao, Marek Rei
AI Summary

Researchers identify that data mixture optimization for AI model pre-training fails at scale due to 'repetition mismatch': when high-quality datasets are small, the number of times they are repeated changes as training budgets grow, which invalidates small-scale experiments. A subsampling procedure that matches the target repetition rates enables accurate mixture prediction from experiments using only 1/16 of the target token budget, versus traditional methods that require 44-94% of it.

Analysis

This research addresses a fundamental inefficiency in large language model development: the inability to predict optimal data mixtures without consuming massive computational budgets. The core problem emerges when researchers run small-scale experiments to decide which data sources should make up the training set, then extrapolate the results to larger budgets. When a premium dataset is small, it must be repeated to fill its share of the mixture, and the number of repetitions grows with the training budget. A small-scale experiment therefore sees that dataset far fewer times than the target run will, so its conclusions do not carry over.
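The arithmetic behind the mismatch can be sketched in a few lines of Python. This is only an illustration: the function name `repetitions`, the source size, the mixture weight, and the budgets are all hypothetical values chosen for the example, not figures from the paper.

```python
def repetitions(weight: float, budget_tokens: float, source_tokens: float) -> float:
    """Average number of passes over a source that supplies `weight` of the budget."""
    return weight * budget_tokens / source_tokens


# Hypothetical numbers, chosen only for illustration.
SOURCE_TOKENS = 1_000_000_000        # size of the small high-quality source
WEIGHT = 0.25                        # its share of the training mixture
TARGET_BUDGET = 64_000_000_000       # full-scale token budget
SMALL_BUDGET = TARGET_BUDGET // 16   # a 1/16-scale experiment

small_reps = repetitions(WEIGHT, SMALL_BUDGET, SOURCE_TOKENS)
target_reps = repetitions(WEIGHT, TARGET_BUDGET, SOURCE_TOKENS)
print(f"small-scale passes: {small_reps:g}, target passes: {target_reps:g}")
```

With these assumed numbers, the small experiment makes one pass over the source while the target run makes 16, even though the mixture weight is identical in both.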

The repetition mismatch phenomenon has practical consequences for AI labs and companies developing foundation models. High-quality training data remains scarce and expensive, while web-crawled data is abundant but lower quality. Current practice requires running three to four separate experiments at different scales to find optimal mixtures, consuming up to 94% of the target token budget before the final training run even begins. That compute goes to experimentation rather than to the model itself.

The proposed solution, matching the repetition rates in small-scale experiments to those of the target run, dramatically improves efficiency. For two-source mixtures, a single 1/16-scale experiment with repetition control reduces error from 0.75 to 0.05, effectively eliminating extrapolation failures. Even with three data sources, two controlled experiments outperform traditional baselines requiring full two-source testing.
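One way to picture repetition matching is to shrink the small dataset by the same factor as the budget, so the small run repeats its subsample as often as the target run repeats the full source. The sketch below assumes this simple proportional rule. The paper's actual subsampling procedure may differ, and the names `matched_subsample_tokens` and `repetitions`, along with all the numbers, are hypothetical.

```python
def repetitions(weight: float, budget_tokens: float, source_tokens: float) -> float:
    """Average number of passes over a source that supplies `weight` of the budget."""
    return weight * budget_tokens / source_tokens


def matched_subsample_tokens(source_tokens: int, small_budget: int, target_budget: int) -> int:
    """Pool size for the small run so that, at the same mixture weight, it is
    repeated as often as the full source is in the target run.
    Assumed proportional rule, not necessarily the paper's procedure."""
    return source_tokens * small_budget // target_budget


# Hypothetical numbers, chosen only for illustration.
SOURCE_TOKENS = 1_000_000_000
WEIGHT = 0.25
TARGET_BUDGET = 64_000_000_000
SMALL_BUDGET = TARGET_BUDGET // 16

pool = matched_subsample_tokens(SOURCE_TOKENS, SMALL_BUDGET, TARGET_BUDGET)
matched_reps = repetitions(WEIGHT, SMALL_BUDGET, pool)
target_reps = repetitions(WEIGHT, TARGET_BUDGET, SOURCE_TOKENS)
print(f"subsample: {pool:,} tokens, passes: {matched_reps:g} (target: {target_reps:g})")
```

Under these assumptions the small run trains on a 62,500,000-token subsample and repeats it 16 times, the same repetition count the full source would see at the target budget.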

This work influences how organizations allocate resources toward model pre-training. By reducing the experimentation overhead, labs can achieve better data mixture optimization with lower computational costs, freeing resources for other development priorities. The research establishes data repetition as a critical variable worthy of explicit control rather than treating it as an artifact of limited data availability.

Key Takeaways
  • Small-scale mixture experiments fail to extrapolate primarily because repetition rates change with the training budget, not because of the scale difference alone.
  • Repetition-controlled subsampling enables accurate mixture optimization using only 6.25% of the target training budget versus 44-94% for conventional methods.
  • The approach generalizes across multiple data source configurations and model scales, suggesting broad applicability in LLM development.
  • Data repetition should be treated as an explicit optimization variable rather than an inconvenient side effect of limited data availability.
  • Organizations can significantly reduce pre-training experimentation costs while improving mixture accuracy by implementing repetition control protocols.