Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training
Researchers propose DeMix, a framework that uses model merging to determine optimal data mixtures for large language model pre-training without expensive repeated training cycles. The approach decouples the mixture search from training cost, so many candidate data combinations can be evaluated cheaply. The authors also release a 22T-token dataset to support open research.
DeMix addresses a fundamental challenge in LLM development: discovering the right balance of training data across different domains. Traditionally, researchers either conduct small-scale proxy experiments with unreliable results or undertake prohibitively expensive large-scale explorations. This framework transforms the problem by training component models once on individual datasets, then using weighted model merging to simulate how different mixture ratios would perform without retraining.
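The merging step described above can be illustrated with a minimal sketch. It assumes the simplest form of weighted merging, a linear average of parameters with the candidate mixture ratios as weights; the summary does not specify DeMix's exact merging rule, so this is an illustration and not the authors' implementation. The names `Params`, `merge_models`, `web_model`, and `code_model` are hypothetical, and plain Python lists stand in for real weight tensors.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for a model's named weight tensors.
Params = Dict[str, List[float]]


def merge_models(components: Sequence[Params], mixture: Sequence[float]) -> Params:
    """Linearly average component models, weighted by a candidate data mixture.

    Assumption: every component shares the same parameter names and shapes,
    e.g. because all were trained from the same architecture and starting point.
    """
    if not components or len(components) != len(mixture):
        raise ValueError("need exactly one mixture weight per component model")
    if any(w < 0 for w in mixture) or sum(mixture) <= 0:
        raise ValueError("mixture weights must be non-negative and not all zero")

    # Normalize so the weights form a proper mixture that sums to 1.
    total = sum(mixture)
    norm = [w / total for w in mixture]

    merged: Params = {}
    for name, first in components[0].items():
        merged[name] = [
            sum(w * comp[name][i] for w, comp in zip(norm, components))
            for i in range(len(first))
        ]
    return merged


# Toy component models, each imagined as trained once on a single dataset.
web_model = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
code_model = {"layer.weight": [3.0, 6.0], "layer.bias": [1.0]}

# Proxy for a 75% web / 25% code mixture, built without any further training.
proxy = merge_models([web_model, code_model], [3, 1])
print(proxy)
```

Each new candidate ratio costs only another call to `merge_models` followed by an evaluation of the merged proxy, which is the sense in which the search is separated from training.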
The approach fits within a broader trend of computational efficiency in AI research. As LLM training becomes increasingly expensive, techniques that reduce redundant computation gain significant value. Model merging itself has emerged as a powerful tool in recent years, allowing researchers to combine knowledge from different models without full retraining. DeMix extends this concept into the data mixture optimization domain, creating a new paradigm that separates the search phase from the training phase.
For developers and organizations building LLMs, this methodology could substantially reduce development costs and timelines. Companies can explore more mixture combinations systematically before committing to full-scale training runs. The release of the DeMix Corpora, a comprehensive 22T-token dataset with validated mixtures, broadens access to high-quality training data and could accelerate development across the industry.
The framework's success depends on how well model merging predictions correlate with actual training outcomes at scale. Future work should validate whether these proxy predictions maintain accuracy across diverse model architectures and dataset compositions, and whether the optimal mixtures discovered through DeMix transfer effectively to different model sizes and domains.
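One way to check how well proxy predictions track real outcomes is a rank correlation between the scores of merged proxies and the scores of models actually trained on the same mixtures. The sketch below computes Spearman's rank correlation in pure Python. The choice of metric is an assumption for illustration, not something the summary attributes to the authors, and the score lists are made-up placeholders, not reported results.

```python
import math
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(ranks(x), ranks(y))


# Placeholder benchmark scores for four candidate mixtures (illustrative only).
proxy_scores = [0.41, 0.45, 0.39, 0.48]    # merged proxy models
trained_scores = [0.52, 0.55, 0.50, 0.61]  # models actually trained on each mixture

rho = spearman(proxy_scores, trained_scores)
print(f"Spearman rank correlation: {rho:.2f}")
```

A value near 1 would mean the proxies order the mixtures the same way full training does, which matters more for mixture selection than matching absolute scores.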
- DeMix decouples data mixture search from training costs, using model merging to evaluate multiple combinations without retraining
- The framework enables more comprehensive exploration of data ratios while reducing computational expense compared to traditional approaches
- The newly released 22T-token DeMix Corpora provides high-quality training data with validated mixture ratios for open research
- The methodology balances general competence with specialized performance on hard tasks such as mathematics and code generation
- Because each candidate mixture can be evaluated without retraining, researchers can test far more mixtures at lower search cost and reach better-performing results