Mix, Don't Pick: Why Synthetic Corpus Composition Matters for Time Series Foundation Model Pretraining
Researchers demonstrate that the composition of synthetic data significantly affects foundation model pretraining for time series forecasting, with up to a 2× gap in forecasting error between the best and worst generators. Rather than selecting a single generator, an equal-weight mixture of all generators matches or exceeds the best individual generator on both tested architectures, suggesting that corpus composition matters more than generator selection.
The research addresses a fundamental challenge in developing time series foundation models: the unpredictable performance of synthetic data generators during pretraining. Traditional approaches assume that finding the optimal generator yields the best results, but this study shows that generator effectiveness varies dramatically across model architectures, so no single generator can be recommended universally. The findings shift the problem from generator selection to corpus composition strategy.
The rankings are unstable across architectures: a generator that performs best for one model can perform poorly for another. This reflects the complex interaction between synthetic data characteristics and model learning dynamics. The architectural dependency suggests that the quality of any one generator may matter less than diversity within the training corpus. The simplicity of the equal-weight mixture is particularly noteworthy: it requires no complex optimization and no architecture-specific tuning.
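To make the equal-weight idea concrete, here is a minimal sketch of how such a mixture could be assembled. It is not the study's pipeline: the three generators (`sine_series`, `random_walk_series`, `trend_noise_series`) and the function `equal_weight_mixture` are hypothetical, and the sketch assumes "equal weight" means an equal share of series per generator, which the summary does not specify.

```python
import math
import random
from typing import Callable, Dict, List, Tuple

# A generator maps (rng, length) to one synthetic series.
Generator = Callable[[random.Random, int], List[float]]


def sine_series(rng: random.Random, length: int) -> List[float]:
    """Hypothetical generator: sine wave with random period and phase."""
    period = rng.uniform(8.0, 64.0)
    phase = rng.uniform(0.0, 2.0 * math.pi)
    return [math.sin(2.0 * math.pi * t / period + phase) for t in range(length)]


def random_walk_series(rng: random.Random, length: int) -> List[float]:
    """Hypothetical generator: Gaussian random walk."""
    value, out = 0.0, []
    for _ in range(length):
        value += rng.gauss(0.0, 1.0)
        out.append(value)
    return out


def trend_noise_series(rng: random.Random, length: int) -> List[float]:
    """Hypothetical generator: linear trend plus Gaussian noise."""
    slope = rng.uniform(-0.1, 0.1)
    return [slope * t + rng.gauss(0.0, 0.5) for t in range(length)]


# Illustrative registry; a real corpus would plug in its own generators.
GENERATORS: Dict[str, Generator] = {
    "sine": sine_series,
    "random_walk": random_walk_series,
    "trend_noise": trend_noise_series,
}


def equal_weight_mixture(
    generators: Dict[str, Generator],
    n_series: int,
    length: int,
    seed: int = 0,
) -> List[Tuple[str, List[float]]]:
    """Build a corpus in which every generator contributes an equal share.

    Round-robin assignment keeps per-generator counts within one of each
    other (exactly equal when n_series is a multiple of the generator count).
    """
    rng = random.Random(seed)
    names = sorted(generators)  # fixed order for reproducibility
    corpus = []
    for i in range(n_series):
        name = names[i % len(names)]
        corpus.append((name, generators[name](rng, length)))
    rng.shuffle(corpus)  # avoid presenting generators in a fixed cycle
    return corpus
```

The point of the sketch is that the mixture has no tunable weights: adding a generator only means adding an entry to the registry.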
For practitioners developing time series models, this shifts the practical guidance. Rather than investing engineering effort in identifying the single best generator, teams should allocate resources toward incorporating diverse synthetic sources. Combining mixed synthetic data with real data creates robust pretraining corpora that generalize better than single-source approaches.
The implications extend to foundation model development more broadly. The results suggest that pretraining corpus composition may be as crucial as model architecture or training procedure, a dimension often overlooked in benchmarking studies. Going forward, model evaluation frameworks should explicitly validate composition choices per architecture family rather than assuming that they transfer. This could reshape how organizations approach synthetic data strategy and pretraining pipeline design.
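A per-architecture validation step could be as simple as the following sketch. It assumes you already have a table of forecasting errors (lower is better) for each architecture and each pretraining corpus, including one entry for the mixture. The table layout, the `"mixture"` key, and all function names are hypothetical and not taken from the study.

```python
from typing import Dict

# architecture name -> corpus name -> forecasting error (lower is better)
ErrorTable = Dict[str, Dict[str, float]]


def best_single_generator(
    errors: ErrorTable, mixture_key: str = "mixture"
) -> Dict[str, str]:
    """Return the lowest-error individual generator for each architecture."""
    return {
        arch: min((name for name in row if name != mixture_key), key=row.get)
        for arch, row in errors.items()
    }


def choice_transfers(errors: ErrorTable, mixture_key: str = "mixture") -> bool:
    """True only if the same generator is best on every architecture."""
    winners = best_single_generator(errors, mixture_key)
    return len(set(winners.values())) == 1


def mixture_matches_best(
    errors: ErrorTable, mixture_key: str = "mixture", tolerance: float = 0.0
) -> Dict[str, bool]:
    """Per architecture: is the mixture within `tolerance` (relative) of the
    best individual generator's error?"""
    result = {}
    for arch, row in errors.items():
        best = min(err for name, err in row.items() if name != mixture_key)
        result[arch] = row[mixture_key] <= best * (1.0 + tolerance)
    return result
```

Running `choice_transfers` before reusing a generator choice on a new model family makes the transferability assumption an explicit, testable check rather than a default.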
- Generator choice produces up to a 2× gap in forecasting error between the best and worst generators, and the best generator differs across model architectures
- An equal-weight mixture of all synthetic generators matches or exceeds the best individual generator on both tested architectures
- Synthetic pretraining should be treated as corpus composition optimization rather than individual generator selection
- Composition choices require per-architecture validation and do not transfer across model families
- Combining mixed synthetic data with real data produces the strongest pretraining corpora overall