Towards Engineering Scaling Laws with Pretraining Data Composition
Researchers demonstrate that neural scaling laws in particle physics can be engineered by optimizing pretraining data composition, shifting computational requirements toward larger datasets rather than bigger models. The study shows that pretraining on more diverse, task-aligned synthetic data from physics simulators improves scaling efficiency for hadronic jet classification, offering a template for other domains with access to high-fidelity generative systems.
This research addresses a fundamental challenge in machine learning: optimizing the trade-off between model size and dataset size when scaling systems. While scaling laws are well understood in language models, particle physics offers a distinct advantage: access to cheap synthetic data from precise physics simulators. The researchers exploit this by engineering the composition of the pretraining data itself, demonstrating that careful dataset curation can shift scaling dynamics favorably.
The work builds on established scaling law research that emerged from large language models, where performance follows predictable power-law relationships with compute, parameter count, and dataset size. However, simulated physics data differs structurally from natural language or images: researchers can directly control its quality and diversity rather than scraping internet text. The team applied this insight to hadronic jet classification, a particle physics task where careful data selection improved downstream performance more efficiently than simply increasing model parameters.
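To make the power-law idea concrete, here is a minimal sketch of how a data-scaling exponent could be estimated from loss measurements. Everything in it is an assumption for illustration: the function name `fit_power_law`, the functional form loss = a * size^(-b), and the numbers, which are generated from exact power laws and are not results from the study.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-b), done as a line fit in log-log space."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # In log space: log(loss) = log(a) - b * log(size)
    return math.exp(intercept), -slope

# Hypothetical dataset sizes and losses (synthetic, NOT from the paper).
dataset_sizes = [1e4, 1e5, 1e6, 1e7]
loss_baseline = [2.0 * d ** -0.10 for d in dataset_sizes]  # assumed baseline mixture
loss_curated = [2.0 * d ** -0.15 for d in dataset_sizes]   # assumed curated mixture

a0, b0 = fit_power_law(dataset_sizes, loss_baseline)
a1, b1 = fit_power_law(dataset_sizes, loss_curated)
print(f"baseline data exponent: {b0:.3f}")
print(f"curated data exponent:  {b1:.3f}")
```

In this toy setup, a larger fitted exponent means loss falls faster as data is added, which is the kind of shift the article describes as engineering the scaling law. Real measurements would be noisy and may need an irreducible-loss term, which this sketch omits.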
This methodology could extend beyond particle physics. Any domain with access to high-fidelity simulators, such as robotics, climate modeling, or materials science, could benefit from data-composition engineering. The approach offers practical value for organizations constrained by computational budgets or infrastructure limitations; scaling through better data may prove cheaper and faster than scaling through hardware. The findings also suggest that scaling laws are not immutable physical constraints but can be reshaped through engineering choices.
Future research should explore whether data-composition strategies developed in physics transfer to other simulation-rich domains. Understanding which data characteristics most influence scaling could unlock more efficient training paradigms across AI applications.
- Scaling laws in physics can be engineered toward data efficiency by optimizing pretraining data composition rather than simply increasing model size.
- High-fidelity physics simulators enable cheap synthetic data generation, creating a different scaling regime than in language or vision domains.
- Diverse, task-aligned pretraining data improved scaling efficiency for hadronic jet classification more than increasing model size did.
- The methodology could generalize to other simulation-rich domains like robotics, climate modeling, and materials science.
- Data-composition engineering offers a cost-effective alternative to hardware scaling for resource-constrained AI development.