Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality
Researchers propose WebGraphMix, a data selection framework that uses web graph centrality scores to select pretraining data for language models without labeled data or auxiliary classifiers. In tests on models up to 1B parameters, mixing central and peripheral web regions in a 1:1 ratio reaches 41.4%, versus 39.8% for uniform sampling, suggesting that web topology captures knowledge complementary to what content-based approaches capture.
WebGraphMix addresses a fundamental challenge in large language model development: determining which pretraining data produces optimal model performance. Rather than relying on computationally expensive auxiliary classifiers or quality scoring systems, the researchers exploit the structural properties of the Common Crawl web graph itself. This approach reflects a broader shift toward leveraging inherent properties of training data rather than adding external models to curate it.
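To make the idea of a host-level centrality score concrete, here is a minimal sketch that scores hosts in a toy link graph. It uses PageRank purely as an assumed stand-in, since the summary does not say which centrality measure WebGraphMix relies on. The host names and edges are hypothetical.

```python
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a directed host graph.

    `edges` is a list of (source_host, target_host) links. PageRank is an
    assumed stand-in here; the actual centrality measure may differ.
    """
    hosts = sorted({h for edge in edges for h in edge})
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(hosts)
    rank = {h: 1.0 / n for h in hosts}
    for _ in range(iterations):
        # Rank held by hosts with no outgoing links is spread evenly.
        dangling = sum(rank[h] for h in hosts if not out_links[h])
        new_rank = {h: (1.0 - damping) / n + damping * dangling / n for h in hosts}
        for src in hosts:
            targets = out_links[src]
            for dst in targets:
                # Each host splits its rank equally among the hosts it links to.
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank


# Hypothetical toy graph: three hosts link to hub.example, and
# blog-b.example receives no inbound links at all.
TOY_EDGES = [
    ("blog-a.example", "hub.example"),
    ("blog-b.example", "hub.example"),
    ("niche.example", "hub.example"),
    ("hub.example", "blog-a.example"),
    ("blog-a.example", "niche.example"),
]

scores = pagerank(TOY_EDGES)
for host, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{host}\t{score:.3f}")
```

The scores come from link structure alone, with no classifier and no labels, which is the property the framework depends on. At web scale one would presumably start from precomputed host-level scores rather than iterate over the graph in plain Python.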
The distinction between central and peripheral web hosts maps to an intuitive hypothesis: central hosts encode broad, reusable abstractions, while peripheral hosts preserve specialized, long-tail knowledge. The empirical validation across 23 tasks supports this complementarity, with the 1:1 mixture outperforming uniform sampling. Performance rises further, to 43.8%, when the mixture is combined with document-level quality scores. This indicates that structural and content-based signals are largely independent and leaves room for further optimization through multi-axis curation strategies.
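The 1:1 mixture can be sketched as follows, assuming host centrality scores are already available. The 20% cutoff for "central" and "peripheral" hosts, the choice to balance by document count rather than tokens, and the function names are all illustrative assumptions. The summary does not specify how the two regions are delimited.

```python
import random


def split_hosts(centrality, fraction=0.2):
    """Return (central, peripheral) host sets: the top and bottom `fraction` by score.

    The 20% default is an illustrative assumption, not a value from the source.
    """
    ranked = sorted(centrality, key=centrality.get, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return set(ranked[:k]), set(ranked[-k:])


def mix_one_to_one(docs, centrality, n_total, fraction=0.2, seed=0):
    """Sample `n_total` documents, half from central and half from peripheral hosts.

    `docs` is a list of (host, text) pairs. Sampling is without replacement,
    so each pool must hold at least n_total // 2 documents.
    """
    central, peripheral = split_hosts(centrality, fraction)
    central_docs = [d for d in docs if d[0] in central]
    peripheral_docs = [d for d in docs if d[0] in peripheral]
    rng = random.Random(seed)
    half = n_total // 2
    sample = rng.sample(central_docs, half) + rng.sample(peripheral_docs, half)
    rng.shuffle(sample)
    return sample


# Hypothetical toy corpus: 10 hosts with decreasing centrality, 5 documents each.
hosts = [f"host{i}.example" for i in range(10)]
centrality = {host: 1.0 / (i + 1) for i, host in enumerate(hosts)}
docs = [(host, f"doc {j} from {host}") for host in hosts for j in range(5)]

mixture = mix_one_to_one(docs, centrality, n_total=8)
central_hosts, peripheral_hosts = split_hosts(centrality)
n_central = sum(1 for host, _ in mixture if host in central_hosts)
print(f"{n_central} central, {len(mixture) - n_central} peripheral")
```

Hosts in the middle of the ranking are simply left out in this sketch. Whether the actual method discards them or assigns them to one side is not stated in the summary.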
For AI development stakeholders, this work lowers the computational barrier to effective data curation. Organizations can use web graph topology without training additional models or sourcing labeled datasets. This makes data selection more accessible, enabling smaller teams to achieve performance competitive with larger-scale approaches. The efficiency gains compound across thousands of training runs, making the technique particularly valuable as model scale continues to increase.
- WebGraphMix uses web graph centrality to select pretraining data without model training or labeled supervision, reducing computational overhead
- Central and peripheral web regions encode complementary capabilities, with 1:1 mixture achieving 41.4% performance versus 39.8% for uniform sampling
- Web topology captures information orthogonal to content-based quality signals, enabling additive improvements when combined
- The framework scales efficiently to web-scale data selection without downstream task supervision
- Results suggest structural properties of training data deserve equal attention to content analysis in curation strategies