AI · Neutral · arXiv – CS AI · 7h ago · 6/10
Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality
Researchers propose WebGraphMix, a data selection framework that uses web graph centrality scores to choose pretraining data for language models, without requiring labeled data or auxiliary classifiers. In tests on models of up to 1B parameters, mixing central and peripheral web regions in a 1:1 ratio reaches 41.4%, versus 39.8% for uniform sampling. The result suggests that web topology captures a signal complementary to content-based approaches.
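
The 1:1 mix of central and peripheral regions can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `mix_hubs_and_fringes`, the `tail_fraction` cutoff that defines "central" and "peripheral", and the assumption that each document already carries a precomputed centrality score are all hypothetical choices made for illustration.

```python
import random


def mix_hubs_and_fringes(docs, centrality, budget, tail_fraction=0.25, seed=0):
    """Sample `budget` documents, half from the most central pages ("hubs")
    and half from the least central pages ("fringes").

    Hypothetical helper: `tail_fraction` (the share of the corpus treated as
    each tail) is an assumed parameter, not a value from the paper.
    """
    if len(docs) != len(centrality):
        raise ValueError("docs and centrality must have the same length")
    if not 0 < tail_fraction <= 0.5:
        raise ValueError("tail_fraction must be in (0, 0.5] so the pools do not overlap")

    # Rank document indices from least to most central.
    order = sorted(range(len(docs)), key=lambda i: centrality[i])
    pool_size = max(1, int(len(docs) * tail_fraction))
    fringe_pool = order[:pool_size]
    hub_pool = order[-pool_size:]

    # 1:1 ratio: the same number of documents from each pool.
    per_side = budget // 2
    if per_side > pool_size:
        raise ValueError("budget is too large for the chosen tail_fraction")

    rng = random.Random(seed)
    hubs = [docs[i] for i in rng.sample(hub_pool, per_side)]
    fringes = [docs[i] for i in rng.sample(fringe_pool, per_side)]
    return hubs, fringes


# Toy usage: 100 placeholder documents whose centrality equals their index.
toy_docs = [f"doc-{i}" for i in range(100)]
toy_scores = [float(i) for i in range(100)]
hubs, fringes = mix_hubs_and_fringes(toy_docs, toy_scores, budget=20)
```

Note that the selection reads only the graph-derived scores and never inspects document text, which is why such a scheme needs no labels or auxiliary classifier. How the centrality scores are computed is left outside this sketch.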