
Data Darwinism Part II: DataEvolve -- AI can Autonomously Evolve Pretraining Data Curation

arXiv – CS AI | Tiantian Mi, Dongming Shan, Zhen Huang, Yiwei Qin, Muhang Xie, Yuxuan Qiao, Yixiu Liu, Chenyang Zhou, Pengfei Liu
🤖 AI Summary

Researchers introduced DataEvolve, an AI framework that autonomously evolves data curation strategies for pretraining datasets through iterative optimization. The system processed 672B tokens to create the Darwin-CC dataset, which outperformed existing datasets such as DCLM and FineWeb-Edu when used to train 3B-parameter models.

Key Takeaways
  • DataEvolve automates the evolution of data curation strategies through closed-loop optimization, eliminating the need for manual design at scale.
  • The framework processed 8 categories spanning 672B tokens to produce Darwin-CC, a 504B-token optimized dataset.
  • Models trained on Darwin-CC outperformed models trained on the raw data by 3.96 points, reaching an average score of 44.13 across 18 benchmarks.
  • Evolved strategies converged on cleaning-focused approaches with targeted noise removal and domain-aware preservation.
  • Ablation studies confirmed that optimized strategies outperform suboptimal ones by 2.93 points, validating the evolutionary approach.
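
The closed-loop idea in the takeaways above (propose a curation strategy, apply it, score the result, keep the strategy only if the score improves) can be illustrated with a toy sketch. This is not the DataEvolve implementation: the summary does not describe its internals, so every name here (`NOISE_VOCAB`, `apply_strategy`, `evaluate`, `mutate`, `evolve`) is hypothetical, the "strategy" is just a set of noise markers to strip, and the score is a simple count of leftover noise rather than a benchmark result from a trained model.

```python
import random

# Hypothetical noise markers; a real pipeline would discover these from data.
NOISE_VOCAB = ["[AD]", "Cookie notice", "Share this"]


def apply_strategy(docs, strategy):
    """Strip every marker in the strategy, then normalize whitespace."""
    cleaned = []
    for doc in docs:
        for marker in strategy:
            doc = doc.replace(marker, " ")
        cleaned.append(" ".join(doc.split()))
    return cleaned


def evaluate(cleaned):
    """Toy quality proxy: negative count of noise markers left in the data."""
    leftover = sum(doc.count(m) for doc in cleaned for m in NOISE_VOCAB)
    return -leftover


def mutate(strategy, rng):
    """Propose a variant by toggling one randomly chosen marker."""
    candidate = set(strategy)
    candidate.symmetric_difference_update({rng.choice(NOISE_VOCAB)})
    return frozenset(candidate)


def evolve(docs, generations=200, seed=0):
    """Hill-climbing loop: keep a candidate only if it scores strictly higher."""
    rng = random.Random(seed)
    best = frozenset()
    best_score = evaluate(apply_strategy(docs, best))
    for _ in range(generations):
        candidate = mutate(best, rng)
        candidate_score = evaluate(apply_strategy(docs, candidate))
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


if __name__ == "__main__":
    sample = [
        "[AD] Transformers scale well. Share this",
        "Cookie notice Data quality matters.",
    ]
    strategy, score = evolve(sample)
    print(sorted(strategy), score)
    print(apply_strategy(sample, strategy))
```

In this sketch a mutation that removes a useful marker lowers the score and is rejected, so the strategy can only accumulate cleaning rules that help. That loosely mirrors the reported convergence on cleaning-focused strategies, under the stated toy assumptions.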