AIBullisharXiv โ CS AI ยท 7h ago7/10
๐ง
Data Darwinism Part II: DataEvolve -- AI can Autonomously Evolve Pretraining Data Curation
Researchers introduced DataEvolve, an AI framework that autonomously evolves data curation strategies for pretraining datasets through iterative optimization. The system processed 672B tokens to create Darwin-CC dataset, which achieved superior performance compared to existing datasets like DCLM and FineWeb-Edu when training 3B parameter models.