AI · Bullish · arXiv – CS AI · 10h ago · 7/10
MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training
Researchers developed MegaScale-Data, an industrial-grade distributed data loading architecture that improves training efficiency for large foundation models trained on multiple data sources. The system achieves up to 4.5x higher training throughput and a 13.5x reduction in CPU memory usage through disaggregated preprocessing and centralized data orchestration.