y0news
#dataset-construction · 1 article
AI · Neutral · Apple Machine Learning · Feb 24 · 5/10 · 3

Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pretraining

Researchers investigate whether relying on a single HTML-to-text extractor to build web-scale LLM pretraining datasets leaves usable data on the table. The study finds that different extractors can lead to substantially different sets of pages surviving a filtering pipeline, even though models trained on the resulting data perform similarly on standard language tasks.
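
To make the claim concrete, here is a minimal sketch in Python of how one might measure the overlap between the pages that survive filtering under two extractors. Everything in it is an assumption for illustration: `extractor_a`, `extractor_b`, `passes_filter`, and the toy `PAGES` are hypothetical stand-ins built on the standard library, not the extractors, filters, or data used in the study.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping everything inside the given tags."""

    def __init__(self, skip_tags=()):
        super().__init__()
        self.skip_tags = set(skip_tags)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def _extract(html, skip_tags):
    parser = _TextCollector(skip_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def extractor_a(html):
    # Hypothetical permissive extractor: keeps navigation and tables.
    return _extract(html, skip_tags=("script", "style"))


def extractor_b(html):
    # Hypothetical aggressive extractor: also drops navigation and tables.
    return _extract(html, skip_tags=("script", "style", "nav", "table"))


def passes_filter(text):
    # Toy quality filter (an assumption): enough words, no sign-up boilerplate.
    return len(text.split()) >= 6 and "sign up" not in text.lower()


# Toy corpus, invented for this sketch.
PAGES = {
    "news": "<body><nav>Log in Sign up</nav>"
            "<p>Different extractors keep different parts of the same page.</p></body>",
    "benchmarks": "<body><p>Results</p><table><tr>"
                  "<td>Model one scored higher than model two overall</td>"
                  "</tr></table></body>",
    "essay": "<body><p>Plain paragraphs survive under either extractor "
             "without any trouble.</p></body>",
    "stub": "<body><p>Hi</p></body>",
}


def survivors(extractor):
    """Return the ids of pages whose extracted text passes the filter."""
    return {pid for pid, html in PAGES.items() if passes_filter(extractor(html))}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


kept_a = survivors(extractor_a)
kept_b = survivors(extractor_b)
print("A keeps:", sorted(kept_a))
print("B keeps:", sorted(kept_b))
print("overlap (Jaccard):", round(jaccard(kept_a, kept_b), 3))
print("union size:", len(kept_a | kept_b))
```

In this toy setup each extractor rescues a page the other loses, so the union of survivors is larger than either set alone. That is the kind of gap the summary describes, although the real pipelines and their magnitudes are not reproduced here.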