Apple Machine Learning · Feb 24
Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pretraining
Researchers investigate whether relying on a single HTML-to-text extractor when building web-scale LLM pretraining datasets leads to suboptimal data utilization. The study finds that different extractors leave substantially different sets of pages surviving downstream filtering pipelines, even though the resulting models perform similarly on standard language tasks.
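The core observation can be illustrated with a toy sketch. The two extractors and the quality filter below are hypothetical stand-ins (not the paper's actual pipeline): a structure-aware extractor that skips boilerplate tags can yield text that passes a simple quality filter, while a naive tag-stripper keeps navigation and script residue and fails it, so different pages survive depending on the extractor.

```python
import re
from html.parser import HTMLParser

HTML = """<html><body>
<nav>Home | About</nav>
<p>Large language models are trained on web text.</p>
<script>var x = 1;</script>
</body></html>"""

class TextExtractor(HTMLParser):
    """Extractor A: skips <script>/<style>/<nav> subtrees, keeping body prose."""
    SKIP = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def extract_a(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

def extract_b(html):
    # Extractor B: naive tag stripping; keeps nav and script contents.
    return " ".join(re.sub(r"<[^>]+>", " ", html).split())

def survives_filter(text, min_alpha_ratio=0.9):
    # Toy quality filter: require most tokens to be alphabetic words.
    toks = text.split()
    alpha = sum(t.strip(".,;").isalpha() for t in toks)
    return bool(toks) and alpha / len(toks) >= min_alpha_ratio

text_a = extract_a(HTML)
text_b = extract_b(HTML)
print(survives_filter(text_a))  # the clean extraction passes the filter
print(survives_filter(text_b))  # the noisy extraction fails it
```

The same source page is kept by one extractor's pipeline and discarded by the other's, which is how extractor choice ends up shaping the surviving dataset.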