Researchers introduce DeGenTWeb, a systematic methodology for identifying websites dominated by LLM-generated content with minimal human input. The study reveals that LLM-dominant sites are significantly more prevalent across the web than previously understood, with detection accuracy declining as LLM capabilities improve, raising questions about content authenticity and search quality.
DeGenTWeb addresses a critical gap in understanding the true prevalence of AI-generated content online. Previous claims about LLM takeover lacked representative sampling and transparent methodology, leaving stakeholders uncertain about the actual scale of the problem. This research provides a rigorous framework for detecting and categorizing LLM-dominant websites at scale, revealing that these sites are far more common than widely reported, appearing both in Common Crawl datasets and Bing search results with growing prevalence over time.
The broader context reflects mounting concerns about content authenticity in an era of advanced generative AI. As LLMs become more sophisticated and accessible, the economic incentives for automated content generation—particularly for SEO manipulation, affiliate marketing, and low-effort publishing—have intensified. Search engines and content platforms face growing pressure to distinguish human-authored material from machine-generated alternatives.
For stakeholders ranging from search engines to content platforms to users, this research has significant implications. Search quality degrades when LLM-generated content dominates results, undermining user trust and platform credibility. Content creators and legitimate publishers face increased competition from low-cost automated alternatives, potentially reshaping content economics. The finding that detection becomes increasingly difficult with advancing LLM capabilities suggests this problem may accelerate faster than solutions can be developed.
The research points toward an arms race between detection and generation capabilities. As LLMs improve at mimicking human writing, maintaining accurate site-level categorization requires continually updated detection methods. Platforms may need to implement additional signals beyond text analysis—such as behavioral patterns, editorial practices, or cryptographic verification—to authenticate human-generated content reliably.
- LLM-dominant websites are significantly more prevalent on the web than previously documented, with growing prevalence in both Common Crawl and Bing search results.
- Current LLM detection methods perform substantially worse in practice than their reported benchmarks suggest, particularly when avoiding false positives on human-written content.
- The technical challenge of identifying LLM-generated content is becoming harder as latest-generation LLMs improve at mimicking human writing styles.
- Systematic detection at scale requires aggregating multiple page-level detections for accurate site-level categorization, not simple per-page analysis.
- The research reveals a critical blind spot in understanding web content authenticity, with implications for search quality and content platform credibility.
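The aggregation idea in the takeaways above can be sketched in a few lines. This is an illustrative assumption, not the paper's actual algorithm or parameters: a high per-page score threshold keeps false positives on human-written pages low, and a site is only labeled LLM-dominant when a majority of its sampled pages are flagged.

```python
def classify_site(page_scores, page_threshold=0.9,
                  site_fraction=0.5, min_pages=10):
    """Aggregate per-page detector scores into a site-level label.

    Hypothetical sketch: `page_threshold` is set high so individual
    human-written pages are rarely flagged; `site_fraction` requires
    a majority of sampled pages to be flagged before the whole site
    is labeled, and sites with too few sampled pages are skipped.
    """
    if len(page_scores) < min_pages:
        return "insufficient-data"
    flagged = sum(1 for s in page_scores if s >= page_threshold)
    if flagged / len(page_scores) >= site_fraction:
        return "llm-dominant"
    return "human"
```

Requiring agreement across many pages is what makes site-level labels more robust than any single page-level detection: a detector with a nontrivial per-page error rate is unlikely to mislabel a majority of a site's pages in the same direction.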