
ScrapeGraphAI-100k: Dataset for Schema-Constrained LLM Generation

arXiv – CS AI | William Brach, Francesco Zuppichini, Marco Vinciguerra, Lorenzo Padoan
🤖 AI Summary

Researchers introduce ScrapeGraphAI-100k, a large-scale dataset of 93,695 real-world schema-constrained extraction events collected from production use. The dataset addresses a critical gap in AI training by pairing actual web content with JSON schemas, prompts, and LLM responses, enabling better evaluation and training of models for structured data extraction tasks.

Analysis

Schema-constrained generation—the ability of language models to produce output conforming to a specified JSON structure—has become fundamental to AI tool use and data extraction workflows. Yet the field has lacked training datasets grounded in real-world usage patterns. ScrapeGraphAI-100k fills this gap with 93,695 deduplicated extraction events collected from production telemetry across Q2–Q3 2025, covering 18,000+ unique schemas in 15 languages. This is a significant methodological advance over prior synthetic or text-only corpora, which poorly reflect practitioner needs.

The dataset's construction reflects practical constraints faced by AI developers. Each instance includes Markdown-converted page content, the original prompt, target schema, the LLM's response, and structural conformance labels via jsonschema-rs validation. The corpus reveals important empirical findings: structural diversity varies considerably across schemas, and model performance degrades sharply as schema complexity increases—a pattern invisible in synthetic benchmarks. The researchers demonstrate this through a distillation case study where a 1.7B student model trained on the dataset approximates GPT-5-nano's output distribution, though it underperforms a 30B reference model with 3.3B active parameters on strict schema compliance.
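To make the instance structure concrete, here is a minimal sketch of what one record and its conformance label might look like. The field names, the example schema, and the hand-rolled validator are illustrative assumptions, not the dataset's actual format: the paper reports using jsonschema-rs for validation, whereas this toy check only handles flat object schemas with primitive types.

```python
import json

# Hypothetical record mirroring the fields described for each instance:
# Markdown page content, the original prompt, the target schema, and the
# LLM's response (field names are assumptions for illustration).
instance = {
    "content": "# ACME Widget\nPrice: $19.99\nIn stock: yes",
    "prompt": "Extract the product name, price, and availability.",
    "schema": {
        "type": "object",
        "required": ["name", "price", "in_stock"],
        "properties": {
            "name": {"type": "string"},
            "price": {"type": "number"},
            "in_stock": {"type": "boolean"},
        },
    },
    "response": '{"name": "ACME Widget", "price": 19.99, "in_stock": true}',
}

# Map JSON Schema primitive type names to Python types.
_TYPES = {"string": str, "number": (int, float), "boolean": bool, "object": dict}

def conforms(response_text: str, schema: dict) -> bool:
    """Toy structural-conformance label: does the response parse as JSON,
    contain every required key, and match the declared primitive types?"""
    try:
        parsed = json.loads(response_text)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    if any(key not in parsed for key in schema.get("required", [])):
        return False
    for key, spec in schema.get("properties", {}).items():
        if key in parsed and not isinstance(parsed[key], _TYPES[spec["type"]]):
            return False
    return True

print(conforms(instance["response"], instance["schema"]))  # True
print(conforms('{"name": "ACME Widget"}', instance["schema"]))  # False: missing keys
```

A full validator (nested objects, arrays, enums, formats) is what makes complex schemas hard for models, which is exactly the complexity-vs-performance pattern the dataset exposes.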

For the AI industry, this dataset enables more rigorous benchmarking and training of specialized models for structured extraction. Organizations building extraction pipelines can now evaluate models against real-world complexity distributions rather than synthetic tasks. The preliminary distillation results suggest that grounding schema-constrained generation in authentic workloads enables training approaches previously infeasible with limited or artificial data. This work likely accelerates adoption of smaller, fine-tuned models for extraction tasks, reducing deployment costs while improving reliability.

Key Takeaways
  • ScrapeGraphAI-100k provides 93,695 real-world schema-constrained extraction examples, addressing a major dataset gap in structured LLM generation research.
  • The dataset spans 18,000+ unique JSON schemas across 15 languages, enabling evaluation of model robustness across linguistic and structural diversity.
  • Empirical findings show sharp performance degradation as schema complexity increases, a pattern invisible in prior synthetic benchmarks.
  • A 1.7B student model fine-tuned on the dataset tracks GPT-5-nano's output distribution, suggesting viable paths to smaller, cost-effective extraction models.
  • Real production telemetry grounds AI training in authentic practitioner workloads, enabling more reliable model evaluation than text-only corpora.