AI · Bullish · arXiv – CS AI · 9h ago · 6/10
ScrapeGraphAI-100k: Dataset for Schema-Constrained LLM Generation
Researchers introduce ScrapeGraphAI-100k, a large-scale dataset of 93,695 real-world schema-constrained extraction events collected from production use. The dataset addresses a critical gap in AI training by pairing actual web content with JSON schemas, prompts, and LLM responses, enabling better evaluation and training of models for structured data extraction tasks.
🧠 GPT-5