DTBench: A Synthetic Benchmark for Document-to-Table Extraction
Researchers introduce DTBench, a synthetic benchmark for evaluating large language models on document-to-table extraction tasks. Using a reverse Table2Doc synthesis approach with multi-agent workflows, the benchmark covers 13 subcategories across 5 major capability areas, revealing significant performance gaps and persistent challenges in reasoning and conflict resolution across mainstream LLMs.
DTBench addresses a critical gap in AI evaluation infrastructure by creating a systematic framework for assessing document-to-table extraction capabilities. Rather than relying on costly human-annotated datasets, the researchers employ a reverse paradigm (Table2Doc) that starts from ground-truth tables and generates synthetic documents from them using multi-agent workflows. Because the gold table exists before the document does, the benchmark can be built at scale while still covering every targeted capability.
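To make the reverse direction concrete, here is a minimal sketch of the idea, not DTBench's actual pipeline. A simple sentence template stands in for the multi-agent LLM workflow, and the names `table_to_doc`, `make_example`, and the sample rows are hypothetical, introduced only for illustration.

```python
from typing import Dict, List

Row = Dict[str, str]


def table_to_doc(rows: List[Row], entity_key: str) -> str:
    """Render each ground-truth row as one sentence.

    Assumption: this template is a stand-in for the LLM writer agents
    that a real Table2Doc workflow would use.
    """
    sentences = []
    for row in rows:
        facts = [f"its {col} is {val}" for col, val in row.items() if col != entity_key]
        sentences.append(f"For {row[entity_key]}, " + " and ".join(facts) + ".")
    return " ".join(sentences)


def make_example(rows: List[Row], entity_key: str) -> Dict[str, object]:
    """Pair the synthetic document with the table it was generated from."""
    return {"document": table_to_doc(rows, entity_key), "gold_table": rows}


# Hypothetical ground-truth table; the gold answer exists before the document.
table = [
    {"company": "Acme", "revenue": "12M", "year": "2023"},
    {"company": "Globex", "revenue": "8M", "year": "2022"},
]
example = make_example(table, "company")
print(example["document"])
# For Acme, its revenue is 12M and its year is 2023. For Globex, its revenue is 8M and its year is 2022.
```

The point of the sketch is the ordering: the extraction target is fixed first, so no human has to annotate the generated document afterwards.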
The benchmark's two-level taxonomy, with 5 major capability areas divided into 13 subcategories, is methodologically significant because previous evaluation frameworks failed to distinguish between different extraction capabilities. Doc2Table extraction demands diverse competencies, including schema alignment, reasoning over implicit information, and conflict resolution, yet existing benchmarks treated extraction as a single monolithic task. DTBench's granular categorization lets researchers pinpoint specific model weaknesses instead of relying on a single aggregate score.
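A short sketch shows why per-category reporting matters. It assumes each test item carries a (major area, subcategory) label and a pass/fail outcome; the label strings and the `scores_by_category` helper are illustrative, not DTBench's actual category names or scoring code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Result = Tuple[str, str, bool]  # (major area, subcategory, item passed?)


def scores_by_category(results: List[Result]) -> Dict[str, Dict[str, float]]:
    """Compute accuracy per subcategory, grouped under its major area."""
    totals = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [hits, count]
    for major, sub, correct in results:
        cell = totals[major][sub]
        cell[0] += int(correct)
        cell[1] += 1
    return {
        major: {sub: hits / n for sub, (hits, n) in subs.items()}
        for major, subs in totals.items()
    }


# Hypothetical outcomes for one model.
results = [
    ("reasoning", "multi_step", True),
    ("reasoning", "multi_step", False),
    ("conflict_resolution", "contradictory_values", False),
    ("schema_alignment", "column_mapping", True),
]

report = scores_by_category(results)
overall = sum(correct for _, _, correct in results) / len(results)

print(overall)  # 0.5
print(report["conflict_resolution"]["contradictory_values"])  # 0.0
```

In this toy data the aggregate score of 0.5 hides a complete failure on the conflict-resolution item, which is the kind of weakness a two-level breakdown is meant to surface.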
The evaluation results across mainstream LLMs demonstrate substantial performance variance and expose three persistent failure modes: reasoning limitations when documents require multi-step inference, faithfulness issues where models hallucinate table entries, and conflict resolution gaps when source documents contain contradictory information. These findings carry implications for enterprise data extraction applications, where structured output reliability directly impacts downstream SQL-based analytics and business decision-making.
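One simple way to surface faithfulness problems is to compare an extracted table with the gold table cell by cell. The sketch below is an assumption about how such a check could look, not the metric DTBench reports; `to_cells`, `cell_metrics`, and the sample rows are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Row = Dict[str, str]
Cell = Tuple[str, str, str]  # (row key, column name, value)


def to_cells(rows: List[Row], key: str) -> Set[Cell]:
    """Flatten a table into a set of (row key, column, value) triples."""
    return {(row[key], col, val) for row in rows for col, val in row.items() if col != key}


def cell_metrics(pred: List[Row], gold: List[Row], key: str) -> Dict[str, object]:
    """Cell-level precision and recall, plus the cells that disagree."""
    p, g = to_cells(pred, key), to_cells(gold, key)
    matched = p & g
    return {
        "precision": len(matched) / len(p) if p else 0.0,
        "recall": len(matched) / len(g) if g else 0.0,
        "unsupported": sorted(p - g),  # predicted cells absent from the gold table
        "missing": sorted(g - p),      # gold cells the model failed to produce
    }


gold = [{"company": "Acme", "revenue": "12M", "year": "2023"}]
pred = [{"company": "Acme", "revenue": "12M", "year": "2024"}]  # wrong year

metrics = cell_metrics(pred, gold, "company")
print(metrics["precision"], metrics["recall"])  # 0.5 0.5
print(metrics["unsupported"])  # [('Acme', 'year', '2024')]
```

Exact string matching is a deliberate simplification here. A wrong value counts once as unsupported and once as missing, and a real evaluation would likely need value normalization (units, date formats) before comparing.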
For the AI development community, DTBench establishes a publicly available evaluation standard that should accelerate progress on document understanding. Organizations deploying LLM-based data extraction pipelines can now benchmark solutions against capability-specific metrics rather than relying on proprietary evaluations. The synthetic generation methodology itself provides a reusable template for constructing evaluation benchmarks in other information extraction domains.
- DTBench introduces the first capability-aware synthetic benchmark covering 13 subcategories of document-to-table extraction tasks.
- Mainstream LLMs show substantial performance gaps, particularly in reasoning, faithfulness, and conflict resolution.
- The reverse Table2Doc synthesis approach enables scalable benchmark generation without expensive human annotation.
- The benchmark provides a structured evaluation framework applicable to enterprise data extraction and SQL analytics workflows.
- The publicly available benchmark is expected to accelerate research on LLM-based information extraction and structured data generation.