Executable Schema Contracts: From Automatic Ingestion to Multi-Source Retrieval
Researchers present an automated system that discovers executable schemas from multi-source, heterogeneous data and uses them as a unified contract for knowledge graph construction and intelligent query routing. The approach combines LLM-based schema discovery with deterministic structural analysis and demonstrates improved retrieval performance across four QA benchmarks compared to baseline methods.
This research addresses a fundamental challenge in data integration: unifying information across tables, documents, and semi-structured sources with conflicting schemas and formats. Traditional approaches either require expensive manual schema engineering or abandon structure altogether, which limits the quality of downstream retrieval and reasoning. The proposed system automates schema discovery in two stages. First, the LLM maps source fields onto a constrained field catalog, so it cannot invent field names outside that catalog. Second, deterministic analysis identifies keys and hierarchies. The result is a semantic contract that governs how data flows through the pipeline.
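The two stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: the catalog contents, the function names `filter_to_catalog` and `candidate_keys`, and the uniqueness-based key test are all assumptions chosen for illustration. The idea is that LLM-proposed mappings are accepted only when their target is in the catalog, and key candidates are then found by a plain deterministic check.

```python
from typing import Any

# Hypothetical field catalog; the real catalog is not described in the summary.
FIELD_CATALOG = {"customer_id", "name", "email", "order_id", "order_total"}


def filter_to_catalog(proposed: dict[str, str]) -> dict[str, str]:
    """Keep only LLM-proposed mappings whose target field is in the catalog.

    `proposed` maps a raw source column name to a suggested catalog field.
    Any suggestion outside the catalog is dropped rather than trusted.
    """
    return {src: tgt for src, tgt in proposed.items() if tgt in FIELD_CATALOG}


def candidate_keys(rows: list[dict[str, Any]]) -> list[str]:
    """Return columns whose values are all non-null and unique.

    A simple deterministic stand-in for key detection (an assumption,
    not the paper's algorithm). Values must be hashable.
    """
    if not rows:
        return []
    keys = []
    for col in rows[0]:
        values = [row.get(col) for row in rows]
        if all(v is not None for v in values) and len(set(values)) == len(values):
            keys.append(col)
    return keys


# Example: the third mapping targets a field that is not in the catalog.
proposed = {
    "cust_no": "customer_id",
    "full_name": "name",
    "mood": "customer_sentiment",
}
accepted = filter_to_catalog(proposed)

rows = [
    {"customer_id": 1, "name": "Ada", "email": "a@x.org"},
    {"customer_id": 2, "name": "Ada", "email": None},
]
keys = candidate_keys(rows)
```

Here `accepted` keeps only the two mappings that point into the catalog, and `keys` contains only `customer_id`, because `name` repeats and `email` has a missing value.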
The contribution extends beyond discovery into application. At query time, the schema conditions a multi-tool agent that routes each request to structured lookup, graph traversal, or vector search, then synthesizes the results with traceable provenance. The same schema guides knowledge graph construction, where structural information about keys and hierarchies supports deduplication and entity linking across heterogeneous sources. The ablation studies show that each component (schema-conditioned routing, structural analysis, and schema-guided construction) independently contributes to performance gains, which suggests the approach does not depend on any single technique.
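The routing idea can be sketched as follows. This is an illustrative stand-in only: the summary describes an LLM-based agent, whereas the sketch below uses a simple token-matching rule, and the `Schema`, `Result`, `route`, and `run` names as well as the stubbed tool output are hypothetical. It shows how a schema's key fields and relation names could steer a query toward one of the three tools, and how each result can carry a provenance tag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schema:
    key_fields: frozenset[str]  # fields usable for exact structured lookup
    relations: frozenset[str]   # relation names available in the graph


@dataclass(frozen=True)
class Result:
    tool: str     # which tool produced the answer
    answer: str   # placeholder payload in this sketch
    source: str   # provenance tag for later citation


def route(query: str, schema: Schema) -> str:
    """Pick a tool by matching query tokens against the schema.

    Rule-based stand-in (an assumption) for the schema-conditioned agent.
    """
    tokens = set(query.lower().replace("?", "").split())
    if tokens & schema.key_fields:
        return "structured_lookup"
    if tokens & schema.relations:
        return "graph_traversal"
    return "vector_search"


def run(query: str, schema: Schema) -> Result:
    """Route the query and return a stubbed result tagged with its origin."""
    tool = route(query, schema)
    # Stub: a real system would call the selected backend here.
    return Result(tool=tool, answer=f"<output of {tool}>", source=f"{tool}:stub")


schema = Schema(
    key_fields=frozenset({"order_id"}),
    relations=frozenset({"supplies"}),
)
first = run("Find order_id 42", schema)
second = run("Which vendor supplies widgets?", schema)
third = run("Summarize the return policy", schema)
```

With this schema, the first query goes to structured lookup, the second to graph traversal, and the third falls back to vector search. Keeping the tool name and a source tag on every `Result` is one simple way to make the final synthesized answer traceable.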
For organizations managing real-world data lakes, this work suggests that automated schema discovery could reduce integration costs while improving retrieval quality. The system's applicability across multiple QA benchmarks indicates generalizability beyond narrow use cases. The emphasis on deterministic analysis alongside LLM capabilities offers a pragmatic middle ground between full automation and manual curation, potentially making enterprise data integration more scalable and maintainable.
- Automated schema discovery from multi-source data creates a unified contract for knowledge graph construction and intelligent query routing.
- Combining LLM-based discovery with deterministic structural analysis prevents hallucinations and infers critical database relationships automatically.
- Schema-conditioned routing at query time outperforms retrieval-only and decomposition-based baselines across four QA benchmarks.
- The system maintains provenance awareness, enabling traceable citations and grounded answers in cross-source retrieval.
- Ablation studies confirm that schema-conditioned routing, structural intelligence, and schema-guided construction each independently improve performance.