🧠 AI | ⚪ Neutral | Importance 5/10

Conceptual Schema Inference for Tabular Datasets using Large Language Models

arXiv – CS AI | Zhenyu Wu, Jiaoyan Chen, Norman W. Paton | May 27, 2026 at 04:00 AM

🤖 AI Summary

Researchers propose LLM-based approaches (GeSI and EmSI) to automatically infer conceptual schemas from heterogeneous tabular datasets by analyzing column headers and cell values. The methods address the challenge of organizing large, inconsistent data collections from diverse sources by deriving entity types, attributes, and relationships without manual intervention.
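To make the inputs and outputs concrete, here is a minimal Python sketch of what a conceptual schema (entity types, attributes, relationships) might look like as a data structure, and how column headers and sample cell values could be serialized into text for an LLM. Every class, function, and parameter name, as well as the prompt wording, is a hypothetical choice for illustration; the summary does not describe the paper's actual prompts or data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EntityType:
    name: str
    attributes: List[str] = field(default_factory=list)


@dataclass
class Relationship:
    source: str  # name of the source entity type
    target: str  # name of the target entity type
    label: str


@dataclass
class ConceptualSchema:
    entity_types: List[EntityType] = field(default_factory=list)
    relationships: List[Relationship] = field(default_factory=list)


def describe_table(name, headers, rows, max_samples=3):
    """Serialize each column header with a few sample cell values."""
    lines = [f"Table: {name}"]
    for i, header in enumerate(headers):
        samples = [str(row[i]) for row in rows[:max_samples]]
        lines.append(f"- {header}: {', '.join(samples)}")
    return "\n".join(lines)


def build_prompt(tables):
    """Assemble a hypothetical prompt (not the wording used in the paper)."""
    described = "\n\n".join(describe_table(*table) for table in tables)
    return (
        "Infer entity types, their attributes, and relationships "
        "from the following tables.\n\n" + described
    )
```

Calling `build_prompt` with a list of `(name, headers, rows)` tuples yields plain text that could be sent to any LLM client; parsing the model's reply into a `ConceptualSchema` is left out because it depends on the chosen output format.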

Analysis

This research tackles a fundamental data management problem that affects organizations managing large repositories of disparate tabular data. Traditional schema inference relies on manual curation or rule-based systems, which creates bottlenecks for enterprises handling data from multiple sources with inconsistent formatting and representation. The proposed LLM-based approaches (GeSI for generative schema inference and EmSI for embedding-based schema inference) use large language models to automate this process, reducing the time and human effort required for data integration tasks.
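The general idea behind similarity-based grouping, where tables that describe the same kind of entity are clustered before a shared entity type is derived, can be sketched as follows. This is not EmSI: as a stand-in for LLM embeddings it uses a toy representation (the set of header word tokens) compared with Jaccard similarity, and the function names and the `threshold` value are assumptions made for illustration.

```python
import re


def header_tokens(headers):
    """Lowercase the headers and split them into a set of word tokens."""
    tokens = set()
    for header in headers:
        tokens.update(re.findall(r"[a-z0-9]+", header.lower()))
    return tokens


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_tables(tables, threshold=0.5):
    """Greedily add each table to the first group whose seed table is similar enough."""
    groups = []  # each entry: (seed token set, list of table names)
    for name, headers in tables.items():
        tokens = header_tokens(headers)
        for seed, members in groups:
            if jaccard(seed, tokens) >= threshold:
                members.append(name)
                break
        else:
            # No existing group matched, so this table seeds a new one.
            groups.append((tokens, [name]))
    return [members for _, members in groups]
```

Each resulting group is a candidate entity type. In a real pipeline the token sets would be replaced by embedding vectors and the Jaccard score by a vector similarity such as cosine, with a proper clustering algorithm in place of the greedy pass.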

The significance of this work extends across data engineering and enterprise data management. Organizations increasingly struggle with data silos and inconsistent schemas across data lakes, web tables, and open data portals. Automated schema inference using LLMs offers practical benefits: faster integration of new data sources, improved data discoverability, and fewer errors from manual schema mapping. The authors report that their approaches scale to large repositories while keeping the inferred schemas accurate and concise, which matters for real operational use.

For data-intensive industries such as financial services, healthcare, and e-commerce, reliable automated schema inference could shorten time-to-insight and reduce infrastructure costs. Developers building data platforms and integration tools could incorporate these LLM-based techniques to improve automation. The research suggests that LLMs can capture semantic relationships in tabular data beyond simple pattern matching, opening possibilities for more intelligent data governance systems.

Future developments likely include integration of these techniques into commercial data integration platforms and exploration of how schema inference performs with domain-specific datasets. The work establishes LLMs as viable tools for metadata extraction, potentially influencing how enterprise data platforms evolve.

Key Takeaways

→ LLM-based schema inference automates the derivation of entity types, attributes, and relationships from heterogeneous tabular datasets without manual intervention.
→ GeSI uses generative models while EmSI employs embedding-based techniques, offering complementary approaches for different data integration scenarios.
→ The methods scale effectively to large data repositories while maintaining schema quality, addressing enterprise data lake management challenges.
→ Automated schema inference reduces the time and cost of integrating data from multiple inconsistent sources across organizations.
→ The research demonstrates LLMs can understand semantic relationships in tabular data beyond pattern matching, enabling more intelligent data governance.