Synthetic Homes: A Multimodal Generative AI Pipeline for Residential Building Data Generation under Data Scarcity
Researchers developed a multimodal generative AI pipeline that creates synthetic residential building datasets from publicly available county records and images, addressing critical data scarcity challenges in building energy modeling. The system achieves over 65% overlap with national reference data, enabling scalable energy research and urban simulations without relying on expensive or privacy-restricted datasets.
This research tackles a fundamental bottleneck in computational energy modeling: the shortage of accessible building parameter data. Traditional approaches require extensive on-site surveys, proprietary databases, or datasets restricted by privacy regulations, barriers that have historically limited the scope of building energy research and urban planning initiatives. The multimodal framework combines vision-language models, tabular data processing, and simulation components to synthesize realistic building characteristics from already-public sources, which could substantially lower the cost of assembling data for energy research.
The validation methodology deserves attention. Rather than relying solely on visual inspection, the team employed occlusion-based analysis to measure which image regions the model actually relies on. By that measure, their selected vision-language model outperforms GPT-based alternatives at interpreting building images. The 65%+ overlap with national reference datasets across all parameters, and 90%+ for specific metrics, suggests the synthetic data achieves meaningful fidelity without requiring proprietary or sensitive information.
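To make the idea of occlusion-based analysis concrete, the following is a minimal sketch of a generic occlusion sensitivity map, not the authors' implementation. The names `occlusion_map`, `score_fn`, and `toy_score` are hypothetical: `score_fn` stands in for whatever scalar one derives from a model's output on an image, and `toy_score` is a deliberately trivial placeholder that only looks at the top-left corner.

```python
import numpy as np

def occlusion_map(image, score_fn, patch=8, fill=0.0):
    """Mask one patch at a time and record how much the score drops.

    Large drops mark regions the scoring function depends on.
    """
    h, w = image.shape[:2]
    baseline = score_fn(image)
    rows, cols = -(-h // patch), -(-w // patch)  # ceiling division
    drops = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            occluded = image.copy()
            occluded[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = fill
            drops[i, j] = baseline - score_fn(occluded)
    return drops

def toy_score(image):
    # Hypothetical stand-in for a model: responds only to the top-left 8x8 region.
    return float(image[:8, :8].mean())

image = np.ones((16, 16))          # made-up 16x16 grayscale "image"
drops = occlusion_map(image, toy_score, patch=8)
print(drops)                       # only the top-left cell shows a drop
```

With a real vision-language model, `score_fn` would wrap a model call and reduce its answer to a number, for example agreement with the unoccluded prediction; how the paper does this is not stated here.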
For the building science and urban planning sectors, this work lowers a major cost barrier to machine learning adoption. Municipalities and research institutions could conduct energy retrofit analysis, baseline energy assessments, and urban-scale simulations without negotiating data access agreements or funding expensive surveys. This democratization effect could extend to emerging economies and resource-constrained regions where building databases are particularly sparse.
The framework's modular design suggests future extensibility: similar approaches could address data scarcity in other infrastructure domains. Success here may catalyze adoption of synthetic data pipelines across urban computing, climate modeling, and infrastructure resilience planning. Uptake by municipal governments and energy utilities will be the clearest indicator of real-world applicability.
- Multimodal AI pipeline generates realistic building datasets from public county records and images, reducing reliance on expensive or privacy-restricted data sources
- Synthetic data achieves 65%+ parameter overlap with national reference datasets, supporting its practical utility for energy modeling applications
- Occlusion-based visual focus analysis shows the selected vision-language model outperforming GPT-based alternatives for building image processing
- Framework enables scalable downstream applications including energy modeling, retrofit analysis, and urban-scale simulations previously constrained by data scarcity
- Democratized data access lowers barriers to building-scale research for municipal governments and resource-constrained regions globally
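
One plausible way to read the overlap percentages cited above is as a histogram overlap coefficient between a synthetic parameter distribution and a reference one. The sketch below assumes that interpretation; the summary does not state the paper's exact metric. The function name `histogram_overlap` and the sample values are hypothetical and made up for illustration.

```python
import numpy as np

def histogram_overlap(synthetic, reference, bins=10, value_range=None):
    """Overlap coefficient of two samples: sum of bin-wise minimum frequencies.

    Returns 1.0 for identical histograms and 0.0 for disjoint ones.
    """
    synthetic = np.asarray(synthetic, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if value_range is None:
        # Use a shared range so both histograms have the same bin edges.
        value_range = (min(synthetic.min(), reference.min()),
                       max(synthetic.max(), reference.max()))
    p, _ = np.histogram(synthetic, bins=bins, range=value_range)
    q, _ = np.histogram(reference, bins=bins, range=value_range)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.minimum(p, q).sum())

# Made-up values standing in for one building parameter (not data from the paper).
synthetic_values = [1, 1, 2, 2]
reference_values = [2, 2, 3, 3]
overlap = histogram_overlap(synthetic_values, reference_values,
                            bins=3, value_range=(1, 3))
print(overlap)  # half of each distribution falls in the shared middle bin
```

The result depends on the bin count and range, so any reported overlap figure is only comparable across parameters when the binning is fixed in advance.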