BIM-Edit: Benchmarking Large Language Models for IFC-Based Building Information Modeling
Researchers introduce BIM-Edit, a benchmark that evaluates large language models on their ability to edit existing Building Information Models in IFC format based on natural language instructions. The benchmark reveals significant capability gaps: the best-performing LLM reaches an average score of only 49.5% across metrics, and no model fully solves more than 3.4% of tasks. These results highlight that current AI systems struggle with the semantic preservation and relational understanding required for professional engineering workflows.
The introduction of BIM-Edit addresses a critical blind spot in LLM evaluation for engineering applications. While recent research has focused heavily on LLMs generating new design artifacts from text prompts, professional engineering practice demands more sophisticated capabilities: understanding existing complex models, modifying them precisely, and maintaining the intricate semantic relationships that define building systems. This distinction matters because editing existing models is far more common in practice than creating designs from scratch.
The benchmark's design reflects real-world engineering demands. By organizing 324 tasks across geometric, semantic, and topological dimensions, the researchers move beyond simplistic correctness metrics. A model might generate geometrically accurate modifications that violate building codes or break structural relationships, failures that a benchmark checking geometry alone would not detect. The inclusion of spatial and topological instruction categories tests whether LLMs grasp the interdependencies inherent in building systems.
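To make the three dimensions concrete, here is a minimal Python sketch of how an edit can pass a geometric check while failing a topological one. It is a toy illustration, not the benchmark's actual scoring code: the `Element` class, the `hosted_by` field (a simplified stand-in for IFC relationship entities), and the `check_edit` function are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Element:
    """Toy stand-in for an IFC element (hypothetical, heavily simplified)."""
    ifc_class: str                      # e.g. "IfcWall", "IfcDoor"
    position: Tuple[float, float, float]
    hosted_by: Optional[str] = None     # id of the element that hosts this one


def check_edit(before: Dict[str, Element],
               after: Dict[str, Element],
               target_id: str,
               expected_position: Tuple[float, float, float]) -> Dict[str, bool]:
    """Check one edit along three illustrative dimensions."""
    # Geometric: the edited element ended up where the instruction asked.
    geometric = after[target_id].position == expected_position
    # Semantic: every element still exists with its original class.
    semantic = ({k: e.ifc_class for k, e in before.items()}
                == {k: e.ifc_class for k, e in after.items()})
    # Topological: hosting relationships are unchanged by the edit.
    topological = ({k: e.hosted_by for k, e in before.items()}
                   == {k: e.hosted_by for k, e in after.items()})
    return {"geometric": geometric, "semantic": semantic, "topological": topological}


# Instruction (illustrative): "move wall w1 by 2 m along the x axis".
before = {
    "w1": Element("IfcWall", (0.0, 0.0, 0.0)),
    "d1": Element("IfcDoor", (1.0, 0.0, 0.0), hosted_by="w1"),
}
# A careless edit moves the wall correctly but drops the door's link to it.
careless = {
    "w1": Element("IfcWall", (2.0, 0.0, 0.0)),
    "d1": Element("IfcDoor", (1.0, 0.0, 0.0), hosted_by=None),
}
result = check_edit(before, careless, "w1", (2.0, 0.0, 0.0))
print(result)
```

In this toy case the wall lands in the right place and keeps its class, yet the door has lost its host, which is the kind of failure a geometry-only check would miss.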
The results expose a substantial performance gap. The best model's 49.5% average score across all metrics, together with a full-task success rate of no more than 3.4%, suggests that even state-of-the-art models get parts of an edit right but rarely satisfy every structured design constraint at once. This carries direct implications for the CAD software industry and firms betting on AI-assisted engineering workflows. Companies exploring LLM integration into professional tools cannot yet rely on autonomous modifications without extensive human verification, which limits productivity gains and market applications.
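The gap between an average score near 50% and a solve rate in the low single digits is easier to see with a small calculation. The sketch below assumes, purely for illustration, that each task receives a score between 0 and 1 on each of three metrics and that a task counts as solved only when every metric is perfect. Whether BIM-Edit aggregates its scores exactly this way is an assumption, and the numbers are made up for the example rather than taken from the benchmark.

```python
from statistics import mean
from typing import Dict, List, Tuple


def aggregate(task_scores: List[Dict[str, float]]) -> Tuple[float, float]:
    """Return (average score, fraction of fully solved tasks).

    Hypothetical aggregation: a task is 'solved' only if every metric is 1.0.
    """
    per_task_mean = [mean(list(scores.values())) for scores in task_scores]
    average_score = mean(per_task_mean)
    solved_flags = [
        1.0 if all(v == 1.0 for v in scores.values()) else 0.0
        for scores in task_scores
    ]
    solve_rate = mean(solved_flags)
    return average_score, solve_rate


# Illustrative scores only, not benchmark data.
illustrative_scores = [
    {"geometric": 1.0, "semantic": 1.0, "topological": 1.0},  # fully solved
    {"geometric": 1.0, "semantic": 0.5, "topological": 0.0},
    {"geometric": 1.0, "semantic": 0.0, "topological": 0.5},
    {"geometric": 0.5, "semantic": 0.5, "topological": 0.0},
]
average_score, solve_rate = aggregate(illustrative_scores)
print(f"average score: {average_score:.3f}, solve rate: {solve_rate:.2f}")
```

Partial credit accumulates across metrics, so the average stays well above the solve rate whenever models routinely get one dimension right and another wrong.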
These findings should temper near-term expectations for AI in engineering while establishing a useful directional benchmark for future development. The gap between current capabilities and production requirements remains severe enough to prevent mainstream adoption in high-stakes structural environments where errors carry financial and safety consequences.
- The best-performing LLM achieves only a 49.5% average score on structured building model editing, revealing critical limitations for engineering applications.
- The benchmark evaluates three distinct dimensions (geometric accuracy, semantic validity, and topological consistency), capturing complexities missing from simpler design benchmarks.
- No evaluated LLM successfully completed more than 3.4% of tasks, indicating fundamental gaps in understanding interdependent relationships within complex systems.
- Editing existing models requires semantic awareness and relational understanding that differ fundamentally from generating new designs from scratch.
- Results suggest AI-assisted engineering tools cannot yet operate autonomously in high-stakes environments without extensive human verification.