y0news
🧠 AI · 🔴 Bearish · Importance: 7/10

Is Vibe Coding the Future? An Empirical Assessment of LLM Generated Codes for Construction Safety

arXiv – CS AI | S M Jamil Uddin
🤖 AI Summary

Researchers empirically evaluated 450 LLM-generated Python scripts for construction safety and found alarming reliability gaps, including a 45% silent failure rate where code executes but produces mathematically incorrect safety outputs. The study demonstrates that current frontier LLMs lack the deterministic rigor required for autonomous safety-critical engineering applications, necessitating human oversight and governance frameworks.

Analysis

This research addresses a critical vulnerability in the emerging practice of "vibe coding"—allowing non-technical users to generate executable safety-critical code through natural language prompts to LLMs. The empirical findings expose a dangerous divergence between syntactic reliability and logical correctness: while 85% of generated scripts executed without crashes, nearly half produced mathematically flawed safety calculations that would compound real-world construction hazards. This distinction between execution success and functional accuracy represents a fundamental challenge for AI-assisted development in high-stakes domains.
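The executes-but-wrong failure mode can be made concrete with a toy example. The formula structure below (lanyard length + deceleration distance + worker height + safety margin) is a common fall-clearance pattern, but the function names and values here are illustrative, not taken from the study:

```python
# Hypothetical illustration of a "silent failure": the script runs
# without error, yet the safety math understates the clearance a
# worker actually needs. Values are illustrative, not normative.

def required_fall_clearance_ft(lanyard_ft, decel_ft, worker_height_ft):
    # BUG: the 3 ft safety margin is *subtracted* instead of added.
    # Python raises no error; the number is simply wrong.
    return lanyard_ft + decel_ft + worker_height_ft - 3.0

def required_fall_clearance_correct_ft(lanyard_ft, decel_ft, worker_height_ft):
    # Correct version: the safety margin is added on top.
    return lanyard_ft + decel_ft + worker_height_ft + 3.0

buggy = required_fall_clearance_ft(6.0, 3.5, 6.0)            # executes cleanly
correct = required_fall_clearance_correct_ft(6.0, 3.5, 6.0)  # the intended value
print(buggy, correct)  # → 12.5 18.5
```

No crash, no traceback: only a domain expert (or a reference calculation) would notice that 12.5 ft is 6 ft short of the required clearance — exactly the class of error that syntactic checks cannot catch.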

The study's bifurcated evaluation methodology—combining sandboxed execution with LLM-based logic assessment—reveals that hallucination rates rise as prompt formality drops. Less structured requests trigger the models' tendency to fabricate missing safety variables, a pattern consistent across Claude, GPT-4o-Mini, and Gemini 2.5 Flash. GPT-4o-Mini showed particularly concerning performance, with 56% of its functional code containing mathematical errors.
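The execution half of such a two-stage evaluation can be sketched as follows: run each generated script in a subprocess with a timeout, then compare its printed result against a reference value. This is a simplified assumption about the pipeline, not the paper's actual harness:

```python
# Sketch of a sandboxed execution check: a generated script either
# (a) fails to run, (b) runs but prints the wrong number, or
# (c) runs and matches the reference value. Case (b) is the
# "silent failure" the study counts. Simplified illustration only.
import os
import subprocess
import sys
import tempfile

def runs_and_matches(script_source, expected, tol=1e-6):
    """Return (executed_ok, output_correct) for a generated script."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(script_source)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=10,
        )
    except subprocess.TimeoutExpired:
        return False, False  # hung scripts count as execution failures
    finally:
        os.unlink(path)
    if proc.returncode != 0:
        return False, False  # crashed: syntactic/runtime failure
    try:
        value = float(proc.stdout.strip().splitlines()[-1])
    except (ValueError, IndexError):
        return True, False   # ran, but produced no usable number
    return True, abs(value - expected) <= tol

# A script that executes cleanly but prints the wrong clearance is a
# silent failure: executed_ok=True, output_correct=False.
ok, correct = runs_and_matches("print(6.0 + 3.5 + 6.0 - 3.0)", expected=18.5)
print(ok, correct)  # → True False
```

The study pairs a check like this with an LLM-based review of the code's logic; the point of the split is that neither stage alone distinguishes case (b) from case (c).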

For the broader AI development ecosystem, this research validates skepticism about deploying LLMs as standalone engineering tools in safety-critical contexts. Construction, healthcare, and cyber-physical systems cannot tolerate 45% failure rates in logic, regardless of execution success. The findings imply that enterprise adoption requires deterministic AI wrappers—validation layers that verify outputs against domain specifications rather than trusting model coherence.
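A deterministic validation layer of the kind described above can be sketched as hand-written domain rules that gate an LLM-produced result before it is trusted. All function names, invariants, and thresholds below are assumptions for illustration:

```python
# Sketch of a validation layer: LLM-generated safety outputs are
# checked against deterministic domain invariants instead of being
# trusted on coherence alone. Names and bounds are illustrative.

def validate_fall_clearance(inputs, llm_result_ft):
    lanyard = inputs["lanyard_ft"]
    decel = inputs["decel_ft"]
    height = inputs["worker_height_ft"]

    # Invariant 1: clearance can never be less than the sum of its
    # physical components (lanyard + deceleration + worker height).
    floor = lanyard + decel + height
    if llm_result_ft < floor:
        raise ValueError(
            f"result {llm_result_ft} ft below physical floor {floor} ft"
        )

    # Invariant 2: sanity bound — flag implausibly large outputs
    # for human review rather than passing them through.
    if llm_result_ft > floor + 10.0:
        raise ValueError(f"result {llm_result_ft} ft exceeds plausible range")

    return llm_result_ft

# A silently wrong LLM output is rejected instead of propagating:
inputs = {"lanyard_ft": 6.0, "decel_ft": 3.5, "worker_height_ft": 6.0}
try:
    validate_fall_clearance(inputs, 12.5)
except ValueError as e:
    print("rejected:", e)
```

The wrapper is deterministic by construction: it encodes the domain specification directly, so a mathematically wrong answer fails loudly at the boundary rather than compounding downstream.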

Looking forward, this work will likely accelerate development of verification frameworks and hybrid human-AI workflows for safety engineering. Organizations considering LLM-based tool generation in regulated industries should implement rigorous testing protocols and restrict autonomous deployment to non-critical functions.

Key Takeaways
  • LLM-generated code shows 85% syntactic viability but 45% silent failure rate in construction safety logic, masking dangerous mathematical errors
  • Prompt formality significantly impacts output reliability, with informal requests triggering higher rates of fabricated safety variables
  • GPT-4o-Mini produced mathematically inaccurate outputs in 56% of successfully executing scripts, the worst performance among tested models
  • Current frontier LLMs lack deterministic rigor for autonomous safety-critical engineering and require human verification and governance frameworks
  • Enterprises deploying LLM-generated code in regulated industries must implement validation layers and restrict autonomous use to non-critical functions
Models Mentioned
  • GPT-4 (OpenAI)
  • Claude (Anthropic)
  • Gemini (Google)
Read Original → via arXiv – CS AI