
Your Simulation Runs but Solves the Wrong Physics: PDE-Grounded Intent Verification for LLM-Generated Multiphysics Simulation Code

arXiv – CS AI | Zhenghan Song, Yulong Liu, Cheng Wan, Chenjun Li, Lingfu Liu, Yunyi Li, Congcong Yuan
🤖 AI Summary

Researchers present a method to verify that LLM-generated simulation code solves the intended physics equations, rather than merely checking that it executes successfully. They introduce the Intent Fidelity Score (IFS), which structurally compares the PDEs encoded in generated code against the user's intent, and show on a 220-case multiphysics benchmark that execution-only validation misses the 39-40% of cases that solve the wrong physics.

Analysis

The research exposes a critical blind spot in AI code generation evaluation: execution success does not guarantee correctness for scientific simulation. While a generated MOOSE input file may run, mesh, and converge numerically, it can encode fundamentally different governing equations than intended. This comprehension-generation gap represents a silent failure mode invisible to standard testing practices.
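To make the silent failure mode concrete, here is a minimal toy sketch, not taken from the paper: two explicit 1-D finite-difference solvers that both run to completion and stay numerically stable, yet discretize different governing equations (pure diffusion versus advection-diffusion). All function names, grid sizes, and coefficients are illustrative.

```python
# Toy illustration (not from the paper): both solvers below execute and stay
# stable, yet encode different governing equations.

def step_diffusion(u, dx, dt, D=0.1):
    """One explicit step of the intended physics: u_t = D u_xx."""
    v = u[:]
    for i in range(1, len(u) - 1):
        v[i] = u[i] + dt * D * (u[i - 1] - 2 * u[i] + u[i + 1]) / dx**2
    return v

def step_advection_diffusion(u, dx, dt, D=0.1, a=0.5):
    """Same, plus an advection term the user never asked for:
    u_t + a u_x = D u_xx."""
    v = u[:]
    for i in range(1, len(u) - 1):
        diff = D * (u[i - 1] - 2 * u[i] + u[i + 1]) / dx**2
        adv = a * (u[i + 1] - u[i - 1]) / (2 * dx)
        v[i] = u[i] + dt * (diff - adv)
    return v

def run(stepper, n=50, steps=200, dx=0.02, dt=1e-4):
    # Top-hat initial condition, fixed boundary values.
    u = [1.0 if n // 3 < i < 2 * n // 3 else 0.0 for i in range(n)]
    for _ in range(steps):
        u = stepper(u, dx, dt)
    return u

u_intended = run(step_diffusion)
u_generated = run(step_advection_diffusion)

# Both "simulations" complete without error and remain bounded ...
assert all(abs(v) < 2.0 for v in u_intended + u_generated)
# ... but the solution fields differ: only a PDE-level check catches this.
drift = max(abs(r - w) for r, w in zip(u_intended, u_generated))
print(f"max pointwise difference: {drift:.3f}")
```

Any execution-only test passes both runs; only comparing the discretized operators (or the fields they produce) against the intended PDE exposes the mismatch.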

The work builds on the compositional structure of MOOSE's Kernel and BC objects to deterministically reconstruct encoded PDEs and compare them against formal intent contracts. The Intent Fidelity Score provides granular structural metrics covering governing terms, boundary conditions, initial conditions, coefficients, and time schemes. Testing on MooseBench, a 220-case benchmark with PDE-level ground truth, demonstrates that direct LLM generation consistently produces correct code at lower rates than execution suggests. On harder cases where direct generation fails (IFS < 0.7), iterative refinement using deterministic violation reports recovers +0.22 to +0.41 absolute IFS.
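The reconstruction-and-comparison idea can be sketched in a few lines. Everything below is a simplified stand-in: the kernel-to-term mapping, the MOOSE-style input snippet, and the set-overlap scoring rule are illustrative assumptions, not the paper's exact IFS definition, which also covers initial conditions, coefficients, and time schemes.

```python
# Simplified sketch of PDE reconstruction and scoring. The kernel-to-term
# mapping, the input snippet, and the Jaccard-style score below are
# illustrative assumptions, not the paper's exact IFS definition.
import re

TERM_MAP = {  # hypothetical MOOSE object type -> canonical PDE term
    "TimeDerivative": "du/dt",
    "Diffusion": "-div(k grad u)",
    "BodyForce": "-f",
    "DirichletBC": "dirichlet",
    "NeumannBC": "neumann",
}

def reconstruct(moose_input):
    """Collect 'type = X' entries per top-level block of a MOOSE-style file."""
    terms, block = {"governing": set(), "bcs": set()}, None
    for raw in moose_input.splitlines():
        line = raw.strip()
        top = re.match(r"\[([A-Za-z]\w*)\]$", line)  # [Kernels], [BCs], ...
        if top:
            block = top.group(1)
        typed = re.match(r"type\s*=\s*(\w+)", line)
        if typed and typed.group(1) in TERM_MAP:
            key = "bcs" if block == "BCs" else "governing"
            terms[key].add(TERM_MAP[typed.group(1)])
    return terms

def ifs(intent, found):
    """Toy fidelity score: mean set overlap (Jaccard) across categories."""
    scores = [len(want & found.get(cat, set())) / len(want | found.get(cat, set()))
              for cat, want in intent.items()]
    return sum(scores) / len(scores)

generated = """
[Kernels]
  [./time]
    type = TimeDerivative
  [../]
  [./diff]
    type = Diffusion
  [../]
[]
[BCs]
  [./left]
    type = DirichletBC
  [../]
[]
"""

# Intent contract: transient diffusion WITH a source term, Dirichlet BCs.
intent = {"governing": {"du/dt", "-div(k grad u)", "-f"}, "bcs": {"dirichlet"}}

score = ifs(intent, reconstruct(generated))
print(f"IFS = {score:.2f}")  # 0.83: the file runs, but the source term is gone
```

Because the reconstruction is deterministic, the same comparison that produces the score can also emit a precise violation report ("missing governing term -f"), which is what drives the refinement loop described above.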

The deployment audit reveals that execution-only repair strategies, commonly used in practice, improve runability while leaving roughly 40% of cases solving incorrect physics. This separates executability from intent fidelity as distinct failure modes. Proof-of-concept experiments across four PDE-oriented domain-specific languages (UFL/FEniCS, FreeFEM, FiPy, Devito) suggest the reconstruction pattern generalizes beyond MOOSE.
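A rough sketch of why the pattern could generalize across DSLs, under the assumption that term extraction can be done by pattern matching (the regexes and one-line snippets below are illustrative, not the paper's actual extraction rules): each front end spells the same mathematical term differently, but every spelling normalizes to one canonical label before comparison.

```python
# Illustrative cross-DSL normalization (patterns and snippets are assumptions,
# not the paper's extraction rules): each front end spells the same PDE term
# differently, but every spelling maps to one canonical label.
import re

CANONICAL = {
    "time_derivative": [r"TransientTerm", r"\bu\.dt\b", r"\bu_t\b"],
    "diffusion": [r"DiffusionTerm", r"\bu\.laplace\b",
                  r"inner\(grad\(u\),\s*grad\(v\)\)"],
}

def extract_terms(snippet):
    return {term for term, patterns in CANONICAL.items()
            if any(re.search(p, snippet) for p in patterns)}

snippets = {  # hypothetical one-line equation definitions in each DSL
    "FiPy":   "eq = TransientTerm() == DiffusionTerm(coeff=D)",
    "Devito": "pde = u.dt - c * u.laplace",
    "UFL":    "F = u_t*v*dx + inner(grad(u), grad(v))*dx",
}

for dsl, code in snippets.items():
    print(dsl, sorted(extract_terms(code)))
# Every snippet reduces to the same canonical set, so a single intent
# contract can be checked against any of the front ends.
```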

For scientific computing and engineering applications relying on LLM-generated simulation code, this work establishes that mathematical correctness requires verification against encoded physics structure, not acceptance based on successful execution alone. Organizations deploying AI for scientific simulation must adopt verification layers that validate semantic correctness at the PDE level.

Key Takeaways
  • Executable simulation code frequently encodes incorrect physics while passing execution tests, creating a hidden failure mode in LLM-generated scientific code.
  • Intent Fidelity Score provides deterministic structural verification of PDEs by reconstructing equations from code and comparing against formal intent contracts.
  • Iterative refinement using PDE-level violation reports recovers 0.22-0.41 absolute IFS on hard cases where direct generation fails below 0.7 fidelity.
  • Execution-only repair strategies leave 39-40% of generated simulation code solving the wrong physics despite successful convergence.
  • PDE-grounded verification patterns generalize across multiple domain-specific languages, including FEniCS, FreeFEM, FiPy, and Devito.
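The refinement loop behind the third takeaway can be sketched as a verify-report-refine cycle. In the paper the repair step is an LLM guided by deterministic violation reports; the stub below applies the report directly, only to show the control flow, and the term names and simple set-overlap score are illustrative assumptions.

```python
# Toy verify -> report -> refine loop. In the paper the repair step is an LLM
# guided by deterministic violation reports; the stub below applies the report
# directly, only to show the control flow. Term names and the Jaccard-style
# score are illustrative; 0.7 mirrors the IFS cutoff used for "hard" cases.

def violations(intent, generated):
    return {"missing": intent - generated, "extra": generated - intent}

def score(intent, generated):
    union = intent | generated
    return len(intent & generated) / len(union) if union else 1.0

def refine(generated, report):
    """Stand-in for one LLM repair round: fix a single reported violation."""
    if report["missing"]:
        return generated | {next(iter(report["missing"]))}
    if report["extra"]:
        return generated - {next(iter(report["extra"]))}
    return generated

intent = {"du/dt", "-div(k grad u)", "-f", "dirichlet_bc"}
generated = {"du/dt", "-div(k grad u)", "neumann_bc"}  # no source, wrong BC

rounds, s = 0, score(intent, generated)           # starts at 0.40
while s < 0.7 and rounds < 10:
    generated = refine(generated, violations(intent, generated))
    s = score(intent, generated)
    rounds += 1
    print(f"round {rounds}: score = {s:.2f}")     # 0.60, then 0.80
```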