Your Simulation Runs but Solves the Wrong Physics: PDE-Grounded Intent Verification for LLM-Generated Multiphysics Simulation Code
Researchers present a method to verify that LLM-generated simulation code solves the intended physics equations, not merely that it executes successfully. They introduce the Intent Fidelity Score (IFS) to structurally compare generated PDEs against user intent, and show on 220 multiphysics cases that execution-only validation fails to flag the 39-40% of cases that solve incorrect physics.
The research exposes a critical blind spot in AI code generation evaluation: execution success does not guarantee correctness for scientific simulation. While a generated MOOSE input file may run, mesh, and converge numerically, it can encode governing equations fundamentally different from those intended. This comprehension-generation gap represents a silent failure mode invisible to standard testing practices.
The work builds on the compositional structure of MOOSE's Kernel and BC objects to deterministically reconstruct encoded PDEs and compare them against formal intent contracts. The Intent Fidelity Score provides granular structural metrics covering governing terms, boundary conditions, initial conditions, coefficients, and time schemes. Testing on MooseBench, a 220-case benchmark with PDE-level ground truth, demonstrates that direct LLM generation consistently produces correct code at lower rates than execution suggests. On harder cases where direct generation fails (IFS < 0.7), iterative refinement using deterministic violation reports recovers +0.22 to +0.41 absolute IFS.
The deployment audit reveals that execution-only repair strategies, commonly used in practice, improve runability while leaving roughly 40% of cases solving incorrect physics. This separates executability from intent fidelity as distinct failure modes. Proof-of-concept experiments across four PDE-oriented domain-specific languages (UFL/FEniCS, FreeFEM, FiPy, Devito) suggest the reconstruction pattern generalizes beyond MOOSE.
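The gap between the two repair strategies can be sketched as a single loop with a pluggable acceptance criterion. Here `run`, `score`, and `fix` are hypothetical stand-ins for the executor, the IFS check, and the LLM repair call; only the gating logic is the point.

```python
def repair_loop(code, run, score, fix, threshold=0.7, max_rounds=5):
    """Generic repair loop sketch.
    run(code)  -> (ok, log): whether the simulation executes.
    score(code) -> IFS-like fidelity in [0, 1]; pass score=None to get
                   execution-only repair.
    fix(code, log, fidelity) -> a proposed revision.
    Execution-only repair accepts as soon as the code runs; IFS-gated
    repair also requires the reconstructed PDE to match intent."""
    fidelity = None
    for _ in range(max_rounds):
        ok, log = run(code)
        fidelity = score(code) if score else None
        if ok and (fidelity is None or fidelity >= threshold):
            break  # accepted under this strategy's criterion
        code = fix(code, log, fidelity)
    return code, fidelity
```

With an execution-only criterion the loop halts at the first runnable candidate, which is how a repaired case can end up runnable yet still solving the wrong physics; the IFS-gated variant keeps refining until the encoded equations match the contract or the round budget is exhausted.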
For scientific computing and engineering applications relying on LLM-generated simulation code, this work establishes that mathematical correctness requires verification against encoded physics structure, not acceptance based on successful execution alone. Organizations deploying AI for scientific simulation must adopt verification layers that validate semantic correctness at the PDE level.
- Executable simulation code frequently encodes incorrect physics while passing execution tests, creating a hidden failure mode in LLM-generated scientific code.
- Intent Fidelity Score provides deterministic structural verification of PDEs by reconstructing equations from code and comparing against formal intent contracts.
- Iterative refinement using PDE-level violation reports recovers 0.22-0.41 absolute IFS on hard cases where direct generation falls below 0.7 fidelity.
- Execution-only repair strategies leave 39-40% of generated simulation code solving wrong physics despite successful convergence.
- PDE-grounded verification patterns generalize across multiple domain-specific languages including UFL/FEniCS, FreeFEM, FiPy, and Devito.