AI · Neutral · arXiv – CS AI · 7h ago · 6/10
FEM-Bench: A Structured Scientific Reasoning Benchmark for Evaluating Code-Generating LLMs
Researchers introduce FEM-Bench, a scientific reasoning benchmark that evaluates how well large language models generate correct finite element method (FEM) code for computational mechanics problems. Although the tasks are introductory-level, current state-of-the-art LLMs perform inconsistently: Gemini 3 Pro completed 30 of 33 tasks at least once, and GPT-5 achieved a 73.8% success rate on unit test writing.