AI · Neutral · arXiv – CS AI · 9h ago · 6/10
Mage: Multi-Axis Evaluation of LLM-Generated Executable Game Scenes Beyond Compile-Pass Rate
Researchers introduce Mage, a multi-axis evaluation framework showing that compile-pass rate is a misleading metric for assessing LLM-generated code in complex domains. In tests of four open-weight language models on game scene synthesis, direct code generation achieves 43% runtime success but produces structurally invalid outputs, while IR-conditioned approaches recover functional correctness at the cost of lower raw execution rates.
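To illustrate why a single pass rate can mislead, here is a minimal sketch that reports each evaluation axis separately. The `SceneResult` record, the three axis names, and the toy records are assumptions made for illustration only; they are not Mage's actual axes, code, or data.

```python
from dataclasses import dataclass


@dataclass
class SceneResult:
    # Hypothetical per-sample outcomes. The axis names are assumptions,
    # not the axes defined by Mage.
    compiled: bool
    ran: bool
    structurally_valid: bool


def axis_rates(results):
    """Return the fraction of samples that pass each axis, plus all axes at once."""
    n = len(results)
    if n == 0:
        return {"compile": 0.0, "runtime": 0.0, "structure": 0.0, "all_axes": 0.0}
    return {
        "compile": sum(r.compiled for r in results) / n,
        "runtime": sum(r.ran for r in results) / n,
        "structure": sum(r.structurally_valid for r in results) / n,
        "all_axes": sum(
            r.compiled and r.ran and r.structurally_valid for r in results
        ) / n,
    }


# Illustrative toy records, not results from the paper.
toy = [
    SceneResult(compiled=True, ran=True, structurally_valid=False),
    SceneResult(compiled=True, ran=True, structurally_valid=True),
    SceneResult(compiled=True, ran=False, structurally_valid=False),
    SceneResult(compiled=False, ran=False, structurally_valid=False),
]

rates = axis_rates(toy)
for axis, rate in rates.items():
    print(f"{axis}: {rate:.2f}")
```

In this toy set the compile rate is well above the rate at which samples pass every axis, which is the kind of gap a compile-only metric would hide.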