Mage: Multi-Axis Evaluation of LLM-Generated Executable Game Scenes Beyond Compile-Pass Rate
Researchers introduce Mage, a multi-axis evaluation framework that reveals compile-pass rate is a misleading metric for assessing LLM-generated code in complex domains. Testing across four open-weight language models on game scene synthesis, they find direct code generation achieves 43% runtime success but produces structurally invalid outputs, while IR-conditioned approaches recover functional correctness at the cost of lower raw execution rates.
The study challenges a fundamental assumption in LLM code generation evaluation: that compilation success reliably indicates functional correctness. The researchers discovered a critical divergence when testing language models on executable game scene synthesis, where traditional compile-pass metrics actively misrepresent model performance. This matters because it exposes a widespread evaluation blind spot across AI development—metrics optimized for simplicity may obscure whether generated code actually works as intended.
The four-axis Mage framework (compile success, runtime success, structural fidelity, mechanism adherence) reveals nuanced trade-offs hidden by single-metric evaluation. Direct natural-language-to-C# generation exhibits high compile rates but near-zero mechanism fidelity (F₁ ≈ 0.12), meaning code compiles without capturing intended behavior. Conversely, intermediate representation (IR) conditioning recovers structural validity (F₁ up to 1.00) despite lower runtime rates, demonstrating that architectural choices fundamentally alter what gets optimized versus what gets compromised.
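To make the four axes concrete, here is a minimal Python sketch of how per-record outcomes could be scored and aggregated into compile rate, runtime rate, structural fidelity, and mechanism adherence. The field names, the set-overlap F₁ used for the two fidelity axes, and the averaging scheme are illustrative assumptions, not the paper's exact scoring.

```python
from dataclasses import dataclass

@dataclass
class SceneRecord:
    """Per-record outcome for one generated scene (hypothetical field names)."""
    compiled: bool                 # generated C# compiled without errors
    ran: bool                      # scene executed without runtime exceptions
    expected_objects: set[str]     # objects/components the task requires
    produced_objects: set[str]     # objects/components found in the output
    expected_mechanisms: set[str]  # intended gameplay behaviors
    observed_mechanisms: set[str]  # behaviors observed in the replay log

def f1(expected: set[str], observed: set[str]) -> float:
    """Set-overlap F1: harmonic mean of precision and recall."""
    if not expected and not observed:
        return 1.0
    tp = len(expected & observed)
    if tp == 0:
        return 0.0
    precision = tp / len(observed)
    recall = tp / len(expected)
    return 2 * precision * recall / (precision + recall)

def evaluate(records: list[SceneRecord]) -> dict[str, float]:
    """Aggregate the four axes over a benchmark run."""
    n = len(records)
    return {
        "compile_rate": sum(r.compiled for r in records) / n,
        "runtime_rate": sum(r.ran for r in records) / n,
        "structural_f1": sum(f1(r.expected_objects, r.produced_objects) for r in records) / n,
        "mechanism_f1": sum(f1(r.expected_mechanisms, r.observed_mechanisms) for r in records) / n,
    }
```

Scoring structure and behavior as separate set-overlap axes is what lets a run with a high compile rate still land near zero on mechanism fidelity, which is exactly the divergence the framework is designed to expose.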
For AI developers and researchers, this finding suggests current benchmarking practices systematically misallocate credit and blame. Models appearing strong on standard metrics may fail in deployment scenarios requiring behavioral correctness. Saturation between behavior-only and full-scene IR granularity (the two conditions are statistically indistinguishable, p = 1.0) further indicates that input-level improvements plateau without architectural changes. The research methodology, which releases benchmark data, replay logs, and per-record metrics, sets a reproducibility standard that could reshape how code generation quality is measured in domains beyond game development, particularly in safety-critical applications where compile success is meaningless without execution correctness.
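As an illustration of what a p = 1.0 saturation result looks like, the sketch below runs a Fisher's exact test on success/failure counts for the two IR granularities. The choice of test and the counts are assumptions for demonstration, not the study's actual procedure or data.

```python
from scipy.stats import fisher_exact

# Illustrative counts only: runtime successes/failures under two IR granularities.
behavior_only = {"success": 18, "failure": 12}
full_scene    = {"success": 18, "failure": 12}

table = [
    [behavior_only["success"], behavior_only["failure"]],
    [full_scene["success"], full_scene["failure"]],
]

_, p_value = fisher_exact(table)
print(f"p = {p_value:.2f}")  # identical outcomes -> p = 1.00, no detectable difference
```

A p-value of 1.0 means the two granularity conditions produce indistinguishable outcome distributions, which is why richer IR inputs alone cannot be expected to close the remaining gap.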
- Compile-pass rate can diverge sharply from functional correctness in domain-specific code generation tasks
- Intermediate representation conditioning trades raw runtime success for structural and behavioral fidelity recovery (see the IR sketch after this list)
- Multi-axis evaluation frameworks are necessary to detect performance divergence hidden by single-metric assessment
- Open benchmarks with replay logs and per-record metrics enable independent verification and reproducible evaluation
- Input granularity improvements show diminishing returns without corresponding architectural innovations
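To clarify what IR conditioning and the two granularity levels mean in practice, the sketch below contrasts a behavior-only IR with a full-scene IR and shows how each could be folded into a generation prompt. The schema, field and component names, and the `build_prompt` helper are hypothetical illustrations, not the benchmark's actual format.

```python
import json

# Hypothetical scene IRs at two granularities; field names are illustrative.
behavior_only_ir = {
    # Only the intended mechanisms: what the scene must *do*.
    "mechanisms": [
        {"trigger": "player_enters_zone", "effect": "spawn_enemy_wave"},
        {"trigger": "enemy_defeated", "effect": "increment_score"},
    ],
}

full_scene_ir = {
    # Mechanisms plus the structural layout of the scene.
    "mechanisms": behavior_only_ir["mechanisms"],
    "objects": [
        {"name": "Player", "components": ["CharacterController", "Health"]},
        {"name": "SpawnZone", "components": ["BoxCollider"], "position": [0, 0, 10]},
        {"name": "ScoreUI", "components": ["Canvas", "TextMesh"]},
    ],
}

def build_prompt(task: str, ir: dict | None) -> str:
    """Direct generation passes only the task text; IR conditioning appends
    a serialized IR for the model to translate into C# scene code."""
    prompt = f"Generate a C# game scene script for: {task}\n"
    if ir is not None:
        prompt += "Conform to this scene IR:\n" + json.dumps(ir, indent=2)
    return prompt

# Direct NL-to-C# vs. IR-conditioned prompting:
print(build_prompt("a wave-survival arena", ir=None))
print(build_prompt("a wave-survival arena", ir=full_scene_ir))
```

The contrast makes the reported trade-off legible: a direct prompt gives the model maximum freedom (and a high chance of something that compiles), while an IR-conditioned prompt pins down the objects and mechanisms the evaluation later checks for, which is where the structural and mechanism F₁ gains come from.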