Mage: Multi-Axis Evaluation of LLM-Generated Executable Game Scenes Beyond Compile-Pass Rate
Researchers introduce Mage, a multi-axis evaluation framework that reveals compile-pass rate is a misleading metric for assessing LLM-generated code in complex domains. Testing across four open-weight language models on game scene synthesis, they find direct code generation achieves 43% runtime success but produces structurally invalid outputs, while IR-conditioned approaches recover functional correctness at the cost of lower raw execution rates.
The study challenges a fundamental assumption in LLM code generation evaluation: that compilation success reliably indicates functional correctness. The researchers discovered a critical divergence when testing language models on executable game scene synthesis, where traditional compile-pass metrics actively misrepresent model performance. This matters because it exposes a widespread evaluation blind spot across AI development—metrics optimized for simplicity may obscure whether generated code actually works as intended.
The four-axis Mage framework (compile success, runtime success, structural fidelity, mechanism adherence) reveals nuanced trade-offs hidden by single-metric evaluation. Direct natural-language-to-C# generation exhibits high compile rates but near-zero mechanism fidelity (F₁ ≈ 0.12), meaning code compiles without capturing intended behavior. Conversely, intermediate representation (IR) conditioning recovers structural validity (F₁ up to 1.00) despite lower runtime rates, demonstrating that architectural choices fundamentally alter what gets optimized versus what gets compromised.
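To make the four axes concrete, here is a minimal Python sketch of how per-record outcomes could be scored and aggregated into compile rate, runtime rate, structural fidelity, and mechanism adherence. The field names, the set-overlap F₁ used for the two fidelity axes, and the averaging scheme are illustrative assumptions, not the paper's exact scoring.

```python
from dataclasses import dataclass

@dataclass
class SceneRecord:
    """Per-record outcome for one generated scene (hypothetical field names)."""
    compiled: bool                 # generated C# compiled without errors
    ran: bool                      # scene executed without runtime exceptions
    expected_objects: set[str]     # objects/components the task requires
    produced_objects: set[str]     # objects/components found in the output
    expected_mechanisms: set[str]  # intended gameplay behaviors
    observed_mechanisms: set[str]  # behaviors observed in the replay log

def f1(expected: set[str], observed: set[str]) -> float:
    """Set-overlap F1: harmonic mean of precision and recall."""
    if not expected and not observed:
        return 1.0
    tp = len(expected & observed)
    if tp == 0:
        return 0.0
    precision = tp / len(observed)
    recall = tp / len(expected)
    return 2 * precision * recall / (precision + recall)

def evaluate(records: list[SceneRecord]) -> dict[str, float]:
    """Aggregate the four axes over a benchmark run."""
    n = len(records)
    return {
        "compile_rate": sum(r.compiled for r in records) / n,
        "runtime_rate": sum(r.ran for r in records) / n,
        "structural_f1": sum(f1(r.expected_objects, r.produced_objects) for r in records) / n,
        "mechanism_f1": sum(f1(r.expected_mechanisms, r.observed_mechanisms) for r in records) / n,
    }
```

Scoring structure and behavior as separate set-overlap axes is what lets a run with a high compile rate still land near zero on mechanism fidelity, which is exactly the divergence the framework is designed to expose.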
For AI developers and researchers, this finding suggests current benchmarking practices systematically misallocate credit and blame. Models appearing strong on standard metrics may fail in deployment scenarios requiring behavioral correctness. Saturation between behavior-only and full-scene IR granularity (the two conditions are statistically indistinguishable, p = 1.0) further indicates that input-level improvements plateau without architectural changes. The research methodology, which releases benchmark data, replay logs, and per-record metrics, sets a reproducibility standard that could reshape how code generation quality is measured in domains beyond game development, particularly in safety-critical applications where compile success is meaningless without execution correctness.
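As an illustration of what a p = 1.0 saturation result looks like, the sketch below runs a Fisher's exact test on success/failure counts for the two IR granularities. The choice of test and the counts are assumptions for demonstration, not the study's actual procedure or data.

```python
from scipy.stats import fisher_exact

# Illustrative counts only: runtime successes/failures under two IR granularities.
behavior_only = {"success": 18, "failure": 12}
full_scene    = {"success": 18, "failure": 12}

table = [
    [behavior_only["success"], behavior_only["failure"]],
    [full_scene["success"], full_scene["failure"]],
]

_, p_value = fisher_exact(table)
print(f"p = {p_value:.2f}")  # identical outcomes -> p = 1.00, no detectable difference
```

A p-value of 1.0 means the two granularity conditions produce indistinguishable outcome distributions, which is why richer IR inputs alone cannot be expected to close the remaining gap.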
- Compile-pass rate can diverge sharply from functional correctness in domain-specific code generation tasks
- Intermediate representation conditioning trades raw runtime success for structural and behavioral fidelity recovery (see the IR sketch after this list)
- Multi-axis evaluation frameworks are necessary to detect performance divergence hidden by single-metric assessment
- Open benchmarks with replay logs and per-record metrics enable independent verification and reproducible evaluation
- Input granularity improvements show diminishing returns without corresponding architectural innovations
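To clarify what IR conditioning and the two granularity levels mean in practice, the sketch below contrasts a behavior-only IR with a full-scene IR and shows how each could be folded into a generation prompt. The schema, field and component names, and the `build_prompt` helper are hypothetical illustrations, not the benchmark's actual format.

```python
import json

# Hypothetical scene IRs at two granularities; field names are illustrative.
behavior_only_ir = {
    # Only the intended mechanisms: what the scene must *do*.
    "mechanisms": [
        {"trigger": "player_enters_zone", "effect": "spawn_enemy_wave"},
        {"trigger": "enemy_defeated", "effect": "increment_score"},
    ],
}

full_scene_ir = {
    # Mechanisms plus the structural layout of the scene.
    "mechanisms": behavior_only_ir["mechanisms"],
    "objects": [
        {"name": "Player", "components": ["CharacterController", "Health"]},
        {"name": "SpawnZone", "components": ["BoxCollider"], "position": [0, 0, 10]},
        {"name": "ScoreUI", "components": ["Canvas", "TextMesh"]},
    ],
}

def build_prompt(task: str, ir: dict | None) -> str:
    """Direct generation passes only the task text; IR conditioning appends
    a serialized IR for the model to translate into C# scene code."""
    prompt = f"Generate a C# game scene script for: {task}\n"
    if ir is not None:
        prompt += "Conform to this scene IR:\n" + json.dumps(ir, indent=2)
    return prompt

# Direct NL-to-C# vs. IR-conditioned prompting:
print(build_prompt("a wave-survival arena", ir=None))
print(build_prompt("a wave-survival arena", ir=full_scene_ir))
```

The contrast makes the reported trade-off legible: a direct prompt gives the model maximum freedom (and a high chance of something that compiles), while an IR-conditioned prompt pins down the objects and mechanisms the evaluation later checks for, which is where the structural and mechanism F₁ gains come from.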