AI · Neutral · arXiv – CS AI · 9h ago · 6/10
Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
Researchers introduce VIGIL, an evaluation framework for embodied AI agents that measures two things separately: whether the agent actually completes the task in the world, and whether it correctly reports that it is finished. Scoring the two together conflates execution failures with commitment failures. Across 20 models, the evaluation finds significant gaps in terminal commitment even where task execution is similar, pointing to a critical blind spot in current AI agent benchmarking.
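To make the distinction concrete, here is a minimal Python sketch of scoring the two outcomes separately. It is an illustration only and not VIGIL's actual metric definitions: the `Episode` record, its field names, and the `score` function are all hypothetical, and the summary above does not specify how the framework computes its numbers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Episode:
    """One evaluation run (hypothetical schema, not taken from the paper)."""
    world_completed: bool  # did the environment actually reach the goal state?
    declared_done: bool    # did the agent issue a terminal "done" report?


def score(episodes: list[Episode]) -> dict[str, Optional[float]]:
    """Report execution and commitment as separate rates."""
    n = len(episodes)
    if n == 0:
        raise ValueError("need at least one episode")

    completed = [e for e in episodes if e.world_completed]

    # Execution: fraction of episodes where the world reached the goal state.
    completion_rate = len(completed) / n

    # Commitment: among completed episodes, how often the agent said so.
    # Undefined (None) when nothing was completed.
    commitment_rate = (
        sum(e.declared_done for e in completed) / len(completed)
        if completed
        else None
    )

    # Premature termination: agent declared done although the goal was not met.
    false_done_rate = sum(
        e.declared_done and not e.world_completed for e in episodes
    ) / n

    # A single conflated score that only counts "completed AND declared done".
    conflated_success = sum(
        e.world_completed and e.declared_done for e in episodes
    ) / n

    return {
        "completion_rate": completion_rate,
        "commitment_rate": commitment_rate,
        "false_done_rate": false_done_rate,
        "conflated_success": conflated_success,
    }


# Two hypothetical agents with identical execution but different commitment.
agent_a = [Episode(True, True), Episode(True, True), Episode(False, False)]
agent_b = [Episode(True, True), Episode(True, False), Episode(False, True)]

scores_a = score(agent_a)
scores_b = score(agent_b)
```

In this toy data both agents complete two of three tasks, so their `completion_rate` is the same, but agent B fails to report one finished task and falsely reports one unfinished task. A single conflated score would show only that B is worse, not whether the problem lies in execution or in the decision to stop.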