MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models
MiraBench introduces a new evaluation framework for robotic world models that prioritizes action-conditioned reliability over visual fidelity. The benchmark reveals that current AI models struggle to faithfully follow commanded actions and exhibit persistent optimism bias when predicting outcomes of failure-inducing actions.
MiraBench addresses a critical gap in how the AI research community evaluates robotic world models. Existing benchmarks focus heavily on visual realism (whether predicted images look convincing) and largely ignore two questions: whether those predictions correspond to physically plausible outcomes, and whether the model respects the actions it is conditioned on. This distinction matters for robotics deployment, where a visually perfect but physically incorrect prediction can lead to failed tasks or unsafe robot behavior. The benchmark's hierarchical approach moves from basic physics adherence to action-following fidelity to optimism bias detection, forming a diagnostic ladder that exposes specific failure modes rather than only aggregate performance scores.
The research reveals three counterintuitive findings with implications for model development. Visual quality doesn't correlate with action reliability, meaning models can appear convincing while fundamentally misunderstanding how actions affect the world. Scaling model size doesn't automatically improve action following, challenging common assumptions about bigger-is-better in AI development. Most critically, optimism bias—the tendency to predict success regardless of whether actions should fail—pervades even leading systems. This systematic failure mode suggests current training approaches don't adequately penalize unrealistic success predictions.
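To make the optimism bias idea concrete, the sketch below computes one plausible measure: among actions that should fail, the fraction for which the model still predicts success. This is an illustrative assumption, not MiraBench's actual metric. The `Rollout` record, its field names, and the `optimism_bias_rate` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Rollout:
    """One evaluated prediction (hypothetical record format, not from MiraBench)."""
    action_should_succeed: bool  # ground-truth outcome of the commanded action
    predicted_success: bool      # outcome shown in the world model's prediction


def optimism_bias_rate(rollouts: Iterable[Rollout]) -> float:
    """Fraction of failure-inducing actions for which success was predicted."""
    # Only actions that should fail are informative about optimism bias.
    failures = [r for r in rollouts if not r.action_should_succeed]
    if not failures:
        raise ValueError("need at least one failure-inducing action")
    optimistic = sum(1 for r in failures if r.predicted_success)
    return optimistic / len(failures)
```

Under this definition, a rate near 0 means the model depicts failures when actions should fail, while a rate near 1 means it predicts success almost regardless of the action. Restricting the denominator to failure-inducing actions keeps the measure from being diluted by easy success cases.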
For the robotics and AI communities, MiraBench provides diagnostic infrastructure for building reliable simulators. Rather than chasing visual benchmarks, developers can now target action-conditioned reliability, potentially redirecting research efforts toward physically grounded learning. The evaluation framework also proposes a new standard for what "good" world model performance means, which may influence how future systems are trained and validated for real-world deployment.
- Visual fidelity is a poor predictor of action-conditioned reliability in robotic world models
- Current state-of-the-art models exhibit persistent optimism bias across 12 tested configurations
- Larger model scales do not reliably improve action-following capabilities
- MiraBench's three-level hierarchy provides a diagnostic foundation for identifying specific failure modes
- Action-conditioned reliability should replace appearance-based metrics as a primary evaluation target