Attacking the Trusted Imagination: Oracle-Level Integrity Attacks on Imagine-then-Act World Models
Researchers demonstrate a novel attack on vision-language-action (VLA) systems that exploits the 'trusted imagination' component of world-action models rather than targeting the reactive policy directly. By perturbing observations to corrupt latent trajectory predictions, attackers can fool downstream consumers such as safety gates and MPC planners while leaving the base policy unaffected, revealing a critical asymmetry in AI system robustness.
This research exposes a fundamental architectural vulnerability in modern AI systems that decompose decision-making into imagination and action phases. The attack exploits an implicit trust assumption: downstream components assume the world model's predictions accurately reflect future states. By contaminating the latent trajectory representation with imperceptible perturbations, attackers can trigger failures in systems that depend on these predictions, even when the reactive policy itself remains robust. This breaks a common security assumption that hardening one component translates to system-wide resilience.
The work addresses a growing class of vision-language-action models that separate planning from execution. Recent systems like RynnVLA-002 and LaDi-WM adopt this paradigm for modularity and interpretability. However, this separation creates an overlooked attack surface: the imagination itself becomes a critical trust boundary. The research demonstrates that small perturbations suffice to corrupt the latent trajectory representation: the untargeted attack is reported to be 60x stronger than random noise while remaining imperceptible to human observers.
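As a rough illustration of this kind of attack, the sketch below runs projected gradient ascent under an L-infinity budget against a toy linear encoder and compares the resulting latent shift with random sign noise of the same magnitude. The linear map `W`, the budget `eps = 2/255`, the step size, and all function names are assumptions made for illustration. None of them come from the paper, and the ratio this toy prints should not be read as the reported 60x figure.

```python
import numpy as np

def latent(W, x):
    """Toy stand-in for a world model encoder: a single linear map (assumption)."""
    return W @ x

def pgd_linf(W, x, eps, steps=20, seed=0):
    """Untargeted PGD: maximise ||latent(x + d) - latent(x)||^2 with ||d||_inf <= eps."""
    rng = np.random.default_rng(seed)
    d = rng.uniform(-eps, eps, size=x.shape)  # random start inside the budget
    step = eps / 4
    for _ in range(steps):
        grad = 2.0 * W.T @ (W @ d)            # analytic gradient of ||W d||^2
        d = np.clip(d + step * np.sign(grad), -eps, eps)  # ascend, then project
    return d

rng = np.random.default_rng(1)
W = rng.normal(size=(16, 256)) / 16.0         # hypothetical 256-dim obs -> 16-dim latent
x = rng.uniform(0.0, 1.0, size=256)           # hypothetical flattened observation
eps = 2 / 255                                 # assumed perturbation budget

d_adv = pgd_linf(W, x, eps)
d_rand = rng.choice([-eps, eps], size=x.shape)  # random noise at the same budget

# Valid-pixel clipping of x + d is omitted to keep the toy short.
shift_adv = np.linalg.norm(latent(W, x + d_adv) - latent(W, x))
shift_rand = np.linalg.norm(latent(W, x + d_rand) - latent(W, x))
print("latent shift ratio (adversarial / random):", shift_adv / shift_rand)
```

Even in this linear toy, the optimised perturbation moves the latent noticeably further than random noise of equal size. A real world model is nonlinear, so the gradient would come from automatic differentiation rather than a closed form.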
For AI safety and autonomous systems communities, this finding has immediate implications. Systems relying on intermediate representations from machine learning models may inherit vulnerabilities not apparent in component-level testing. Safety gates and model-predictive controllers that consume world model outputs require additional verification mechanisms. The proposed parameter-free denoiser detector achieves AUC 1.0 on untargeted corruption, suggesting detection is feasible, though adaptive adversaries can evade it by keeping the perturbation within behavioral bounds.
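The summary does not describe the detector's internals. One plausible reading, sketched here under that caveat, is to pass the observation through a fixed, non-learned denoiser and flag inputs whose latent prediction moves substantially once denoised. The 3-tap median filter, the linear encoder `W`, the one-step sign perturbation, and every name below are assumptions for illustration, not the paper's method.

```python
import numpy as np

def median3(x):
    """Parameter-free denoiser (assumption): 3-tap median filter with edge padding."""
    p = np.pad(x, 1, mode="edge")
    return np.median(np.stack([p[:-2], p[1:-1], p[2:]]), axis=0)

def detector_score(W, x):
    """How far the latent moves when the observation is denoised first."""
    return float(np.linalg.norm(W @ x - W @ median3(x)))

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 256)) / 16.0          # hypothetical linear encoder
t = np.linspace(0.0, 1.0, 256)
x_clean = 0.5 + 0.4 * np.sin(2 * np.pi * t)    # smooth toy observation

eps = 2 / 255                                  # assumed L-infinity budget
u = rng.normal(size=16)                        # arbitrary latent direction to push along
x_adv = x_clean + eps * np.sign(W.T @ u)       # one-step sign perturbation (toy attack)

score_clean = detector_score(W, x_clean)
score_adv = detector_score(W, x_adv)
print("clean score:", score_clean)
print("adversarial score:", score_adv)
```

In practice a decision threshold would have to be calibrated on clean data. As the article notes, an adaptive attacker who shapes the perturbation to survive denoising could keep this score low.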
The research highlights that robust AI deployment demands threat modeling beyond traditional adversarial robustness. Organizations integrating world models into safety-critical systems should implement representation-level verification and consider the threat model where intermediary predictions themselves become attack vectors rather than privileged internal signals.
- World-action models' latent trajectory predictions are an overlooked but critical attack surface, distinct from the robustness of the downstream policy
- Small L-infinity-bounded perturbations can corrupt imagination outputs while remaining imperceptible, with untargeted attacks 60x stronger than random noise
- Downstream systems that consume corrupted predictions, including MPC planners, see task success rates drop from 70% to 5% at minimal perturbation levels
- A parameter-free denoiser detector identifies untargeted corruption with perfect AUC, though adaptive attackers can evade detection by controlling perturbation magnitude
- Component-level robustness does not guarantee system-level robustness when intermediate representations are consumed by safety-critical downstream systems