Researchers introduce ASH, an agentic system that learns embodied policies from unlabeled internet video without reward shaping or expert demonstration. Through a self-improvement loop using Inverse Dynamics Models, ASH achieves sustained progression on long-horizon tasks in Pokemon Emerald and Legend of Zelda, significantly outperforming baseline approaches.
ASH advances autonomous agent development by demonstrating that self-improvement mechanisms can overcome scalability limitations inherent in traditional reinforcement learning approaches. Its core innovation is using Inverse Dynamics Models to extract supervision from internet video whenever the agent reaches an impasse. This addresses a fundamental bottleneck: the prohibitive cost of hand-engineering rewards or collecting expert demonstrations for complex, long-horizon tasks.
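The idea of recovering actions from unlabeled video when the agent stalls can be illustrated with a deliberately tiny sketch. This is not ASH's implementation: the grid world, the tabular model, and every name here (`fit_idm`, `label_video`, `is_stuck`, `patience`) are assumptions made for illustration, and a real Inverse Dynamics Model would be a learned network over pixel frames rather than a lookup table.

```python
from collections import Counter, defaultdict

# Hypothetical toy world: a state is an (x, y) grid cell, an action is a button name.
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def step(state, action):
    dx, dy = MOVES[action]
    return (state[0] + dx, state[1] + dy)

def fit_idm(transitions):
    """Fit a tabular inverse dynamics model from the agent's own (s, a, s') experience."""
    votes = defaultdict(Counter)
    for s, a, s_next in transitions:
        delta = (s_next[0] - s[0], s_next[1] - s[1])
        votes[delta][a] += 1
    # For each observed state change, keep the action that most often caused it.
    return {delta: c.most_common(1)[0][0] for delta, c in votes.items()}

def label_video(idm, frames):
    """Infer the action between consecutive frames of an unlabeled video."""
    labeled = []
    for s, s_next in zip(frames, frames[1:]):
        delta = (s_next[0] - s[0], s_next[1] - s[1])
        if delta in idm:  # skip transitions the IDM has never seen
            labeled.append((s, idm[delta]))
    return labeled

def is_stuck(milestones_per_step, patience):
    """Treat the agent as stuck if the milestone count has not risen in `patience` steps."""
    if len(milestones_per_step) <= patience:
        return False
    return milestones_per_step[-1] == milestones_per_step[-1 - patience]

# The agent's own experience: actions are known because the agent chose them.
experience = [((0, 0), a, step((0, 0), a)) for a in MOVES]
idm = fit_idm(experience)

# Unlabeled "internet video": a sequence of states with no recorded actions.
video = [(2, 2), (3, 2), (3, 1), (2, 1)]
pseudo_labels = label_video(idm, video)

# When progress stalls, clone the pseudo-labeled behaviour (here, a lookup-table policy).
policy = {}
if is_stuck([3, 3, 3, 3], patience=3):
    policy.update(dict(pseudo_labels))
```

The design point the sketch tries to capture is that the action labels come from a model trained on the agent's own interaction, so the video itself never needs human annotation.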
The research builds on years of investigation into unsupervised and self-supervised learning for embodied AI. Previous work has explored learning from videos and behavioral cloning, but struggled with distribution shift and plateau effects in extended tasks. ASH's integration of long-term memory via unsupervised key-moment identification from internet-scale video adds a scalability dimension often absent in prior work. The experimental setup—using video games with multi-hour planning horizons—provides a controlled yet challenging testbed that bridges simulation and real-world complexity.
The performance gap is substantial: across 8-hour evaluations, ASH reaches milestone completion rates of 93% in Pokemon Emerald and 83% in Legend of Zelda, while the strongest baselines stagnate at roughly 50-54%. This suggests that self-correcting mechanisms genuinely enable agents to move past local optima. For the AI development ecosystem, ASH indicates that the abundance of internet video, currently underutilized in training, can substitute for expensive expert annotation, potentially accelerating agent capability scaling.
The framework's broader implications extend beyond games. If the approach generalizes to robotics and real-world embodied tasks, it could significantly reduce training costs and human oversight requirements. Future research should validate transfer to continuous control domains and quantify data efficiency gains compared to supervised baselines.
- ASH learns embodied policies from unlabeled internet video without reward shaping or expert demonstrations, addressing scalability bottlenecks in current AI systems.
- The self-improvement loop using Inverse Dynamics Models allows agents to extract relevant supervision when stuck, enabling sustained progression on long-horizon tasks.
- ASH achieves 93% and 83% milestone completion rates in Pokemon Emerald and Legend of Zelda respectively, versus 54% and 50% for the strongest baselines.
- Unsupervised key-moment identification from internet-scale video enables effective long-term memory for multi-hour planning horizons.
- The approach suggests that the abundance of internet video can substitute for expensive expert demonstrations, potentially accelerating agent capability scaling.