AI | Bullish | Importance 7/10

MemoryVAM: Integrating Memory into Video Action Model for Robot Manipulation

arXiv – CS AI | Yuxin Jiang, Chang Yu, Yunuo Chen, Xiang Feng, Yin Yang, Nishank Gite, Chenfanfu Jiang | June 23, 2026 at 04:00 AM

🤖AI Summary

MemoryVAM introduces an episodic memory mechanism for video-world-model policies that enables robots to perform long-horizon manipulation tasks by retaining and leveraging historical context. The system achieves significant performance improvements on benchmark tasks and real robot experiments, addressing a fundamental limitation where short observation windows make complex manipulation non-Markovian.

Analysis

MemoryVAM addresses a critical constraint in robotic manipulation: video-world-model policies typically condition on short observation windows, so they cannot handle tasks that require memory of past events. The paper closes that gap with a Recap-Cue module that compresses per-frame CLIP embeddings into a small set of memory tokens, letting the policy condition its actions on the full episode history rather than on immediate observations alone. The design is backbone-agnostic: it works with both UNet and Diffusion Transformer backbones, and only the cross-attention injection interface needs to change.
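The compress-then-inject idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the summary does not say how Recap-Cue performs the compression, so this sketch assumes a set of learned query tokens that cross-attend to the per-frame embeddings, followed by a residual cross-attention step that injects the resulting memory tokens into a backbone's hidden tokens. The names `RecapCueSketch`, `cross_attention`, and `inject_memory`, the random weights, and the random stand-in embeddings (in place of real CLIP features) are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: queries (Nq, D) attend to context (Nc, D)."""
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


class RecapCueSketch:
    """Hypothetical stand-in for a memory compressor (assumed design, not the paper's).

    A fixed number of query tokens attends over every frame embedding in the
    episode, so the output size does not grow with episode length.
    """

    def __init__(self, dim, num_memory_tokens, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        # In a trained model these would be learned parameters.
        self.memory_queries = rng.normal(size=(num_memory_tokens, dim)) * scale
        self.w_q, self.w_k, self.w_v = (
            rng.normal(size=(dim, dim)) * scale for _ in range(3)
        )

    def compress(self, frame_embeddings):
        """Map (num_frames, dim) embeddings to (num_memory_tokens, dim) tokens."""
        return cross_attention(
            self.memory_queries, frame_embeddings, self.w_q, self.w_k, self.w_v
        )


def inject_memory(hidden_tokens, memory_tokens, w_q, w_k, w_v):
    """Residual cross-attention: backbone tokens read from the memory tokens."""
    return hidden_tokens + cross_attention(hidden_tokens, memory_tokens, w_q, w_k, w_v)


# Usage with random stand-ins for per-frame embeddings and backbone tokens.
rng = np.random.default_rng(1)
dim = 16
episode = rng.normal(size=(200, dim))      # 200 frames of history
recap = RecapCueSketch(dim, num_memory_tokens=8)
memory = recap.compress(episode)           # shape (8, 16)

hidden = rng.normal(size=(32, dim))        # tokens inside a video backbone block
weights = [rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3)]
conditioned = inject_memory(hidden, memory, *weights)  # shape (32, 16)
```

Because the injection step only needs a set of hidden tokens to act as queries, the same function could in principle sit inside a UNet block or a Diffusion Transformer block, which is one plausible reading of why only the cross-attention interface has to change.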

The key technical choice is that memory is learned without per-frame progress labels. The system is instead trained jointly with a video prediction objective, a delta-reconstruction auxiliary loss, and episode-boundary supervision, which is more practical than relying on densely annotated training data. This matters because annotation cost is a major barrier to scaling robot learning systems.
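A joint objective of this kind is commonly written as a weighted sum of the individual terms. The sketch below shows that pattern under several assumptions that the summary does not confirm: "delta-reconstruction" is read as reconstructing the difference between consecutive frame embeddings, episode-boundary supervision is modeled as a binary label per frame, and the weights `w_delta` and `w_boundary` are placeholder values. The function names are hypothetical.

```python
import numpy as np


def mse(pred, target):
    """Mean squared error."""
    return float(np.mean((pred - target) ** 2))


def binary_cross_entropy(prob, label, eps=1e-7):
    """Binary cross-entropy on probabilities, clipped for numerical safety."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(-np.mean(label * np.log(prob) + (1.0 - label) * np.log(1.0 - prob)))


def joint_loss(pred_frames, true_frames, pred_deltas, frame_embeddings,
               boundary_prob, boundary_label, w_delta=0.1, w_boundary=0.1):
    """Weighted sum of three terms (weights are illustrative assumptions)."""
    # 1) Video prediction: match predicted future frames to the real ones.
    video = mse(pred_frames, true_frames)
    # 2) Delta reconstruction (assumed reading): recover the change between
    #    consecutive frame embeddings.
    true_deltas = frame_embeddings[1:] - frame_embeddings[:-1]
    delta = mse(pred_deltas, true_deltas)
    # 3) Episode boundary (assumed reading): a binary label per frame.
    boundary = binary_cross_entropy(boundary_prob, boundary_label)
    return video + w_delta * delta + w_boundary * boundary


# Usage with random stand-in data.
rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 8, 8))
embeddings = rng.normal(size=(10, 16))
labels = np.zeros(10)
labels[0] = 1.0  # only the first frame is marked as an episode boundary

perfect = joint_loss(frames, frames, embeddings[1:] - embeddings[:-1],
                     embeddings, labels, labels)
noisy = joint_loss(frames + 1.0, frames, np.zeros((9, 16)),
                   embeddings, np.full(10, 0.5), labels)
```

None of the three targets in this sketch needs a human-written progress label: future frames and embedding differences come from the recorded video itself, and under the assumption made here the boundary label follows from where each recorded episode starts and ends.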

Benchmark results show substantial improvements: the LIBERO-Mem success rate rose from 5% to 42.5%, while real robot experiments reached 78.3% on counting, 80.0% on spatial recall, and 75.0% on sequential tracking. These results indicate measurable progress toward robots that can handle multi-step tasks whose correct action depends on earlier observations. The work also shows how attention-based video backbones can be adapted to incorporate temporal memory.
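To put the reported figures in perspective, the short calculation below derives the absolute and relative gain on LIBERO-Mem and an unweighted mean over the three real robot tasks. The input numbers come from the summary above; the derived quantities are computed here for illustration and are not figures reported by the paper.

```python
# Success rates in percent, as reported in the summary.
baseline_libero_mem = 5.0
memoryvam_libero_mem = 42.5

gain_points = memoryvam_libero_mem - baseline_libero_mem   # percentage points
relative_gain = memoryvam_libero_mem / baseline_libero_mem  # multiplicative factor

real_robot = {
    "counting": 78.3,
    "spatial recall": 80.0,
    "sequential tracking": 75.0,
}
# Unweighted mean across tasks (assumes equal weight per task).
mean_real_robot = sum(real_robot.values()) / len(real_robot)

print(f"LIBERO-Mem gain: {gain_points:.1f} points ({relative_gain:.1f}x)")
print(f"Real robot mean: {mean_real_robot:.1f}%")
```

The LIBERO-Mem gain works out to 37.5 percentage points, or 8.5 times the baseline, and the real robot tasks average roughly 77.8%.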

Future developments should focus on scaling memory mechanisms to longer episodes, reducing computational overhead of memory injection, and testing generalization across diverse task distributions beyond current benchmarks.

Key Takeaways

→MemoryVAM integrates episodic memory into video-world-model policies, enabling robots to perform long-horizon tasks requiring historical context.
→The Recap-Cue module compresses per-frame embeddings into compact memory tokens injected into both video backbones and action decoders.
→LIBERO-Mem success rate improved from 5% to 42.5%, with real robot tasks achieving 75-80% success across counting, spatial, and sequential tasks.
→The architecture works with multiple backbone types (UNet, Diffusion Transformer) by modifying only the cross-attention injection interface.
→Training uses video prediction and auxiliary losses rather than per-frame progress labels, reducing annotation requirements for scalability.