AI · Bullish · arXiv – CS AI · 6h ago · 7/10
From Human Videos to Robot Manipulation: A Survey on Scalable Vision-Language-Action Learning with Human-Centric Data
A comprehensive survey examines how human videos can be used to train Vision-Language-Action (VLA) models for robot manipulation, motivated by the fact that robot demonstrations are expensive to collect and embodiment-specific. The survey groups methods for extracting actionable knowledge from human videos into four approaches and identifies open challenges in video structuring, embodiment transfer, and real-world evaluation.