LUCID: Learning Embodiment-Agnostic Intent Models from Unstructured Human Videos for Scalable Dexterous Robot Skill Acquisition
LUCID is a machine learning framework that learns robot manipulation skills from unstructured internet videos and human demonstrations, then transfers this knowledge to different robot embodiments through a shared intent model. The approach eliminates the need for expensive, embodiment-specific robot training data and demonstrates zero-shot transfer capabilities across multiple real-world tasks.
LUCID addresses a fundamental bottleneck in robot learning: the high cost and limited scalability of collecting embodiment-specific training data. Traditional robot learning requires extensive demonstrations from actual robots or heavily structured human data, constraining both the volume and diversity of training examples. This work leverages internet-scale video datasets as a primary training source, dramatically expanding available training data while reducing collection costs.
The framework's innovation lies in a two-stage architecture that decouples task intent from embodiment-specific control. An intent model learns from diverse human videos what should happen next in a scene, while separate sensorimotor policies translate this abstract intent into robot-specific commands. This separation lets the same intent model drive different robotic systems, from dexterous hands to parallel-jaw grippers, addressing a critical challenge in robot generalization.
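The split between intent and control can be illustrated with a minimal Python sketch. This is not LUCID's implementation: the `Intent` fields, the class names, the joint count, and the gripper width are all hypothetical, chosen only to show how one embodiment-agnostic intent could be consumed by two different embodiment-specific policies.

```python
from dataclasses import dataclass
from typing import List, Protocol, Sequence, Tuple


@dataclass(frozen=True)
class Intent:
    """Hypothetical embodiment-agnostic description of the next action."""
    target_xyz: Tuple[float, float, float]  # where the end effector should go
    grasp_closure: float                    # 0.0 = fully open, 1.0 = fully closed


class SensorimotorPolicy(Protocol):
    """Anything that can turn an Intent into robot-specific commands."""
    def act(self, intent: Intent) -> List[float]: ...


class IntentModel:
    """Stand-in for a model trained on human video (here just a fixed rule)."""

    def predict(self, observation: Sequence[float]) -> Intent:
        # Placeholder logic: treat the first three values as the target
        # position and always request a half-closed grasp.
        x, y, z = observation[:3]
        return Intent(target_xyz=(x, y, z), grasp_closure=0.5)


class ParallelJawPolicy:
    """Hypothetical gripper policy: position plus a single jaw width."""
    MAX_WIDTH_M = 0.08  # assumed maximum jaw opening in metres

    def act(self, intent: Intent) -> List[float]:
        width = self.MAX_WIDTH_M * (1.0 - intent.grasp_closure)
        return [*intent.target_xyz, width]


class DexterousHandPolicy:
    """Hypothetical hand policy: position plus one flexion angle per joint."""
    NUM_FINGER_JOINTS = 16  # assumed joint count
    MAX_JOINT_RAD = 1.5     # assumed maximum flexion in radians

    def act(self, intent: Intent) -> List[float]:
        flexion = self.MAX_JOINT_RAD * intent.grasp_closure
        return [*intent.target_xyz, *([flexion] * self.NUM_FINGER_JOINTS)]


def control_step(model: IntentModel,
                 policy: SensorimotorPolicy,
                 observation: Sequence[float]) -> List[float]:
    """One control step: shared intent model, embodiment-specific policy."""
    return policy.act(model.predict(observation))


if __name__ == "__main__":
    model = IntentModel()
    obs = [0.1, 0.2, 0.3, 0.0]
    print(control_step(model, ParallelJawPolicy(), obs))
    print(control_step(model, DexterousHandPolicy(), obs))
```

In this sketch, swapping the robot means swapping only the policy object passed to `control_step`; the intent model is untouched, which mirrors the reuse across embodiments described above.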
The system performs manipulation tasks such as stirring and wiping with supervision from internet video alone, and it transfers zero-shot to novel objects and scenes. Tasks like cable routing require only a small amount of additional data (one hour of smartphone video), suggesting the approach scales from settings with no task-specific data to settings where a modest amount is available. This flexibility makes LUCID particularly valuable for robot applications where high-quality training data remains prohibitively expensive.
For the robotics industry, LUCID's approach could accelerate the development of generalizable manipulation systems by reducing dependence on custom hardware-specific datasets. As robotic deployment increases across manufacturing and service sectors, the ability to leverage internet-scale human video as training data represents a meaningful step toward practical, scalable robot learning pipelines.
- LUCID learns robot skills from internet videos rather than expensive robot-specific demonstrations, dramatically reducing data collection costs.
- The intent-policy separation enables a single learned intent model to control different robot embodiments without retraining.
- Zero-shot transfer to novel objects and scenes demonstrates generalization beyond the training distribution.
- Tasks not covered by internet-scale data, such as cable routing, require only minimal additional data (one hour of smartphone video).
- The approach addresses a critical bottleneck in robot learning by decoupling skill learning from embodiment constraints.