CLASP: Language-Driven Robot Skill Selection and Composition using Task-Parameterized Learning
CLASP is a modular robotic system that combines task-parameterized learning with vision-language models to enable robots to understand natural language commands while maintaining data efficiency. The approach achieves 73-100% success rates on manipulation tasks by learning skills from minimal demonstrations and composing them dynamically without fine-tuning the underlying models.
CLASP addresses a fundamental challenge in robotics: bridging the gap between data-efficient skill learning and intuitive natural language interaction. Foundation models such as vision-language models (VLMs) and vision-language-action models (VLAs) provide natural language grounding but demand extensive training data, while task-parameterized imitation learning needs only a handful of demonstrations but lacks language understanding. The work argues that neither approach alone is sufficient; instead, a hybrid architecture that pairs a pretrained model's language capabilities with a data-efficient skill learner yields a more practical system.
The technical contribution centers on combining task-parameterized kernelized movement primitives (TP-KMPs) with pretrained vision-language models. During the learning phase, robots acquire skills from just 2-5 kinesthetic demonstrations, with the VLM automatically generating schema descriptions of parameters and preconditions. Crucially, this process requires no fine-tuning of the foundation model, preserving its general knowledge while specializing its application.
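To make the idea of a skill schema concrete, the following is a minimal sketch of what such a record might hold and how preconditions could be checked before execution. The summary does not give the schema format, so the `SkillSchema` class, its field names, the `unmet_preconditions` helper, and the example "pour" skill are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class SkillSchema:
    """Hypothetical schema record; field names are assumptions, not CLASP's format."""
    name: str
    description: str
    parameters: Dict[str, str]            # parameter name -> what it refers to
    preconditions: List[str] = field(default_factory=list)
    num_demonstrations: int = 0           # the text reports 2-5 per skill


def unmet_preconditions(schema: SkillSchema, world_facts: Set[str]) -> List[str]:
    """Return the preconditions of `schema` that are not in `world_facts`."""
    return [p for p in schema.preconditions if p not in world_facts]


# Illustrative skill; in the described system a VLM would generate this text.
pour = SkillSchema(
    name="pour",
    description="Tilt a held container over a target container.",
    parameters={
        "source": "frame of the held container",
        "target": "frame of the receiving container",
    },
    preconditions=["source_in_gripper", "target_visible"],
    num_demonstrations=3,
)

missing = unmet_preconditions(pour, {"target_visible"})
print(missing)
```

Keeping the schema as plain text fields is one plausible way to let a frozen VLM read and reason over skills without any fine-tuning, since the model only ever sees descriptions, not policy weights.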
The execution pipeline showcases practical reasoning: the VLM interprets natural language commands to select appropriate skills, bind parameters to task specifics, and compose multiple skills into novel behaviors through covariance weighting. When capability gaps emerge, the system identifies exactly which demonstrations are needed, enabling targeted active learning rather than generic data collection.
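The summary does not spell out the covariance-weighting rule, so the sketch below shows one common interpretation: precision-weighted fusion (a product of Gaussians), in which each skill's predicted waypoint is weighted by the inverse of its variance. It assumes diagonal covariances so that it runs with the standard library alone; the function name `fuse_gaussians` and the numbers are illustrative, not taken from the paper.

```python
from typing import List, Sequence, Tuple


def fuse_gaussians(
    means: Sequence[Sequence[float]],
    variances: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Fuse per-skill Gaussian predictions for one waypoint.

    Assumes diagonal covariances: variances[i][d] is the variance of skill i
    in dimension d. A skill with low variance in a dimension dominates there.
    """
    dims = len(means[0])
    fused_mean: List[float] = []
    fused_var: List[float] = []
    for d in range(dims):
        precisions = [1.0 / v[d] for v in variances]   # inverse variances
        total = sum(precisions)
        fused_var.append(1.0 / total)
        weighted = sum(p * m[d] for p, m in zip(precisions, means))
        fused_mean.append(weighted / total)
    return fused_mean, fused_var


# Two hypothetical skills predicting a 2-D waypoint.
# Skill A is very certain about dimension 1; skill B is equally vague in both.
mean, var = fuse_gaussians(
    means=[[0.0, 0.0], [1.0, 1.0]],
    variances=[[1.0, 0.01], [1.0, 1.0]],
)
print(mean, var)
```

In dimension 0 the two skills are equally uncertain, so the fused mean lands halfway between them; in dimension 1 the fused mean stays close to skill A, which is the behavior one would want when a skill is tightly constrained along part of its trajectory.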
For the robotics industry, CLASP represents progress toward more accessible robot programming. The 73-100% success rates across skill selection, composition, and active learning scenarios suggest practical viability. The approach eases the usual trade-off between sample efficiency and interpretability, letting robots work with limited data while responding to natural human commands, a critical requirement for real-world deployment in varied environments.
- CLASP combines task-parameterized learning with pretrained VLMs to achieve both data efficiency and natural language grounding without fine-tuning.
- The system learns manipulation skills from just 2-5 kinesthetic demonstrations, generating skill schemas automatically via vision-language models.
- Novel task behaviors are created through covariance-weighted composition of existing skills, expanding capability beyond learned primitives.
- Active learning identifies capability gaps and requests targeted demonstrations, optimizing data collection efficiency.
- Validation on a 7-DoF manipulator demonstrates 73-100% success rates across skill selection, composition, and continuous learning scenarios.