GSAM: A Generalizable and Safe Robotic Framework for Articulated Object Manipulation
GSAM is a new robotic framework that improves articulated object manipulation through vision-based perception, VLM-based refinement with commonsense reasoning, and constraint-based planning to prevent collisions. In experiments across 50 hinge tasks, GSAM achieved 36% higher success rates and 3.1% lower standard deviation compared to existing baselines, demonstrating superior generalization and safety.
GSAM addresses a critical gap in robotic manipulation by tackling the challenge of generalizing across diverse articulated objects while maintaining safety during interaction. Traditional approaches relying solely on end-to-end learning, vision-motion planning, or language models struggle with the geometric complexity of different handle-object configurations and the risk of destructive collisions during manipulation attempts.
The framework's innovation lies in a modular architecture that chains three stages. A vision-based perceiver generates initial kinematic parameters. A fine-tuned VLM refiner then corrects those parameters using chain-of-thought reasoning, so the system applies commonsense knowledge instead of trusting raw perception outputs. This hybrid approach acknowledges that pure learning systems often miss practical constraints humans naturally consider. Finally, the interaction constraint function generator embeds knowledge about articulated objects, interaction geometry, and obstacle avoidance into a unified representation, which an LLM converts into actionable constraints for motion planning.
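To make the idea of an interaction constraint concrete, here is a minimal Python sketch of the kind of geometric check a motion planner could evaluate for a hinged door. It is not GSAM's implementation: the `HingeJoint` type, the function names, the spherical obstacle model, and the tolerance values are all assumptions chosen for illustration. The sketch rotates the handle about the hinge axis to obtain a waypoint, measures how far a gripper position lies from the arc the handle traces, and rejects positions that come too close to an obstacle.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class HingeJoint:
    """Hypothetical kinematic parameters of a revolute (hinge) joint."""
    origin: Vec3  # any point on the hinge axis
    axis: Vec3    # direction of the hinge axis (need not be unit length)


def _add(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def _sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _scale(a: Vec3, s: float) -> Vec3:
    return (a[0] * s, a[1] * s, a[2] * s)


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _norm(a: Vec3) -> float:
    return math.sqrt(_dot(a, a))


def rotate_about_hinge(point: Vec3, joint: HingeJoint, angle: float) -> Vec3:
    """Rotate `point` about the hinge axis by `angle` radians (Rodrigues' formula)."""
    k = _scale(joint.axis, 1.0 / _norm(joint.axis))
    v = _sub(point, joint.origin)
    rotated = _add(
        _add(_scale(v, math.cos(angle)), _scale(_cross(k, v), math.sin(angle))),
        _scale(k, _dot(k, v) * (1.0 - math.cos(angle))),
    )
    return _add(joint.origin, rotated)


def arc_residual(gripper: Vec3, handle: Vec3, joint: HingeJoint) -> float:
    """Distance from `gripper` to the circle the handle traces around the hinge."""
    k = _scale(joint.axis, 1.0 / _norm(joint.axis))
    v_h = _sub(handle, joint.origin)
    v_g = _sub(gripper, joint.origin)
    height_h, height_g = _dot(k, v_h), _dot(k, v_g)
    radius_h = _norm(_sub(v_h, _scale(k, height_h)))
    radius_g = _norm(_sub(v_g, _scale(k, height_g)))
    return math.hypot(radius_g - radius_h, height_g - height_h)


def clearance(point: Vec3, center: Vec3, radius: float) -> float:
    """Signed distance to a spherical obstacle; positive means no contact."""
    return _norm(_sub(point, center)) - radius


def satisfies_constraints(gripper: Vec3, handle: Vec3, joint: HingeJoint,
                          obstacles: Iterable[Tuple[Vec3, float]],
                          tol: float = 1e-3, margin: float = 0.02) -> bool:
    """True if the gripper stays on the handle arc and clear of every obstacle."""
    on_arc = arc_residual(gripper, handle, joint) <= tol
    safe = all(clearance(gripper, c, r) >= margin for c, r in obstacles)
    return on_arc and safe


# Illustrative values only: a vertical hinge at the origin, handle 0.5 m away.
joint = HingeJoint(origin=(0.0, 0.0, 0.0), axis=(0.0, 0.0, 1.0))
handle = (0.5, 0.0, 0.3)
waypoint = rotate_about_hinge(handle, joint, math.pi / 2)  # door opened 90 degrees
print(satisfies_constraints(waypoint, handle, joint, [((1.0, 1.0, 1.0), 0.1)]))
```

A planner built along these lines would sample opening angles, generate one waypoint per angle, and discard any waypoint for which `satisfies_constraints` returns `False`. In this simplified picture, errors in the estimated hinge axis shift the entire arc, which is one reason refining the perceived kinematic parameters before planning matters.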
The 36% improvement in manipulation success rate carries substantial implications for real-world robotics deployment. Service robots operating in homes and workplaces frequently encounter hinged objects such as cabinet doors, room doors, and appliance doors, so this generalization capability is directly applicable. The lower standard deviation indicates more consistent performance, which means fewer costly failures and less property damage. For roboticists and robot manufacturers, the result suggests that combining classical constraint-based planning with modern language model reasoning can outperform pure learning approaches in safety-critical scenarios. As robotic systems increasingly integrate into human environments, frameworks that prioritize interaction safety while maintaining generalization will become essential market differentiators.
- GSAM combines vision perception, VLM refinement with commonsense reasoning, and constraint-based planning for safer articulated object manipulation
- The framework achieved 36% higher success rates and 3.1% lower standard deviation than existing baselines across 50 hinge tasks
- VLM-based perception refinement using chain-of-thought reasoning improves accuracy beyond raw marker-based estimates
- Constraint function generation prevents destructive collisions by integrating articulated object properties and obstacle avoidance knowledge
- The modular architecture demonstrates that hybrid AI approaches combining language models with classical planning outperform end-to-end learning for safety-critical robotics tasks