AIBullisharXiv โ CS AI ยท 7h ago6/10
๐ง
KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis
KITE is a training-free system that converts long robot execution videos into compact, interpretable tokens for vision-language models to analyze robot failures. The approach combines keyframe extraction, open-vocabulary detection, and bird's-eye-view spatial representations to enable failure detection, identification, localization, and correction without requiring model fine-tuning.