KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis
KITE is a training-free system that converts long robot execution videos into compact, interpretable tokens for vision-language models to analyze robot failures. The approach combines keyframe extraction, open-vocabulary detection, and bird's-eye-view spatial representations to enable failure detection, identification, localization, and correction without requiring model fine-tuning.
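The keyframe-extraction stage can be illustrated with a minimal motion-energy heuristic: keep frames whose mean pixel difference from the previous frame exceeds a threshold. This is a hedged sketch of the general technique, not KITE's actual extractor; the function name, threshold, and frame format are illustrative assumptions.

```python
import numpy as np

def extract_keyframes(frames, threshold=10.0):
    """Select frames whose motion energy (mean absolute pixel
    difference from the previous frame) exceeds a threshold.
    A simple stand-in for a keyframe-extraction stage; KITE's
    actual method may differ."""
    keyframes = [0]  # always keep the first frame as an anchor
    for i in range(1, len(frames)):
        # signed arithmetic avoids uint8 wraparound on subtraction
        motion = np.abs(frames[i].astype(np.int16)
                        - frames[i - 1].astype(np.int16)).mean()
        if motion > threshold:
            keyframes.append(i)
    return keyframes

# Synthetic 8-frame grayscale clip: static until a scene change at frame 4.
frames = [np.zeros((32, 32), dtype=np.uint8) for _ in range(8)]
for f in frames[4:]:
    f[:] = 200  # large appearance change -> motion spike at frame 4

print(extract_keyframes(frames))  # → [0, 4]
```

Only the transition frame is retained, which is what makes the downstream token stream compact: long static stretches of a robot execution contribute nothing.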
KITE addresses a critical gap in robot autonomy: the ability to systematically analyze and learn from failures at scale. Traditional approaches to robot failure analysis either require manual annotation or rely on fine-tuned models that struggle with distribution shifts across different robots and environments. By creating a structured, interpretable intermediate representation between raw video and VLM reasoning, KITE makes failure analysis more accessible and transferable.

The system's architecture reveals important trends in AI systems design. Rather than training end-to-end, KITE leverages pre-trained components—motion detection, open-vocabulary vision models, and VLMs—through a carefully engineered tokenization scheme. This composable approach reflects a broader industry movement toward modular AI systems that can adapt without retraining. The bird's-eye-view representation in particular demonstrates sophisticated thinking about how to present spatial information in a form VLMs can reliably reason about.
For robotics development, KITE's training-free performance on RoboFAC benchmarks has immediate practical implications. Teams can deploy failure analysis pipelines without expensive annotation campaigns or model customization. The reported gains on simulation-to-real transfer are particularly significant, as sim-to-real gaps remain a major bottleneck in robot deployment. Qualitative results on dual-arm robots suggest the approach generalizes beyond single-arm scenarios.
Looking forward, this work validates VLMs as general-purpose reasoning engines for robotics when paired with appropriate input representations. The open-sourcing of code and models likely accelerates adoption across robotics labs, potentially establishing KITE-like pipelines as standard infrastructure for autonomous systems development and debugging.
- KITE enables training-free robot failure analysis by tokenizing video evidence into interpretable representations for off-the-shelf VLMs
- The system combines keyframe extraction, open-vocabulary detection, and spatial bird's-eye-view layouts to represent robot trajectories compactly
- Performance on RoboFAC benchmarks shows substantial improvements over vanilla VLM approaches, particularly for simulation-based failure detection
- Training-free design allows deployment across different robots and environments without task-specific fine-tuning or manual annotation
- Open-sourced implementation and models reduce barriers for robotics teams to adopt systematic failure analysis practices