AI · Neutral · arXiv – CS AI · 1d ago · 7/10
Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding
Researchers propose DIG, a training-free framework that improves long-form video understanding by adapting its frame selection strategy to the query type. It uses uniform sampling for global queries and a specialized selection procedure for localized queries, and is reported to outperform existing methods while scaling to 256 input frames.
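
The routing idea in the summary can be illustrated with a minimal Python sketch. This is not DIG's implementation: the summary does not say how query types are detected or how frames are chosen for localized queries. The sketch assumes the query type arrives as a ready-made label and that each frame already has a query-relevance score; the top-k rule is only a stand-in for the specialized selection, and every function name here is hypothetical.

```python
from typing import List, Sequence


def uniform_indices(num_frames: int, budget: int) -> List[int]:
    """Pick `budget` evenly spaced frame indices out of `num_frames`."""
    if num_frames <= 0 or budget <= 0:
        return []
    if budget >= num_frames:
        return list(range(num_frames))
    step = num_frames / budget
    # Take the midpoint of each of the `budget` equal-width segments.
    return [int(step * i + step / 2) for i in range(budget)]


def top_k_indices(scores: Sequence[float], budget: int) -> List[int]:
    """Pick the `budget` highest-scoring frames, returned in temporal order.

    Assumption: a plain top-k over relevance scores stands in for the
    specialized selection, which the summary does not describe.
    """
    if budget <= 0:
        return []
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def select_frames(query_type: str, scores: Sequence[float], budget: int) -> List[int]:
    """Route to a selection strategy by query type (hypothetical labels)."""
    if query_type == "global":
        # Global queries: cover the whole video evenly, ignoring the scores.
        return uniform_indices(len(scores), budget)
    if query_type == "localized":
        # Localized queries: concentrate the budget on relevant frames.
        return top_k_indices(scores, budget)
    raise ValueError(f"unknown query type: {query_type!r}")
```

A call such as `select_frames("global", scores, 256)` would spread a 256-frame budget across the whole video, while `"localized"` would spend the same budget on the highest-scoring frames.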