AINeutralarXiv – CS AI · 7h ago6/10
🧠
Natural-Language Temporal Grounding in Hour-Long Videos is a Search Problem: A Benchmark and Empirical Decomposition
Researchers introduce ExtremeWhenBench, a benchmark for temporal grounding in hour-long videos using natural language queries. The study reveals that video-language models fail dramatically on long-form content because search—not recognition—is the bottleneck, with a hybrid retrieve-then-ground approach recovering 6.7x performance over monolithic models.