
Natural-Language Temporal Grounding in Hour-Long Videos is a Search Problem: A Benchmark and Empirical Decomposition

arXiv – CS AI | Sukmin Seo, Geewook Kim
AI Summary

Researchers introduce ExtremeWhenBench, a benchmark for temporal grounding in hour-long videos using natural-language queries. The study finds that video-language models fail dramatically on long-form content because search, not recognition, is the bottleneck. A hybrid retrieve-then-ground approach reaches 6.7x the performance of monolithic models.

Analysis

This research identifies a fundamental architectural limitation in how current video-language models process extended content. Video-LLMs do not mainly struggle to localize events within nearby frames; they fail at search, that is, at finding the relevant segment within an hour-long video. The ExtremeWhenBench dataset contains 2,273 queries across 194 videos averaging 75.7 minutes, and it provides empirical evidence that 85% of model failures stem from search failures rather than recognition errors.
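
To make the search-versus-localization split concrete, here is a minimal sketch of how a single prediction could be bucketed. The rule is an assumption for illustration, not the paper's stated criterion: a prediction with no temporal overlap with the ground truth counts as a search failure, and one that overlaps but falls below an IoU threshold counts as a localization failure. The names `temporal_iou`, `classify_prediction`, and the 0.5 threshold are hypothetical.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Interval, gold: Interval) -> float:
    """Intersection over union of two time intervals, in [0, 1]."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def classify_prediction(pred: Interval, gold: Interval, iou_ok: float = 0.5) -> str:
    """Label a prediction as 'correct', 'localization', or 'search'.

    Assumed rule (not taken from the paper): zero overlap means the model
    looked in the wrong part of the video; partial overlap below the
    threshold means it found the region but placed the boundaries poorly.
    """
    iou = temporal_iou(pred, gold)
    if iou >= iou_ok:
        return "correct"
    return "search" if iou == 0.0 else "localization"


if __name__ == "__main__":
    gold = (100.0, 110.0)
    print(classify_prediction((100.0, 110.0), gold))    # exact match
    print(classify_prediction((100.0, 102.0), gold))    # right region, poor boundaries
    print(classify_prediction((3000.0, 3010.0), gold))  # wrong part of the video
```

Under this rule, counting the "search" labels over all failed queries would give a figure comparable in spirit to the 85% reported above, though the paper may define the split differently.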

The findings mirror established patterns in open-domain question answering, where retrieve-then-read architectures consistently outperform end-to-end approaches. The complete collapse of monolithic Video-LLMs on this task suggests that scaling alone, meaning simply increasing model capacity, is insufficient for long-form video understanding. A simple frame-level retrieval baseline outperforms these sophisticated models, which indicates that decomposing the problem into a search stage and a grounding stage is more effective than unified processing.
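
The two-stage decomposition can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: frame and query embeddings are assumed to come from some shared text-image encoder (not shown), stage 1 picks the fixed-length window with the highest mean cosine similarity, and stage 2 is a simple threshold placeholder standing in for whatever grounding model would refine the boundaries. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def retrieve_window(sims: List[float], window: int) -> int:
    """Stage 1 (search): start index of the window with the highest summed similarity."""
    running = sum(sims[:window])
    best_start, best_sum = 0, running
    for start in range(1, len(sims) - window + 1):
        # Slide the window by one frame: add the new frame, drop the old one.
        running += sims[start + window - 1] - sims[start - 1]
        if running > best_sum:
            best_start, best_sum = start, running
    return best_start


def ground_in_window(sims: List[float], start: int, window: int, fps: float) -> Tuple[float, float]:
    """Stage 2 (grounding): placeholder that spans the frames scoring at or above the window mean."""
    local = sims[start:start + window]
    # Cap the threshold at the maximum so rounding can never leave `keep` empty.
    threshold = min(sum(local) / len(local), max(local))
    keep = [i for i, s in enumerate(local) if s >= threshold]
    return (start + keep[0]) / fps, (start + keep[-1] + 1) / fps


def retrieve_then_ground(frame_embs, query_emb, window: int, fps: float) -> Tuple[float, float]:
    """Score every frame once, search for the best window, then ground inside it."""
    sims = [cosine(f, query_emb) for f in frame_embs]
    start = retrieve_window(sims, window)
    return ground_in_window(sims, start, window, fps)


if __name__ == "__main__":
    # Toy video: 100 frames at 1 fps; only frames 40-49 match the query direction.
    frames = [[0.0, 1.0]] * 40 + [[1.0, 0.0]] * 10 + [[0.0, 1.0]] * 50
    print(retrieve_then_ground(frames, [1.0, 0.0], window=10, fps=1.0))
```

The design point is that the expensive grounding step only ever sees a short window, so its cost no longer grows with video length; the cheap similarity scan handles the hour-scale search.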

This work has implications for video understanding systems deployed in practical applications such as video search engines, security monitoring, and content curation platforms. Organizations building these systems should consider two-stage architectures rather than relying on ever larger vision-language models. The 6.7x gain from the hybrid approach suggests that architectural choices can matter more than raw model capacity.

Future research should explore efficient retrieval mechanisms for video, potentially leveraging semantic embeddings, temporal indexing, or sparse sampling strategies. The benchmark itself is valuable infrastructure for the community: it offers an evaluation for long-form video understanding on which current end-to-end approaches systematically fail.

Key Takeaways
  • Search capability, not visual recognition, is the limiting factor for hour-scale video temporal grounding.
  • Monolithic Video-LLMs completely fail on ExtremeWhenBench while simple retrieval baselines outperform them significantly.
  • A hybrid retrieve-then-ground approach reaches 6.7x the performance of end-to-end video-language models.
  • 85% of model failures on long videos are attributed to search problems rather than localization errors.
  • The retrieve-then-read pattern from open-domain QA effectively transfers to long-form video understanding tasks.