Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events
Researchers introduce Moment-Video, a benchmark revealing that current video multimodal large language models (MLLMs) struggle to understand brief, momentary visual events that last only a few frames. Testing 33 models shows the best achieves only 39.6% accuracy, exposing a critical gap in temporal fidelity that persists despite advances in general video understanding.
Video MLLMs have demonstrated impressive capabilities in processing and understanding long-form video content, yet a fundamental weakness has gone largely unexamined: their ability to capture fleeting visual moments that are critical to answering specific questions. The Moment-Video benchmark directly addresses this gap by testing 33 models—both proprietary systems like Seed-2.0-Pro and open-source alternatives—on their capacity to detect, count, and reason about transient visual events. The results are sobering, with the top performer achieving only 39.6% accuracy and most open-source models falling below 25%, indicating a systemic limitation in how these systems process temporal information.
This weakness stems from architectural decisions that optimize for efficiency rather than precision. Sparse frame sampling, visual-token compression, and coarse temporal aggregation all prioritize computational speed but sacrifice the fine-grained temporal awareness needed to capture momentary events. The benchmark's diagnostic analysis reveals that denser frame sampling provides marginal improvements but fails to eliminate the core bottleneck, suggesting the problem runs deeper than sampling strategy alone.
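To make the sampling argument concrete, here is a minimal sketch that estimates how often a uniformly sampled frame lands inside a short event. It is an illustration, not the benchmark's evaluation code. The function names, the centre-of-segment sampling scheme, and the numbers used (30 fps, a 30-frame budget, a 3-frame event) are all assumptions chosen for the example.

```python
from typing import List, Sequence


def sampled_indices(total_frames: int, num_samples: int) -> List[int]:
    """Uniformly spaced frame indices: the centre of each equal-width segment.

    Hypothetical sampling scheme for illustration; real models differ.
    """
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def event_is_sampled(indices: Sequence[int], start: int, length: int) -> bool:
    """True if at least one sampled frame falls inside [start, start + length)."""
    return any(start <= i < start + length for i in indices)


def capture_rate(total_frames: int, num_samples: int, event_length: int) -> float:
    """Fraction of possible event positions that contain a sampled frame."""
    indices = sampled_indices(total_frames, num_samples)
    starts = range(total_frames - event_length + 1)
    hits = sum(event_is_sampled(indices, s, event_length) for s in starts)
    return hits / len(starts)


if __name__ == "__main__":
    # Assumed setup: 30 fps video, a 3-frame event, varying length and budget.
    print(capture_rate(900, 30, 3))    # 30 s clip, 30 sampled frames
    print(capture_rate(9000, 30, 3))   # 5 min clip, same 30-frame budget
    print(capture_rate(900, 300, 3))   # 30 s clip, 300 sampled frames
```

Under these assumptions, a 3-frame event in a 30-second clip is seen about 10% of the time with a 30-frame budget, and about 1% of the time once the clip grows to five minutes with the same budget. Sampling every third frame brings the rate to 100%, but that only guarantees the event reaches the model's input. It says nothing about whether token compression or temporal aggregation preserves it, which is consistent with the finding that denser sampling helps only marginally.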
For developers building video AI applications, the results point to a concrete reliability risk. Applications that depend on accurately detecting brief actions, from surveillance systems to autonomous vehicle perception, cannot rely on current MLLMs for that task until architectures improve. The finding that longer videos make temporal localization harder compounds the problem, because real-world deployments often involve extended footage. Temporally faithful representations should therefore be a priority before these models are deployed in safety-critical scenarios.
- Current video MLLMs achieve below 40% accuracy on momentary visual events despite strong general video understanding capabilities.
- Sparse frame sampling and visual compression are primary culprits, but denser sampling alone cannot resolve the temporal fidelity gap.
- Open-source models significantly underperform proprietary systems, with most scoring below 25% on the Moment-Video benchmark.
- Brief but critical visual evidence lasting only a few frames represents a fundamental weakness that language-side reasoning cannot compensate for.
- Extended video length amplifies temporal-localization challenges, creating practical limitations for real-world deployment scenarios.