Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events
Researchers introduce Moment-Video, a benchmark showing that current video multimodal large language models (MLLMs) struggle to understand momentary visual events that last only a few frames. Of the 33 models tested, the best achieves only 39.6% accuracy, exposing a critical gap in temporal fidelity that persists despite advances in general video understanding.