Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models
Researchers introduce STEMO-Bench, a benchmark for evaluating video understanding in multimodal large language models (MLLMs), and propose STEMO-Track, a framework that reduces hallucinations by explicitly tracking object identities and states across time. The work addresses a critical limitation in current video AI systems: their inability to persistently monitor objects and temporal relationships in dynamic scenes.
Current multimodal large language models struggle with video understanding despite recent advances, primarily due to weak spatio-temporal reasoning capabilities. This research identifies that existing evaluation methods mask these deficiencies by using single final-answer metrics that can be satisfied through superficial pattern matching or statistical shortcuts rather than genuine temporal understanding. The STEMO-Bench benchmark changes this by decomposing video queries into intermediate sub-questions that require persistent object tracking, forcing models to demonstrate consistent reasoning across temporal sequences.
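To make the decomposition idea concrete, here is a minimal sketch of what a decomposed benchmark item might look like. All names (`SubQuestion`, `DecomposedQuery`, `chain_score`) and the scoring weights are illustrative assumptions, not STEMO-Bench's actual schema or metric:

```python
from dataclasses import dataclass, field

@dataclass
class SubQuestion:
    time_span: tuple[float, float]  # seconds into the video (assumed granularity)
    question: str
    answer: str

@dataclass
class DecomposedQuery:
    final_question: str
    final_answer: str
    sub_questions: list[SubQuestion] = field(default_factory=list)

    def chain_score(self, step_preds: list[str], final_pred: str) -> float:
        """Credit the final answer only alongside consistent intermediate steps.

        A model that guesses the final answer via statistical shortcuts,
        while failing the tracking sub-questions, scores low on the chain.
        (The 0.5/0.5 weighting is a placeholder, not the paper's metric.)
        """
        steps_ok = [p == sq.answer for p, sq in zip(step_preds, self.sub_questions)]
        step_acc = sum(steps_ok) / len(steps_ok) if steps_ok else 1.0
        final_ok = 1.0 if final_pred == self.final_answer else 0.0
        return 0.5 * step_acc + 0.5 * final_ok

# Toy item: a shell-game clip where answering correctly requires tracking.
q = DecomposedQuery(
    final_question="Which cup hides the ball at the end?",
    final_answer="left cup",
    sub_questions=[
        SubQuestion((0.0, 2.0), "Where does the ball start?", "middle cup"),
        SubQuestion((2.0, 5.0), "Which cups are swapped?", "middle and left"),
    ],
)
print(q.chain_score(["middle cup", "middle and left"], "left cup"))  # 1.0
print(q.chain_score(["left cup", "no swap"], "left cup"))            # 0.5
```

The point of the structure is that a lucky final answer alone cannot earn full credit: the intermediate sub-questions expose whether the model actually followed the object through time.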
The problem stems from how MLLMs process video information. Without explicit mechanisms to track individual objects through time, these models rely on local visual cues within individual frames, leading to hallucinations when queries require understanding state changes or causal relationships across multiple moments. This reflects a broader challenge in AI: bridging the gap between pattern recognition and genuine scene comprehension.
The proposed STEMO-Track framework addresses this by constructing explicit object trajectories through chunk-wise state extraction and temporal aggregation. This architectural approach forces the model to maintain coherent object identities and track their properties systematically. Experimental results demonstrate significant reductions in hallucinated outputs and improved consistency in spatio-temporal reasoning. For developers building video understanding systems, this work provides both diagnostic tools and architectural improvements relevant to applications in autonomous systems, video analysis, and multimodal AI deployment.
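The aggregation step described above can be sketched as follows. This is a minimal illustration of the idea, not STEMO-Track's implementation: per-chunk object states (in practice produced by MLLM calls over short video segments) are merged into ordered per-identity trajectories, so answers are derived from an object's full history rather than a single frame:

```python
from collections import defaultdict

def aggregate_trajectories(chunk_states: list[dict[str, str]]) -> dict[str, list[tuple[int, str]]]:
    """Merge per-chunk {object_id: state} dicts into per-object trajectories.

    Each trajectory is a time-ordered list of (chunk_index, state) pairs,
    giving a persistent record of one object's identity across the video.
    """
    trajectories: dict[str, list[tuple[int, str]]] = defaultdict(list)
    for t, states in enumerate(chunk_states):
        for obj_id, state in states.items():
            trajectories[obj_id].append((t, state))
    return dict(trajectories)

# Toy run: states extracted chunk-by-chunk for the shell-game example.
chunk_states = [
    {"cup_A": "middle position", "ball": "under cup_A"},
    {"cup_A": "left position", "ball": "under cup_A"},
    {"cup_A": "left position", "ball": "under cup_A"},
]
traj = aggregate_trajectories(chunk_states)
print(traj["cup_A"])
# [(0, 'middle position'), (1, 'left position'), (2, 'left position')]
```

Grounding the answer in `traj["cup_A"]` instead of the final frame is what blocks the single-frame shortcut: the state change between chunks 0 and 1 is recorded explicitly rather than left for the model to infer or hallucinate.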
- Current video MLLMs hallucinate due to inadequate spatio-temporal object tracking rather than fundamental architectural limitations.
- STEMO-Bench's decomposition method reveals hidden weaknesses by requiring intermediate reasoning steps rather than single-answer evaluations.
- Object-centric trajectory tracking significantly reduces hallucinations and improves temporal reasoning consistency.
- Existing benchmarks may overestimate video understanding capabilities by allowing models to exploit statistical priors.
- Explicit structured object tracking is more effective than implicit temporal modeling for dynamic scene understanding.