AI · Neutral · arXiv – CS AI · 10h ago · 6/10
Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models
Researchers introduce STEMO-Bench, a benchmark for evaluating video understanding in multimodal large language models (MLLMs), and propose STEMO-Track, a framework that reduces hallucinations by explicitly tracking object identities and states across time. The work addresses a critical limitation in current video AI systems: their inability to persistently monitor objects and temporal relationships in dynamic scenes.