AI · Neutral · arXiv – CS AI · 11h ago · 6/10
EgoExo-Con: Exploring View-Invariant Video Temporal Understanding
Researchers introduce EgoExo-Con, a benchmark testing whether video language models maintain consistent temporal understanding across different camera viewpoints of the same event. The study reveals that existing Video-LLMs struggle with cross-view consistency and proposes View-GRPO, a reinforcement learning framework to improve temporal reasoning across viewpoints.