#video-understanding News & Analysis

67 articles tagged with #video-understanding. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

P-JEPA: Procedural Video Representation Learning via Joint Embedding Predictive Architecture

Researchers propose P-JEPA, a new video representation learning architecture that processes procedural videos over 30 minutes long by reducing complexity through dense action prediction. The method achieves state-of-the-art results on multiple benchmarks while using significantly fewer parameters than LLM-based approaches and enabling real-time inference.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

XmoPipe: A Pipeline for Large-Scale In-the-Wild Human Motion Dataset Construction

XmoPipe is a scalable pipeline that constructs large-scale human motion datasets by extracting 3D body and facial motion from unconstrained online videos, combined with automated textual descriptions. The system demonstrates that motion models trained on this in-the-wild data achieve performance comparable to traditional marker-based motion capture datasets while offering superior scalability and diversity.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

VideoLatent: Video-Language Learning via Latent Self-Forcing

Researchers introduce VideoLatent, a multimodal language model that performs efficient visual reasoning on videos without requiring labor-intensive chain-of-thought annotations. The model uses a novel latent self-forcing training paradigm and achieves superior performance across 14 benchmarks while reducing computational overhead by 6-68x compared to existing methods.

AI · Bullish · arXiv – CS AI · Jun 11 · 7/10

LUCID: Learning Embodiment-Agnostic Intent Models from Unstructured Human Videos for Scalable Dexterous Robot Skill Acquisition

LUCID is a machine learning framework that learns robot manipulation skills from unstructured internet videos and human demonstrations, then transfers this knowledge to different robot embodiments through a shared intent model. The approach eliminates the need for expensive, embodiment-specific robot training data and demonstrates zero-shot transfer capabilities across multiple real-world tasks.

AI · Bearish · arXiv – CS AI · Jun 9 · 7/10

When No Answer Is Correct: Diagnosing Absent Answer Detection for MLLMs in Video Understanding

Researchers have identified a critical reliability flaw in multimodal large language models (MLLMs) used for video understanding: when the correct answer is absent from available options, these models fail to recognize it and instead select plausible incorrect alternatives. Testing across multiple models and benchmarks reveals this limitation is especially severe in temporal reasoning tasks and worsens with increased video frame sampling, with chain-of-thought prompting offering only partial mitigation.

AI · Bullish · arXiv – CS AI · Jun 8 · 7/10

Don't Pause: Streaming Video-Language Synchrony for Online Video Understanding

Researchers introduce LyraV, a streaming video-language model that maintains real-time synchronization between video perception and language generation without pausing. The system uses a hierarchical control framework with two key components, a Frame-Driven Transition Controller and a Streaming Token Pacer, to interleave video frames with generated tokens at 3.89 FPS with 98.29% synchrony.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

Researchers introduce GeoVR, a framework that enhances multimodal large language models with 3D spatial awareness by learning geometric representations from 2D video sequences. Using four complementary geometric targets including camera pose estimation, depth mapping, and 3D feature distillation, the approach achieves state-of-the-art performance on spatial reasoning benchmarks without requiring large-scale 3D training data.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding

Researchers introduce Active Video Perception (AVP), an AI framework that enables agents to actively seek relevant evidence in long videos rather than passively processing entire content. The system uses an iterative plan-observe-reflect process to achieve superior accuracy on five benchmarks while reducing inference time by 82% and token usage by 88% compared to existing agentic methods.

AI · Neutral · arXiv – CS AI · Jun 4 · 7/10

M³Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

Researchers introduce M³Eval, the first comprehensive benchmark for evaluating memory capabilities in multi-modal AI models processing long-form video. Testing across multiple models reveals significant weaknesses in disentangled representation, temporal information handling, and symbolic memory, highlighting memory as a critical yet understudied dimension of AI development.

AI · Bearish · arXiv – CS AI · Jun 2 · 7/10

PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning

Researchers introduce PaSBench-Video, a 740-video benchmark designed to evaluate multimodal large language models' ability to issue timely safety warnings in streaming video scenarios. Testing 13 MLLMs reveals that no model exceeds 20% accuracy on strict metrics, with models struggling to distinguish emerging hazards from routine activities, particularly in driving scenarios where safe and dangerous scenes appear visually similar.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

StreamingVLM: Real-Time Understanding for Infinite Video Streams

Researchers introduce StreamingVLM, a vision-language model designed to process infinite video streams in real time without excessive computational costs. The model uses a compact KV cache and supervised fine-tuning on overlapped video chunks to maintain stable performance up to 8 FPS, outperforming GPT-4o mini on a new benchmark featuring videos over two hours long.

🏢 Nvidia · 🧠 GPT-4

AI · Bullish · arXiv – CS AI · May 28 · 7/10

Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers

Researchers introduce Tensor Memory, a fixed-size recurrent module that augments Transformers with persistent 3D spatial state for improved long-sequence processing. The approach enables better video understanding and occlusion reasoning by decoupling memory capacity from input length while maintaining computational efficiency.

AI · Neutral · arXiv – CS AI · May 28 · 7/10

EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents

Researchers introduce EgoBench, a new benchmark for evaluating AI agents' ability to perceive visual information, reason through multi-step tasks, and interact with users in real-world scenarios. Testing eight state-of-the-art video models reveals significant limitations, with the best performer achieving only 30.62% accuracy, exposing critical gaps in current AI agent capabilities.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Bridging Modalities, Spanning Time: Structured Memory for Ultra-Long Agentic Video Reasoning

Researchers introduce MAGIC-Video, a training-free framework that enables multimodal AI systems to process and reason about ultra-long videos spanning days or weeks by combining a structured memory graph with narrative chains. The system outperforms existing baselines on multiple benchmarks, addressing a critical limitation where current LLMs can only handle tens of minutes of video despite having million-token context windows.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

HY-Himmel Technical Report: Hierarchical Interleaved Multi-stream Motion Encoding for Long Video Understanding

Researchers introduce HY-Himmel, a hierarchical video-language framework that efficiently processes long videos by separating semantic and motion encoding tasks. The system uses sparse keyframes for visual grounding while a lightweight adapter extracts motion information from compressed video data, achieving better performance than dense-frame baselines while reducing token usage by 3.6x.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models

Researchers introduce Video Understanding Reward Bench (VURB), a comprehensive benchmark with 2,100 preference pairs for evaluating video reward models, alongside VUP-35K, a large-scale dataset of 35,000 preference examples. Two new models, VideoDRM and VideoGRM, achieve state-of-the-art performance on video understanding tasks, advancing multimodal AI capabilities beyond text and images.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance

TimeRewarder is a new machine learning method that learns dense reward signals from passive videos to improve reinforcement learning in robotics. By modeling temporal distances between video frames, the approach achieves 90% success rates on Meta-World tasks using significantly fewer environment interactions than prior methods, while also leveraging human videos for scalable reward learning.

AI · Neutral · arXiv – CS AI · Mar 26 · 7/10

Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding

Researchers propose DIG, a training-free framework that improves long-form video understanding by adapting frame selection strategies based on query types. The system uses uniform sampling for global queries and specialized selection for localized queries, achieving better performance than existing methods while scaling to 256 input frames.
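
The routing idea in this summary can be illustrated with a minimal sketch. It is not DIG's actual implementation: the function names, the `"global"` query-type label, and the per-frame relevance `scores` are all hypothetical, and top-k selection merely stands in for the "specialized selection" the summary mentions without detail.

```python
from typing import List, Sequence


def uniform_frame_indices(num_frames: int, budget: int) -> List[int]:
    """Pick `budget` frame indices spread evenly over the video."""
    if num_frames <= 0 or budget <= 0:
        return []
    budget = min(budget, num_frames)
    # Take the centre of each of `budget` equal-width segments.
    return [int((i + 0.5) * num_frames / budget) for i in range(budget)]


def top_k_frame_indices(scores: Sequence[float], budget: int) -> List[int]:
    """Pick the `budget` highest-scoring frames, returned in temporal order.

    Assumption: `scores` holds one query-relevance score per frame,
    produced by some external scorer that is not modelled here.
    """
    if budget <= 0:
        return []
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def select_frames(query_type: str, scores: Sequence[float], budget: int) -> List[int]:
    """Route to a sampling strategy based on a hypothetical query-type label."""
    if query_type == "global":
        return uniform_frame_indices(len(scores), budget)
    return top_k_frame_indices(scores, budget)
```

For example, `select_frames("global", [0.0] * 8, 4)` spreads four indices across the eight frames, while any other label falls back to the relevance-ranked frames.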

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 7

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Molmo2 is a new open-source family of vision-language models that achieves state-of-the-art performance among open models, particularly excelling in video understanding and pixel-level grounding tasks. The research introduces 7 new video datasets and 2 multi-image datasets collected without using proprietary VLMs, along with an 8B parameter model that outperforms existing open-weight models and even some proprietary models on specific tasks.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

MotionHalluc: Diagnosing Kinematic Hallucinations in Fine-Grained Motion Reasoning

Researchers introduce MotionHalluc, a benchmark dataset for evaluating how AI models hallucinate when analyzing motion differences between paired videos. The study reveals that large multimodal models struggle with directional, attributional, and temporal hallucinations in motion reasoning, but shows that injecting explicit kinematic measurements can improve accuracy by 10.6%.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

HPP: Hierarchical Programmatic Probing for Long Video Understanding by Decoupling Perception and Reasoning

Researchers introduce Hierarchical Programmatic Probing (HPP), a framework that separates visual perception from temporal reasoning in long video understanding by enabling coding-capable language models to iteratively probe videos through programmatic exploration. The approach decouples perception and reasoning tasks that traditional vision-language models attempt to handle simultaneously, demonstrating significant improvements across multiple long-video benchmarks including LongVideoBench, EgoSchema, and VideoMME.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Rethinking Object-Centric Representations for Video Dynamics Modeling

Researchers introduce STAITUS, a machine learning framework that improves unsupervised video object tracking by explicitly separating appearance features from geometric pose information in slot-based representations. The approach addresses a fundamental problem where enforcing temporal consistency causes models to mistrack moving objects and fragment identities, achieving superior performance on tracking stability and segmentation quality.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

EgoExo-Con: Exploring View-Invariant Video Temporal Understanding

Researchers introduce EgoExo-Con, a benchmark testing whether video language models maintain consistent temporal understanding across different camera viewpoints of the same event. The study reveals that existing Video-LLMs struggle with cross-view consistency and proposes View-GRPO, a reinforcement learning framework to improve temporal reasoning across viewpoints.

AI · Bullish · arXiv – CS AI · Jun 11 · 6/10

Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning

Researchers propose ReRe, a training-free framework that improves spatial reasoning in egocentric videos by having multimodal AI models first form a hypothesis, then revise it using synthesized novel viewpoints. The approach demonstrates significant performance gains on spatial reasoning benchmarks without modifying existing model architectures.

AI · Bullish · arXiv – CS AI · Jun 11 · 6/10

MultiToP: Learning to Patch Visual Tokens to Mitigate Hallucinations in Video Large Multimodal Models

Researchers introduce MultiToP, a framework that reduces hallucinations in video language models by selectively replacing unreliable visual tokens before text generation. The method achieves a 50.60% F1 score improvement on hallucination benchmarks while maintaining general video understanding performance, demonstrating that targeted token refinement can enhance multimodal AI reliability without modifying base models.
