#video-understanding News & Analysis

36 articles tagged with #video-understanding. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · 4d ago · 7/10

EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents

Researchers introduce EgoBench, a new benchmark for evaluating AI agents' ability to perceive visual information, reason through multi-step tasks, and interact with users in real-world scenarios. Across eight state-of-the-art video models, the best performer reaches only 30.62% accuracy, exposing critical gaps in current AI agent capabilities.

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers

Researchers introduce Tensor Memory, a fixed-size recurrent module that augments Transformers with persistent 3D spatial state for improved long-sequence processing. The approach enables better video understanding and occlusion reasoning by decoupling memory capacity from input length while maintaining computational efficiency.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Bridging Modalities, Spanning Time: Structured Memory for Ultra-Long Agentic Video Reasoning

Researchers introduce MAGIC-Video, a training-free framework that enables multimodal AI systems to process and reason about ultra-long videos spanning days or weeks by combining a structured memory graph with narrative chains. The system outperforms existing baselines on multiple benchmarks, addressing a critical limitation where current LLMs can only handle tens of minutes of video despite having million-token context windows.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

HY-Himmel Technical Report: Hierarchical Interleaved Multi-stream Motion Encoding for Long Video Understanding

Researchers introduce HY-Himmel, a hierarchical video-language framework that efficiently processes long videos by separating semantic and motion encoding tasks. The system uses sparse keyframes for visual grounding while a lightweight adapter extracts motion information from compressed video data, achieving better performance than dense-frame baselines while reducing token usage by 3.6x.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models

Researchers introduce Video Understanding Reward Bench (VURB), a comprehensive benchmark with 2,100 preference pairs for evaluating video reward models, alongside VUP-35K, a large-scale dataset of 35,000 preference examples. Two new models, VideoDRM and VideoGRM, achieve state-of-the-art performance on video understanding tasks, advancing multimodal AI capabilities beyond text and images.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance

TimeRewarder is a new machine learning method that learns dense reward signals from passive videos to improve reinforcement learning in robotics. By modeling temporal distances between video frames, the approach achieves 90% success rates on Meta-World tasks using significantly fewer environment interactions than prior methods, while also leveraging human videos for scalable reward learning.

AI · Neutral · arXiv – CS AI · Mar 26 · 7/10

Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding

Researchers propose DIG, a training-free framework that improves long-form video understanding by adapting frame selection strategies based on query types. The system uses uniform sampling for global queries and specialized selection for localized queries, achieving better performance than existing methods while scaling to 256 input frames.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 7

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Molmo2 is a new open-source family of vision-language models that achieves state-of-the-art performance among open models, particularly excelling in video understanding and pixel-level grounding tasks. The research introduces 7 new video datasets and 2 multi-image datasets collected without using proprietary VLMs, along with an 8B parameter model that outperforms existing open-weight models and even some proprietary models on specific tasks.

AI · Neutral · arXiv – CS AI · 16h ago · 6/10

CaptionFormer: Unified Segmentation, Tracking, and Captioning for Spatio-Temporal Objects

Researchers introduce CaptionFormer, an end-to-end model that simultaneously detects, segments, tracks, and captions objects in video sequences. The work addresses Dense Video Object Captioning by generating synthetic training data using vision-language models and extends existing datasets, achieving state-of-the-art results across multiple benchmarks.

AI · Neutral · arXiv – CS AI · 16h ago · 5/10

ConTrans: Learning Text-enhanced Local-global Temporal Representations for Zero-shot Temporal Action Localization

ConTrans, a novel neural network architecture, advances zero-shot temporal action localization by combining convolutional and transformer layers to capture both local frame dependencies and long-range video context. The approach reports improved results on standard benchmark datasets, addressing a limitation of existing methods, which underutilize local correlations between frames.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution for the HD-EPIC VQA Challenge

Researchers propose a unified framework for long-form egocentric video understanding that separates reasoning into semantic and visual evidence streams, achieving competitive results on the HD-EPIC-VQA benchmark. The approach addresses fundamental limitations in how multimodal language models process extended video content by combining procedural structure extraction with fine-grained object grounding.

AI · Neutral · arXiv – CS AI · 4d ago · 5/10

Mining Multi-Modality Spatio-Temporal Cues for Video Important Person Identification

Researchers introduce the Video Important Person (VIP) identification task and Temporal-VIP dataset to automatically identify key individuals in video scenes while addressing the Temporal Importance Shift phenomenon. The VIP-Net framework achieves 67.3% accuracy, significantly outperforming existing methods (37.5%-53.9%), with applications in automated video editing and intelligent surveillance.

🏢 Hugging Face

AI · Bullish · arXiv – CS AI · 4d ago · 6/10

Agentic Active Omni-Modal Perception for Multi-Hop Audio-Visual Reasoning

Researchers introduce MOV-Bench, a benchmark for evaluating multi-hop audio-visual reasoning in large language models, and propose AOP-Agent, an agentic framework that enables open-source multimodal LLMs to perform active perception across temporally dispersed audio and visual evidence without additional training.

AI · Bullish · arXiv – CS AI · 4d ago · 6/10

VidPrism: Heterogeneous Mixture of Experts for Image-to-Video Transfer

VidPrism introduces a heterogeneous Mixture-of-Experts framework that enhances Vision-Language Models for video understanding by deploying specialized experts rather than identical generalists. The approach uses dynamic multi-rate sampling and bidirectional fusion to achieve state-of-the-art performance on video recognition benchmarks.

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective

Researchers propose a novel game-theoretic approach to weakly-supervised video temporal grounding that models video frames and query words as cooperative game players to improve moment localization. The method addresses limitations in existing contrastive learning approaches by enabling fine-grained cross-modal interaction without relying on complex moment proposals, demonstrating superior performance on benchmark datasets.

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

DynFrame: Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Augmentation for Complex Video Understanding

Researchers introduce DynFrame, an advanced video understanding framework that enables multimodal language models to dynamically select both temporal windows and frame sampling rates during inference. The approach achieves competitive performance with smaller 4B models against larger 7B-8B baselines and sets new state-of-the-art results with its 8B variant across six video understanding benchmarks.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Rethinking Temporal Consistency in Video Object-Centric Learning: From Prediction to Correspondence

Researchers propose Grounded Correspondence, a new framework for video object tracking that replaces learned prediction models with deterministic bipartite matching. By leveraging existing vision backbone features, the framework achieves competitive results without learnable temporal parameters, challenging the convention of using dynamics modules for temporal consistency.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models

Researchers introduce STEMO-Bench, a benchmark for evaluating video understanding in multimodal large language models (MLLMs), and propose STEMO-Track, a framework that reduces hallucinations by explicitly tracking object identities and states across time. The work addresses a critical limitation in current video AI systems: their inability to persistently monitor objects and temporal relationships in dynamic scenes.

AI · Bullish · arXiv – CS AI · May 11 · 6/10

Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding

SAVEMem is a training-free framework that improves real-time video understanding by incorporating semantic awareness into memory management rather than relying solely on visual similarity. The system achieves significant performance gains on streaming video benchmarks while reducing GPU memory consumption by 48%, demonstrating practical advances in efficient AI model inference.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding

Response-G1 introduces a novel framework for real-time video understanding that uses explicit scene graphs to align video evidence with query-specific response conditions, enabling Video-LLMs to make more accurate timing decisions during streaming video analysis without requiring fine-tuning.

AI · Bullish · arXiv – CS AI · May 9 · 6/10

Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement

Researchers introduce NOVA, a world modeling framework that represents scene state as weights in implicit neural representations (INRs) rather than traditional encoded latent spaces. The approach eliminates decoder bottlenecks, achieves structural disentanglement of scene components, and enables controllable video generation on consumer GPUs with only 40M parameters.

AI · Bullish · arXiv – CS AI · Apr 14 · 6/10

BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning

Researchers introduce BoxTuning, a novel approach for improving video understanding in multimodal AI models by rendering object bounding boxes directly onto video frames as visual prompts rather than encoding them as text tokens. The method achieves an 87-93% reduction in text token usage while maintaining full temporal resolution, demonstrating superior performance on video question-answering tasks.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks

Researchers introduce HumanVBench, a comprehensive benchmark for evaluating how well multimodal AI models understand human-centric video content across 16 tasks, including emotion recognition and speech-visual alignment. The study evaluates 30 leading MLLMs and finds significant performance gaps, even among top proprietary models, and it introduces automated synthesis pipelines that enable scalable benchmark creation with minimal human effort.

AI · Neutral · arXiv – CS AI · Apr 13 · 6/10

See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

Researchers introduce AV-SpeakerBench, a new 3,212-question benchmark designed to evaluate how well multimodal large language models understand audiovisual speech by correlating speakers with their dialogue and timing. Testing reveals Gemini 2.5 Pro significantly outperforms open-source competitors, with the gap primarily attributable to inferior audiovisual fusion capabilities rather than visual perception limitations.

🧠 Gemini

AI · Bullish · arXiv – CS AI · Apr 6 · 6/10

A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos

Researchers propose a fully end-to-end training paradigm for temporal sentence grounding in videos, introducing the Sentence Conditioned Adapter (SCADA) to better align video understanding with natural language queries. The method outperforms existing approaches by jointly optimizing video backbones and localization components rather than using frozen pre-trained encoders.
