#video-understanding News & Analysis

67 articles tagged with #video-understanding. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Metadata-Aware Multi-Prompt Reasoning for Zero-Shot Accident Understanding

Researchers present a three-stage pipeline for zero-shot accident detection in surveillance videos that combines temporal localization, semantic classification, and spatial grounding using vision-language models. The method decomposes accident understanding into when, what, and where components, achieving significant improvements over baseline approaches on the ACCIDENT benchmark.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Natural-Language Temporal Grounding in Hour-Long Videos is a Search Problem: A Benchmark and Empirical Decomposition

Researchers introduce ExtremeWhenBench, a benchmark for temporal grounding in hour-long videos using natural language queries. The study reveals that video-language models fail dramatically on long-form content because search—not recognition—is the bottleneck, with a hybrid retrieve-then-ground approach recovering 6.7x performance over monolithic models.

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention

Researchers introduce MOSS-Video-Preview, a cross-attention architecture enabling real-time video understanding where models process frames continuously and revise answers as new information arrives. The approach achieves 5x speedup in time-to-first-token and 2.7x higher decoding throughput compared to decoder-only models, while maintaining competitive offline performance.

AI · Neutral · arXiv – CS AI · Jun 9 · 5/10

Decoupling Semantics and Logic: A Training-Free Coarse-to-Fine Pipeline for Video Retrieval-Augmented Generation

Researchers present a training-free Video RAG (Retrieval-Augmented Generation) system that decouples semantic retrieval from logical reasoning to improve cross-lingual video comprehension and reduce hallucinations. The two-stage pipeline uses dense retrieval with clean visual data followed by LLM-powered cognitive reranking, achieving strong precision in information retrieval and persona-conditioned generation.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Video Understanding by Design: How Datasets Shape Video Models

A comprehensive survey argues that dataset structure fundamentally shapes the evolution of video understanding models, connecting dataset characteristics to architectural innovations like transformers and multimodal foundation models. The research provides a unified framework explaining how different datasets drive specific inductive biases and architectural choices across video AI development.

AI · Neutral · arXiv – CS AI · Jun 9 · 5/10

SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM

Researchers introduce SMART, a new multimodal AI framework for video moment retrieval that combines audio and visual features with shot-aware token compression to locate specific temporal segments in untrimmed videos. The method demonstrates significant performance improvements on benchmark datasets, achieving 1.61% and 2.59% gains in key metrics over previous state-of-the-art approaches.

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

OmniMem: Perturbation-aware Memory Compression for Streaming Audio-Visual LLMs

OmniMem is a new memory compression framework for audio-visual large language models that enables efficient long-form video understanding by using modality-aware memory allocation and perturbation-aware token selection. The approach achieves 2-4% accuracy improvements over existing compression methods while reducing memory requirements, with potential applications in real-time video AI systems.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Multimodal Large Language Models as Synthetic Participants in Video-Based Studies: An Evaluation

Researchers evaluated whether multimodal large language models (MLLMs) like Gemini 3 Flash and Qwen 3 Omni can replicate human subjective responses in video perception tasks using the Perceived Message Sensation Value framework. The study found significant limitations: MLLMs demonstrated systematic biases including downward mean-shift, central-tendency bias, and inconsistent sensitivity to participant profiles, suggesting current models remain unreliable as synthetic human participants for subjective research.

🧠 Gemini

AI · Neutral · arXiv – CS AI · Jun 8 · 5/10

Hierarchical Semantic-Constrained Heterogeneous Graph for Audio-Visual Event Localization

Researchers propose HSCHG, a novel framework for open-vocabulary audio-visual event localization that addresses temporal consistency and hierarchical semantic constraints by combining heterogeneous graphs in Euclidean space with hyperbolic space representations. The method uses hierarchical entailment regularization to improve recognition of unseen event categories while maintaining cross-modal alignment and semantic consistency across video and segment levels.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

A comprehensive review paper presents a unified framework for analyzing video understanding systems powered by multimodal large language models (MLLMs), organizing capabilities into three functional abilities: watching (perception), remembering (memory), and reasoning (inference). The work identifies key challenges in processing long, sparse, and knowledge-intensive video content while operating under computational constraints.

AI · Bullish · arXiv – CS AI · Jun 8 · 6/10

Enhancing Video Representations with Spatiotemporal-Semantic Residual to Mitigate Hallucinations in Video Large Multimodal Models

Researchers introduce ViSSRes, an inference-time intervention method that reduces hallucinations in Video Large Multimodal Models by enhancing video representations through a lightweight MLP network. The approach achieves a 40.69% reduction in hallucination rates on LLaVA-NeXT-Video while improving video understanding by 18.36%, with minimal computational overhead during inference.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video

Researchers introduce LongSpace-Bench, a video benchmark for evaluating multimodal AI models' ability to remember and retrieve spatial information across long videos, and propose LongSpace, a memory framework that improves long-horizon spatial reasoning by incorporating 3D structural cues and layer-aware memory retrieval.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Towards One-to-Many Temporal Grounding

Researchers introduce One-to-Many Temporal Grounding (OMTG), a new task for localizing multiple video segments that match a single text query. They establish the first OMTG benchmark, with 56k samples and novel evaluation metrics, and report a score of 43.65%, outperforming advanced models such as Gemini 2.5 Pro by 15.85%.

🧠 Gemini

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion

Researchers introduce pause-and-think-T, a reasoning-focused training dataset that enables compact Vision-Language Models to perform grounded video understanding and action suggestion tasks. A 4-billion parameter model fine-tuned on this dataset matches or exceeds much larger models (including GPT-4o and Qwen3-VL-235B) on benchmark tasks while demonstrating strong generalization to unseen datasets.

🧠 GPT-4 · 🧠 GPT-5

AI · Neutral · arXiv – CS AI · Jun 1 · 5/10

ConTrans: Learning Text-enhanced Local-global Temporal Representations for Zero-shot Temporal Action Localization

ConTrans, a novel neural network architecture, advances zero-shot temporal action localization by combining convolutional and transformer layers to capture both local frame dependencies and long-range video context. The approach achieves new benchmark performance on standard datasets, addressing limitations in existing methods that underutilize local correlations between frames.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

CaptionFormer: Unified Segmentation, Tracking, and Captioning for Spatio-Temporal Objects

Researchers introduce CaptionFormer, an end-to-end model that simultaneously detects, segments, tracks, and captions objects in video sequences. The work addresses Dense Video Object Captioning by generating synthetic training data using vision-language models and extends existing datasets, achieving state-of-the-art results across multiple benchmarks.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution for the HD-EPIC VQA Challenge

Researchers propose a unified framework for long-form egocentric video understanding that separates reasoning into semantic and visual evidence streams, achieving competitive results on the HD-EPIC-VQA benchmark. The approach addresses fundamental limitations in how multimodal language models process extended video content by combining procedural structure extraction with fine-grained object grounding.

AI · Bullish · arXiv – CS AI · May 28 · 6/10

Agentic Active Omni-Modal Perception for Multi-Hop Audio-Visual Reasoning

Researchers introduce MOV-Bench, a benchmark for evaluating multi-hop audio-visual reasoning in large language models, and propose AOP-Agent, an agentic framework that enables open-source multimodal LLMs to perform active perception across temporally dispersed audio and visual evidence without additional training.

AI · Bullish · arXiv – CS AI · May 28 · 6/10

VidPrism: Heterogeneous Mixture of Experts for Image-to-Video Transfer

VidPrism introduces a heterogeneous Mixture-of-Experts framework that enhances Vision-Language Models for video understanding by deploying specialized experts rather than identical generalists. The approach uses dynamic multi-rate sampling and bidirectional fusion to achieve state-of-the-art performance on video recognition benchmarks.

AI · Neutral · arXiv – CS AI · May 28 · 5/10

Mining Multi-Modality Spatio-Temporal Cues for Video Important Person Identification

Researchers introduce the Video Important Person (VIP) identification task and Temporal-VIP dataset to automatically identify key individuals in video scenes while addressing the Temporal Importance Shift phenomenon. The VIP-Net framework achieves 67.3% accuracy, significantly outperforming existing methods (37.5%-53.9%), with applications in automated video editing and intelligent surveillance.

🏢 Hugging Face

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective

Researchers propose a novel game-theoretic approach to weakly-supervised video temporal grounding that models video frames and query words as cooperative game players to improve moment localization. The method addresses limitations in existing contrastive learning approaches by enabling fine-grained cross-modal interaction without relying on complex moment proposals, demonstrating superior performance on benchmark datasets.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

DynFrame: Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Augmentation for Complex Video Understanding

Researchers introduce DynFrame, an advanced video understanding framework that enables multimodal language models to dynamically select both temporal windows and frame sampling rates during inference. The approach achieves competitive performance with smaller 4B models against larger 7B-8B baselines and sets new state-of-the-art results with its 8B variant across six video understanding benchmarks.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Rethinking Temporal Consistency in Video Object-Centric Learning: From Prediction to Correspondence

Researchers propose Grounded Correspondence, a new framework for video object tracking that replaces learned prediction models with deterministic bipartite matching. By leveraging existing vision backbone features, the approach achieves competitive results without learnable temporal parameters, challenging the conventional approach of using dynamics modules for temporal consistency.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models

Researchers introduce STEMO-Bench, a benchmark for evaluating video understanding in multimodal large language models (MLLMs), and propose STEMO-Track, a framework that reduces hallucinations by explicitly tracking object identities and states across time. The work addresses a critical limitation in current video AI systems: their inability to persistently monitor objects and temporal relationships in dynamic scenes.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding

Response-G1 introduces a novel framework for real-time video understanding that uses explicit scene graphs to align video evidence with query-specific response conditions, enabling Video-LLMs to make more accurate timing decisions during streaming video analysis without requiring fine-tuning.
