y0news

#video-understanding News & Analysis

18 articles tagged with #video-understanding. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 3d ago · 7/10

TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance

TimeRewarder is a new machine learning method that learns dense reward signals from passive videos to improve reinforcement learning in robotics. By modeling temporal distances between video frames, the approach achieves 90% success rates on Meta-World tasks using significantly fewer environment interactions than prior methods, while also leveraging human videos for scalable reward learning.
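
The core idea is that a predictor of temporal distance between the current observation and a goal frame can be turned into a step-wise reward. A minimal sketch of that reward shaping is below; the architecture, dimensions, and function names are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TemporalDistancePredictor(nn.Module):
    """Predicts how many timesteps separate two frame embeddings of a demonstration video."""
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, frame_a: torch.Tensor, frame_b: torch.Tensor) -> torch.Tensor:
        return self.head(torch.cat([frame_a, frame_b], dim=-1)).squeeze(-1)

def dense_reward(predictor, prev_emb, curr_emb, goal_emb):
    """Reward = predicted progress toward the goal frame between two consecutive steps."""
    with torch.no_grad():
        d_prev = predictor(prev_emb, goal_emb)
        d_curr = predictor(curr_emb, goal_emb)
    # Positive when the new observation looks temporally closer to the goal frame.
    return (d_prev - d_curr).item()
```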

AI · Neutral · arXiv – CS AI · Mar 26 · 7/10

Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding

Researchers propose DIG, a training-free framework that improves long-form video understanding by adapting frame selection strategies based on query types. The system uses uniform sampling for global queries and specialized selection for localized queries, achieving better performance than existing methods while scaling to 256 input frames.
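
The key mechanism is branching the frame-selection strategy on the query type. The hedged sketch below assumes hypothetical helpers (`is_global_query`, `score_relevance`) standing in for whatever query classifier and relevance scorer the framework actually uses.

```python
import numpy as np

def select_frames(frames, query, is_global_query, score_relevance, budget=256):
    """Pick up to `budget` frames: uniform coverage for global queries, top-scored frames for localized ones."""
    if is_global_query(query):
        # Global queries (e.g., "summarize the video") need broad temporal coverage.
        idx = np.linspace(0, len(frames) - 1, num=min(budget, len(frames)), dtype=int)
    else:
        # Localized queries (e.g., "what happens after the door opens?") keep the
        # frames scored as most relevant to the query text.
        scores = np.array([score_relevance(frame, query) for frame in frames])
        idx = np.sort(np.argsort(-scores)[:budget])
    return [frames[i] for i in idx]
```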

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Molmo2 is a new open-source family of vision-language models that achieves state-of-the-art performance among open models, particularly excelling in video understanding and pixel-level grounding tasks. The research introduces 7 new video datasets and 2 multi-image datasets collected without using proprietary VLMs, along with an 8B parameter model that outperforms existing open-weight models and even some proprietary models on specific tasks.

AI · Bullish · arXiv – CS AI · 3d ago · 6/10

BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning

Researchers introduce BoxTuning, a novel approach for improving video understanding in multimodal AI models by rendering object bounding boxes directly onto video frames as visual prompts rather than encoding them as text tokens. The method achieves an 87–93% reduction in text-token usage while maintaining full temporal resolution, and demonstrates superior performance on video question-answering tasks.
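
Because the boxes are drawn into the pixels rather than serialized as coordinate strings, the text prompt stays short. Below is a minimal sketch of that rendering step, with drawing details (color, line width) chosen purely for illustration.

```python
from PIL import Image, ImageDraw

def render_box_prompt(frame: Image.Image, boxes) -> Image.Image:
    """Burn object bounding boxes into the frame as a visual prompt.

    `boxes` holds (x0, y0, x1, y1) pixel coordinates; no coordinate text is
    appended to the prompt, which is where the token savings come from.
    """
    out = frame.copy()
    draw = ImageDraw.Draw(out)
    for x0, y0, x1, y1 in boxes:
        draw.rectangle((x0, y0, x1, y1), outline="red", width=3)
    return out
```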

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks

Researchers introduced HumanVBench, a comprehensive benchmark for evaluating how well multimodal AI models understand human-centric video content across 16 tasks including emotion recognition and speech-visual alignment. The study evaluated 30 leading MLLMs and found significant performance gaps, even among top proprietary models, while introducing automated synthesis pipelines to enable scalable benchmark creation with minimal human effort.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

Researchers introduce AV-SpeakerBench, a new 3,212-question benchmark designed to evaluate how well multimodal large language models understand audiovisual speech by correlating speakers with their dialogue and timing. Testing reveals Gemini 2.5 Pro significantly outperforms open-source competitors, with the gap primarily attributable to inferior audiovisual fusion capabilities rather than visual perception limitations.

🧠 Gemini
AI · Bullish · arXiv – CS AI · Apr 6 · 6/10

A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos

Researchers propose a fully end-to-end training paradigm for temporal sentence grounding in videos, introducing the Sentence Conditioned Adapter (SCADA) to better align video understanding with natural language queries. The method outperforms existing approaches by jointly optimizing video backbones and localization components rather than using frozen pre-trained encoders.
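
One way to read "sentence-conditioned adapter" is as a lightweight residual module inside the video backbone whose features are modulated by the query embedding, with everything trained jointly. The FiLM-style sketch below is an assumption for illustration, not necessarily SCADA's exact design.

```python
import torch
import torch.nn as nn

class SentenceConditionedAdapter(nn.Module):
    """Lightweight adapter that modulates video features with the query sentence."""
    def __init__(self, video_dim: int = 768, text_dim: int = 512, hidden: int = 128):
        super().__init__()
        self.down = nn.Linear(video_dim, hidden)
        self.up = nn.Linear(hidden, video_dim)
        self.scale = nn.Linear(text_dim, hidden)  # per-query feature scaling
        self.shift = nn.Linear(text_dim, hidden)  # per-query feature shift

    def forward(self, video_feats: torch.Tensor, sent_emb: torch.Tensor) -> torch.Tensor:
        # video_feats: (batch, frames, video_dim); sent_emb: (batch, text_dim)
        h = self.down(video_feats)
        h = h * self.scale(sent_emb).unsqueeze(1) + self.shift(sent_emb).unsqueeze(1)
        # Residual connection keeps the backbone's representation intact while the
        # adapter (and, end to end, the backbone itself) is optimized jointly.
        return video_feats + self.up(torch.relu(h))
```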

AI · Bullish · arXiv – CS AI · Mar 27 · 6/10

TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs

Researchers introduce TimeLens, a family of multimodal large language models optimized for video temporal grounding that outperforms existing open-source models and even surpasses proprietary models like GPT-5 and Gemini-2.5-Flash. The work addresses critical data quality issues in existing benchmarks and introduces improved training datasets and algorithmic design principles.

🧠 GPT-5 · 🧠 Gemini
AI · Neutral · arXiv – CS AI · Mar 26 · 6/10

GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents

Researchers introduce GameplayQA, a new benchmarking framework for evaluating multimodal large language models on 3D virtual agent perception and reasoning tasks. The framework uses densely annotated multiplayer gameplay videos with 2.4K diagnostic QA pairs, revealing substantial performance gaps between current frontier models and human-level understanding.

AI · Bullish · arXiv – CS AI · Mar 26 · 6/10

LensWalk: Agentic Video Understanding by Planning How You See in Videos

Researchers introduced LensWalk, an agentic AI framework that enables Large Language Models to actively control their visual observation of videos through dynamic temporal sampling. The system uses a reason-plan-observe loop to progressively gather evidence, achieving 5% accuracy improvements on challenging video benchmarks without requiring model fine-tuning.
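
The reason-plan-observe loop can be pictured as a model repeatedly choosing which time window to sample next until it has enough evidence to answer. The sketch below assumes hypothetical `llm_plan`, `llm_answer`, and `video.sample` interfaces; it illustrates the control flow, not the paper's actual API.

```python
def answer_with_agentic_viewing(video, question, llm_plan, llm_answer, max_rounds=5):
    """Iteratively decide which time window to look at next, then answer."""
    evidence = []  # frames gathered so far, with their timestamps
    for _ in range(max_rounds):
        # Reason + plan: given the question and what has been seen, pick the next
        # temporal window to sample, or declare the evidence sufficient.
        plan = llm_plan(question=question, evidence=evidence, duration=video.duration)
        if plan["done"]:
            break
        # Observe: sample a handful of frames from the planned window.
        evidence.extend(video.sample(start=plan["start"], end=plan["end"], num_frames=8))
    return llm_answer(question=question, evidence=evidence)
```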

AI · Neutral · arXiv – CS AI · Mar 11 · 6/10

EgoCross: Benchmarking Multimodal Large Language Models for Cross-Domain Egocentric Video Question Answering

Researchers introduce EgoCross, a new benchmark to evaluate multimodal AI models on egocentric video understanding across diverse domains like surgery, extreme sports, and industrial settings. The study reveals that current AI models, including specialized egocentric models, struggle with cross-domain generalization beyond common daily activities.

AI · Neutral · arXiv – CS AI · Mar 4 · 5/10

VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos

Researchers introduce VideoTemp-o3, a new AI framework that improves long-video understanding by intelligently identifying relevant video segments and performing targeted analysis. The system addresses key limitations in current video AI models including weak localization and rigid workflows through unified masking mechanisms and reinforcement learning rewards.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents

Researchers have developed MM-Mem, a new pyramidal multimodal memory architecture that enables AI systems to better understand long-horizon videos by mimicking human cognitive memory processes. The system addresses current limitations in multimodal large language models by creating a hierarchical memory structure that progressively distills detailed visual information into high-level semantic understanding.
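
A pyramidal memory of this kind can be thought of as verbatim clip features at the bottom being periodically distilled into event summaries and then into a running gist. The three fixed levels and the `summarize` callback in the sketch below are simplifying assumptions used only to illustrate the structure.

```python
from collections import deque

class PyramidalMemory:
    """Verbatim clip features at the bottom, distilled semantic gist at the top."""
    def __init__(self, summarize, clip_capacity=64, event_capacity=16):
        self.summarize = summarize                  # e.g., a VLM/LLM summarizer callback
        self.clips = deque(maxlen=clip_capacity)    # level 1: raw clip features
        self.events = deque(maxlen=event_capacity)  # level 2: event-level summaries
        self.gist = ""                              # level 3: running high-level gist

    def add_clip(self, clip_features):
        self.clips.append(clip_features)
        if len(self.clips) == self.clips.maxlen:
            # Distill a full window of clips into one event summary...
            self.events.append(self.summarize(list(self.clips)))
            self.clips.clear()
        if len(self.events) == self.events.maxlen:
            # ...and fold accumulated events into the top-level gist.
            self.gist = self.summarize([self.gist] + list(self.events))
            self.events.clear()
```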

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding

FluxMem is a new training-free framework for streaming video understanding that uses hierarchical memory compression to reduce computational costs. The system achieves state-of-the-art performance on video benchmarks while reducing latency by 69.9% and GPU memory usage by 34.5%.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

Contribution-aware Token Compression for Efficient Video Understanding via Reinforcement Learning

Researchers developed CaCoVID, a reinforcement learning-based algorithm that compresses video tokens for large language models by selecting tokens based on their actual contribution to correct predictions rather than attention scores. The method uses combinatorial policy optimization to reduce computational overhead while maintaining video understanding performance.
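
The distinction the paper draws is scoring tokens by their contribution to the correct answer rather than by attention weight. CaCoVID learns that selection with reinforcement learning; the leave-one-out scoring below is a much simpler, hypothetical stand-in (with an assumed `model.answer_logprob` interface) used only to illustrate the contribution criterion.

```python
import torch

def contribution_scores(model, video_tokens, question, answer):
    """Score each video token by how much removing it hurts the answer's log-likelihood."""
    base = model.answer_logprob(video_tokens, question, answer)
    scores = []
    for i in range(video_tokens.shape[0]):
        masked = torch.cat([video_tokens[:i], video_tokens[i + 1:]], dim=0)
        scores.append(base - model.answer_logprob(masked, question, answer))
    return torch.tensor(scores)

def compress_tokens(model, video_tokens, question, answer, keep_ratio=0.25):
    """Keep only the tokens that contribute most to getting the answer right."""
    k = max(1, int(keep_ratio * video_tokens.shape[0]))
    keep = torch.topk(contribution_scores(model, video_tokens, question, answer), k).indices
    return video_tokens[torch.sort(keep).values]
```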

AI · Bullish · arXiv – CS AI · Feb 27 · 5/10

MomentMix Augmentation with Length-Aware DETR for Temporally Robust Moment Retrieval

Researchers developed MomentMix and Length-Aware DETR to improve video moment retrieval, addressing challenges in localizing short video segments based on natural language queries. The method achieves significant performance gains on benchmark datasets, with up to 16.9% improvement in average mAP on QVHighlights.

AI · Bullish · Hugging Face Blog · Feb 20 · 6/10

SmolVLM2: Bringing Video Understanding to Every Device

SmolVLM2 is a family of small multimodal models from Hugging Face that brings video understanding to resource-constrained devices. The release aims to make video-capable vision-language models more accessible and efficient for on-device and edge deployments.

AI · Neutral · arXiv – CS AI · Mar 11 · 5/10

MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents

Researchers introduce MA-EgoQA, a benchmark for evaluating AI models' ability to understand multiple egocentric video streams from embodied agents simultaneously. The benchmark includes 1.7k questions across five categories and reveals current approaches struggle with multi-agent system-level understanding.