y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#video-understanding News & Analysis

33 articles tagged with #video-understanding. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

33 articles
AINeutralarXiv – CS AI · Mar 45/103
🧠

VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos

Researchers introduce VideoTemp-o3, a new AI framework that improves long-video understanding by intelligently identifying relevant video segments and performing targeted analysis. The system addresses key limitations in current video AI models including weak localization and rigid workflows through unified masking mechanisms and reinforcement learning rewards.

AIBullisharXiv – CS AI · Mar 37/109
🧠

From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents

Researchers have developed MM-Mem, a new pyramidal multimodal memory architecture that enables AI systems to better understand long-horizon videos by mimicking human cognitive memory processes. The system addresses current limitations in multimodal large language models by creating a hierarchical memory structure that progressively distills detailed visual information into high-level semantic understanding.

AIBullisharXiv – CS AI · Mar 36/103
🧠

FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding

FluxMem is a new training-free framework for streaming video understanding that uses hierarchical memory compression to reduce computational costs. The system achieves state-of-the-art performance on video benchmarks while reducing latency by 69.9% and GPU memory usage by 34.5%.

AIBullisharXiv – CS AI · Mar 36/104
🧠

Contribution-aware Token Compression for Efficient Video Understanding via Reinforcement Learning

Researchers developed CaCoVID, a reinforcement learning-based algorithm that compresses video tokens for large language models by selecting tokens based on their actual contribution to correct predictions rather than attention scores. The method uses combinatorial policy optimization to reduce computational overhead while maintaining video understanding performance.

AIBullisharXiv – CS AI · Feb 275/107
🧠

MomentMix Augmentation with Length-Aware DETR for Temporally Robust Moment Retrieval

Researchers developed MomentMix and Length-Aware DETR to improve video moment retrieval, addressing challenges in localizing short video segments based on natural language queries. The method achieves significant performance gains on benchmark datasets, with up to 16.9% improvement in average mAP on QVHighlights.

AIBullishHugging Face Blog · Feb 206/105
🧠

SmolVLM2: Bringing Video Understanding to Every Device

SmolVLM2 represents an advancement in multimodal AI technology, bringing video understanding capabilities to smaller devices. This development suggests progress in making AI models more accessible and efficient for edge computing applications.

AINeutralarXiv – CS AI · Mar 115/10
🧠

MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents

Researchers introduce MA-EgoQA, a benchmark for evaluating AI models' ability to understand multiple egocentric video streams from embodied agents simultaneously. The benchmark includes 1.7k questions across five categories and reveals current approaches struggle with multi-agent system-level understanding.

← PrevPage 2 of 2