AI · Neutral · arXiv – CS AI · 4d ago · 7/10
🧠Researchers introduce EgoBench, a new benchmark for evaluating AI agents' ability to perceive visual information, reason through multi-step tasks, and interact with users in real-world scenarios. Testing eight state-of-the-art video models reveals significant limitations, with the best performer achieving only 30.62% accuracy, exposing critical gaps in current AI agent capabilities.
AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠Researchers introduce Tensor Memory, a fixed-size recurrent module that augments Transformers with persistent 3D spatial state for improved long-sequence processing. The approach enables better video understanding and occlusion reasoning by decoupling memory capacity from input length while maintaining computational efficiency.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce MAGIC-Video, a training-free framework that enables multimodal AI systems to process and reason about ultra-long videos spanning days or weeks by combining a structured memory graph with narrative chains. The system outperforms existing baselines on multiple benchmarks, addressing a critical limitation where current LLMs can only handle tens of minutes of video despite having million-token context windows.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce HY-Himmel, a hierarchical video-language framework that efficiently processes long videos by separating semantic and motion encoding tasks. The system uses sparse keyframes for visual grounding while a lightweight adapter extracts motion information from compressed video data, achieving better performance than dense-frame baselines while reducing token usage by 3.6x.
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Researchers introduce Video Understanding Reward Bench (VURB), a comprehensive benchmark with 2,100 preference pairs for evaluating video reward models, alongside VUP-35K, a large-scale dataset of 35,000 preference examples. Two new models, VideoDRM and VideoGRM, achieve state-of-the-art performance on video understanding tasks, advancing multimodal AI capabilities beyond text and images.
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠TimeRewarder is a new machine learning method that learns dense reward signals from passive videos to improve reinforcement learning in robotics. By modeling temporal distances between video frames, the approach achieves 90% success rates on Meta-World tasks using significantly fewer environment interactions than prior methods, while also leveraging human videos for scalable reward learning.
AI · Neutral · arXiv – CS AI · Mar 26 · 7/10
🧠Researchers propose DIG, a training-free framework that improves long-form video understanding by adapting frame selection strategies based on query types. The system uses uniform sampling for global queries and specialized selection for localized queries, achieving better performance than existing methods while scaling to 256 input frames.
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 7
🧠Molmo2 is a new open-source family of vision-language models that achieves state-of-the-art performance among open models, particularly excelling in video understanding and pixel-level grounding tasks. The research introduces 7 new video datasets and 2 multi-image datasets collected without using proprietary VLMs, along with an 8B parameter model that outperforms existing open-weight models and even some proprietary models on specific tasks.
AI · Neutral · arXiv – CS AI · 18h ago · 6/10
🧠Researchers introduce CaptionFormer, an end-to-end model that simultaneously detects, segments, tracks, and captions objects in video sequences. The work addresses Dense Video Object Captioning by generating synthetic training data using vision-language models and extends existing datasets, achieving state-of-the-art results across multiple benchmarks.
AI · Neutral · arXiv – CS AI · 18h ago · 5/10
🧠ConTrans, a novel neural network architecture, advances zero-shot temporal action localization by combining convolutional and transformer layers to capture both local frame dependencies and long-range video context. The approach achieves new best results on standard benchmark datasets, addressing a limitation of existing methods, which underutilize local correlations between frames.
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠Researchers propose a unified framework for long-form egocentric video understanding that separates reasoning into semantic and visual evidence streams, achieving competitive results on the HD-EPIC-VQA benchmark. The approach addresses fundamental limitations in how multimodal language models process extended video content by combining procedural structure extraction with fine-grained object grounding.
AI · Neutral · arXiv – CS AI · 4d ago · 5/10
🧠Researchers introduce the Video Important Person (VIP) identification task and Temporal-VIP dataset to automatically identify key individuals in video scenes while addressing the Temporal Importance Shift phenomenon. The VIP-Net framework achieves 67.3% accuracy, significantly outperforming existing methods (37.5%-53.9%), with applications in automated video editing and intelligent surveillance.
🏢 Hugging Face
AI · Bullish · arXiv – CS AI · 4d ago · 6/10
🧠Researchers introduce MOV-Bench, a benchmark for evaluating multi-hop audio-visual reasoning in large language models, and propose AOP-Agent, an agentic framework that enables open-source multimodal LLMs to perform active perception across temporally dispersed audio and visual evidence without additional training.
AI · Bullish · arXiv – CS AI · 4d ago · 6/10
🧠VidPrism introduces a heterogeneous Mixture-of-Experts framework that enhances Vision-Language Models for video understanding by deploying specialized experts rather than identical generalists. The approach uses dynamic multi-rate sampling and bidirectional fusion to achieve state-of-the-art performance on video recognition benchmarks.
AI · Neutral · arXiv – CS AI · 5d ago · 6/10
🧠Researchers propose a novel game-theoretic approach to weakly-supervised video temporal grounding that models video frames and query words as cooperative game players to improve moment localization. The method addresses limitations in existing contrastive learning approaches by enabling fine-grained cross-modal interaction without relying on complex moment proposals, demonstrating superior performance on benchmark datasets.
AI · Neutral · arXiv – CS AI · 5d ago · 6/10
🧠Researchers introduce DynFrame, a video understanding framework that enables multimodal language models to dynamically select both temporal windows and frame sampling rates during inference. Its 4B model is competitive with larger 7B-8B baselines, and its 8B variant sets new state-of-the-art results across six video understanding benchmarks.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers propose Grounded Correspondence, a new framework for video object tracking that replaces learned prediction models with deterministic bipartite matching. By leveraging existing vision backbone features, the approach achieves competitive results without learnable temporal parameters, challenging the conventional reliance on dynamics modules for temporal consistency.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduce STEMO-Bench, a benchmark for evaluating video understanding in multimodal large language models (MLLMs), and propose STEMO-Track, a framework that reduces hallucinations by explicitly tracking object identities and states across time. The work addresses a critical limitation in current video AI systems: their inability to persistently monitor objects and temporal relationships in dynamic scenes.
AI · Bullish · arXiv – CS AI · May 11 · 6/10
🧠SAVEMem is a training-free framework that improves real-time video understanding by incorporating semantic awareness into memory management rather than relying solely on visual similarity. The system achieves significant performance gains on streaming video benchmarks while reducing GPU memory consumption by 48%, demonstrating practical advances in efficient AI model inference.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Response-G1 introduces a novel framework for real-time video understanding that uses explicit scene graphs to align video evidence with query-specific response conditions, enabling Video-LLMs to make more accurate timing decisions during streaming video analysis without requiring fine-tuning.
AI · Bullish · arXiv – CS AI · May 9 · 6/10
🧠Researchers introduce NOVA, a world modeling framework that represents scene state as weights in implicit neural representations (INRs) rather than traditional encoded latent spaces. The approach eliminates decoder bottlenecks, achieves structural disentanglement of scene components, and enables controllable video generation on consumer GPUs with only 40M parameters.
AI · Bullish · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers introduce BoxTuning, a novel approach for improving video understanding in multimodal AI models by rendering object bounding boxes directly onto video frames as visual prompts rather than encoding them as text tokens. The method achieves an 87-93% reduction in text token usage while maintaining full temporal resolution, demonstrating superior performance on video question-answering tasks.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers introduced HumanVBench, a comprehensive benchmark for evaluating how well multimodal AI models understand human-centric video content across 16 tasks including emotion recognition and speech-visual alignment. The study evaluated 30 leading MLLMs and found significant performance gaps, even among top proprietary models, while introducing automated synthesis pipelines to enable scalable benchmark creation with minimal human effort.
AI · Neutral · arXiv – CS AI · Apr 13 · 6/10
🧠Researchers introduce AV-SpeakerBench, a new 3,212-question benchmark designed to evaluate how well multimodal large language models understand audiovisual speech by correlating speakers with their dialogue and timing. Testing reveals that Gemini 2.5 Pro significantly outperforms open-source competitors, with the gap primarily attributable to the open-source models' weaker audiovisual fusion rather than to visual perception limitations.
🧠 Gemini
AI · Bullish · arXiv – CS AI · Apr 6 · 6/10
🧠Researchers propose a fully end-to-end training paradigm for temporal sentence grounding in videos, introducing the Sentence Conditioned Adapter (SCADA) to better align video understanding with natural language queries. The method outperforms existing approaches by jointly optimizing video backbones and localization components rather than using frozen pre-trained encoders.