18 articles tagged with #video-understanding. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bullish · arXiv · CS AI · 3d ago · 7/10
🧠 TimeRewarder is a new machine learning method that learns dense reward signals from passive videos to improve reinforcement learning in robotics. By modeling temporal distances between video frames, the approach achieves 90% success rates on Meta-World tasks using significantly fewer environment interactions than prior methods, while also leveraging human videos for scalable reward learning.
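The core idea of a temporal-distance reward can be sketched in a few lines. This is an illustrative stand-in, not the paper's code: assume a learned model has already predicted, for each frame, its temporal distance to the goal frame, and turn the per-step *reduction* in that distance into a dense reward.

```python
# Hypothetical sketch of temporal-distance reward shaping.
# `progress_rewards` is an illustrative name, not from TimeRewarder's code.

def progress_rewards(distances_to_goal):
    """Turn per-frame predicted temporal distances to the goal frame
    into dense per-step rewards: reward = reduction in distance."""
    return [
        prev - cur
        for prev, cur in zip(distances_to_goal, distances_to_goal[1:])
    ]

# An agent that steadily approaches the goal earns positive rewards;
# stalling earns zero.
rewards = progress_rewards([5.0, 4.0, 2.5, 2.5, 1.0])
# rewards == [1.0, 1.5, 0.0, 1.5]
```

In the actual method the distances come from a model trained on passive video; the sketch only shows how such predictions would be converted into a shaped reward signal.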
AI · Neutral · arXiv · CS AI · Mar 26 · 7/10
🧠 Researchers propose DIG, a training-free framework that improves long-form video understanding by adapting frame selection strategies based on query types. The system uses uniform sampling for global queries and specialized selection for localized queries, achieving better performance than existing methods while scaling to 256 input frames.
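The query-adaptive routing described above can be sketched as a small dispatcher. All names here are illustrative assumptions, not DIG's actual API; the localized branch stands in for the paper's specialized selection with a simple top-k over precomputed query-relevance scores.

```python
def select_frames(num_frames, budget, query_type, relevance=None):
    """Pick frame indices under a budget, routed by query type.
    Global queries: uniform temporal sampling across the video.
    Localized queries: keep the frames scored most relevant to the query."""
    if query_type == "global" or relevance is None:
        step = max(1, num_frames // budget)
        return list(range(0, num_frames, step))[:budget]
    # Localized: top-`budget` frames by relevance, returned in temporal order.
    ranked = sorted(range(num_frames), key=lambda i: relevance[i], reverse=True)
    return sorted(ranked[:budget])

# Global query over a 100-frame clip: every 10th frame.
uniform = select_frames(100, 10, "global")
# Localized query: the two most relevant of five frames, in order.
focused = select_frames(5, 2, "localized", relevance=[0.1, 0.9, 0.2, 0.8, 0.0])
```

The routing itself is the point: a single budget is spent differently depending on whether the question is about the whole video or a specific moment.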
AI · Bullish · arXiv · CS AI · Feb 27 · 7/10
🧠 Molmo2 is a new open-source family of vision-language models that achieves state-of-the-art performance among open models, particularly excelling in video understanding and pixel-level grounding tasks. The research introduces 7 new video datasets and 2 multi-image datasets collected without using proprietary VLMs, along with an 8B parameter model that outperforms existing open-weight models and even some proprietary models on specific tasks.
AI · Bullish · arXiv · CS AI · 3d ago · 6/10
🧠 Researchers introduce BoxTuning, a novel approach for improving video understanding in multimodal AI models by rendering object bounding boxes directly onto video frames as visual prompts rather than encoding them as text tokens. The method achieves an 87-93% reduction in text token usage while maintaining full temporal resolution, demonstrating superior performance on video question-answering tasks.
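The "render the box into the pixels" idea is simple to illustrate. A minimal sketch, assuming frames are plain 2D pixel grids; the function name and signature are hypothetical, and real implementations would draw colored outlines on RGB tensors rather than overwrite grayscale values.

```python
def draw_box(frame, top, left, bottom, right, value=255):
    """Overlay a bounding-box outline directly onto the pixel grid,
    so the region of interest is consumed visually by the vision
    encoder instead of being spelled out as coordinate text tokens."""
    for x in range(left, right + 1):
        frame[top][x] = value      # top edge
        frame[bottom][x] = value   # bottom edge
    for y in range(top, bottom + 1):
        frame[y][left] = value     # left edge
        frame[y][right] = value    # right edge
    return frame

# An 8x8 black frame with a box outlined from (2, 2) to (5, 6).
frame = [[0] * 8 for _ in range(8)]
draw_box(frame, 2, 2, 5, 6)
```

The token saving follows directly: one box rendered into pixels replaces a string like "[x1=2, y1=2, x2=6, y2=5]" that would otherwise cost many text tokens per object per frame.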
AI · Neutral · arXiv · CS AI · 3d ago · 6/10
🧠 Researchers introduced HumanVBench, a comprehensive benchmark for evaluating how well multimodal AI models understand human-centric video content across 16 tasks including emotion recognition and speech-visual alignment. The study evaluated 30 leading MLLMs and found significant performance gaps, even among top proprietary models, while introducing automated synthesis pipelines to enable scalable benchmark creation with minimal human effort.
AI · Neutral · arXiv · CS AI · 4d ago · 6/10
🧠 Researchers introduce AV-SpeakerBench, a new 3,212-question benchmark designed to evaluate how well multimodal large language models understand audiovisual speech by correlating speakers with their dialogue and timing. Testing reveals Gemini 2.5 Pro significantly outperforms open-source competitors, with the gap primarily attributable to inferior audiovisual fusion capabilities rather than visual perception limitations.
AI · Bullish · arXiv · CS AI · Apr 6 · 6/10
🧠 Researchers propose a fully end-to-end training paradigm for temporal sentence grounding in videos, introducing the Sentence Conditioned Adapter (SCADA) to better align video understanding with natural language queries. The method outperforms existing approaches by jointly optimizing video backbones and localization components rather than using frozen pre-trained encoders.
AI · Bullish · arXiv · CS AI · Mar 27 · 6/10
🧠 Researchers introduce TimeLens, a family of multimodal large language models optimized for video temporal grounding that outperforms existing open-source models and even surpasses proprietary models like GPT-5 and Gemini-2.5-Flash. The work addresses critical data quality issues in existing benchmarks and introduces improved training datasets and algorithmic design principles.
AI · Neutral · arXiv · CS AI · Mar 26 · 6/10
🧠 Researchers introduce GameplayQA, a new benchmarking framework for evaluating multimodal large language models on 3D virtual agent perception and reasoning tasks. The framework uses densely annotated multiplayer gameplay videos with 2.4K diagnostic QA pairs, revealing substantial performance gaps between current frontier models and human-level understanding.
AI · Bullish · arXiv · CS AI · Mar 26 · 6/10
🧠 Researchers introduced LensWalk, an agentic AI framework that enables Large Language Models to actively control their visual observation of videos through dynamic temporal sampling. The system uses a reason-plan-observe loop to progressively gather evidence, achieving 5% accuracy improvements on challenging video benchmarks without requiring model fine-tuning.
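A reason-plan-observe loop of this kind can be caricatured as coarse-to-fine temporal search. The sketch below is an assumption-laden simplification: `score_frame` stands in for whatever relevance judgment the LLM makes about an observed frame, and the narrowing schedule is invented for illustration.

```python
def reason_plan_observe(score_frame, num_frames, rounds=3, samples=5):
    """Coarse-to-fine temporal search: sample the current window sparsely,
    score the observations (the 'reason' step), then narrow the window
    around the most promising frame (the 'plan' step) and observe again."""
    lo, hi = 0, num_frames - 1
    best = lo
    for _ in range(rounds):
        step = max(1, (hi - lo) // (samples - 1))
        candidates = list(range(lo, hi + 1, step))   # observe sparsely
        best = max(candidates, key=score_frame)      # reason about evidence
        window = max(1, (hi - lo) // 4)              # plan a tighter window
        lo = max(0, best - window)
        hi = min(num_frames - 1, best + window)
    return best

# With a single relevance peak at frame 73 of 1000, the loop homes in on
# the neighborhood of the peak using only a handful of observations per round.
found = reason_plan_observe(lambda i: -abs(i - 73), 1000)
```

Compared with one-shot uniform sampling, the loop spends its frame budget adaptively, which is the property the summary credits for the accuracy gains.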
AI · Neutral · arXiv · CS AI · Mar 11 · 6/10
🧠 Researchers introduce EgoCross, a new benchmark to evaluate multimodal AI models on egocentric video understanding across diverse domains like surgery, extreme sports, and industrial settings. The study reveals that current AI models, including specialized egocentric models, struggle with cross-domain generalization beyond common daily activities.
AI · Neutral · arXiv · CS AI · Mar 4 · 5/10
🧠 Researchers introduce VideoTemp-o3, a new AI framework that improves long-video understanding by intelligently identifying relevant video segments and performing targeted analysis. The system addresses key limitations in current video AI models, including weak localization and rigid workflows, through unified masking mechanisms and reinforcement learning rewards.
AI · Bullish · arXiv · CS AI · Mar 3 · 7/10
🧠 Researchers have developed MM-Mem, a new pyramidal multimodal memory architecture that enables AI systems to better understand long-horizon videos by mimicking human cognitive memory processes. The system addresses current limitations in multimodal large language models by creating a hierarchical memory structure that progressively distills detailed visual information into high-level semantic understanding.
AI · Bullish · arXiv · CS AI · Mar 3 · 6/10
🧠 FluxMem is a new training-free framework for streaming video understanding that uses hierarchical memory compression to reduce computational costs. The system achieves state-of-the-art performance on video benchmarks while reducing latency by 69.9% and GPU memory usage by 34.5%.
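Hierarchical memory compression for streaming video can be sketched as a two-tier buffer. This is a generic illustration under assumed names, not FluxMem's design: when the detailed short-term buffer overflows, its oldest entries are mean-pooled into a single compressed long-term slot.

```python
def update_memory(short_term, long_term, feature, cap=4, pool=2):
    """Streaming memory update: append the new frame feature; when the
    short-term buffer exceeds `cap`, mean-pool the oldest `pool` feature
    vectors into one compressed entry in long-term memory."""
    short_term.append(feature)
    while len(short_term) > cap:
        chunk = [short_term.pop(0) for _ in range(pool)]
        pooled = [sum(vals) / len(vals) for vals in zip(*chunk)]
        long_term.append(pooled)
    return short_term, long_term

# Stream five 1-dim frame features through a cap-4 buffer: the two oldest
# collapse into a single averaged long-term entry.
short, long = [], []
for f in [[1.0], [3.0], [5.0], [7.0], [9.0]]:
    short, long = update_memory(short, long, f)
```

Memory and latency savings come from the same place in any such scheme: recent frames stay detailed while older history shrinks by the pooling factor.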
AI · Bullish · arXiv · CS AI · Mar 3 · 6/10
🧠 Researchers developed CaCoVID, a reinforcement learning-based algorithm that compresses video tokens for large language models by selecting tokens based on their actual contribution to correct predictions rather than attention scores. The method uses combinatorial policy optimization to reduce computational overhead while maintaining video understanding performance.
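The contribution-versus-attention distinction can be made concrete with a toy sketch. Note the hedge: CaCoVID learns contributions via combinatorial policy optimization; the leave-one-out scoring below is a crude illustrative stand-in for that learned estimate, and `predict` is a hypothetical confidence-on-correct-answer oracle.

```python
def contribution_scores(tokens, predict):
    """Score each token by how much confidence on the correct answer
    drops when that token is removed (leave-one-out), rather than by
    how much attention it receives."""
    base = predict(tokens)
    return [base - predict(tokens[:i] + tokens[i + 1:])
            for i in range(len(tokens))]

def compress(tokens, predict, keep):
    """Keep only the `keep` tokens with the highest contribution,
    preserving their original order."""
    scores = contribution_scores(tokens, predict)
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return [tokens[i] for i in sorted(ranked[:keep])]

# Toy model: confidence grows only with "key" tokens, so compression
# discards the filler regardless of how salient it might look.
predict = lambda toks: sum(1 for t in toks if t == "key") / 10
kept = compress(["a", "key", "b", "key", "c"], predict, keep=2)
```

The point of the contrast: a token can attract heavy attention yet contribute nothing to the correct answer, and leave-one-out (or a learned policy) measures the latter directly.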
AI · Bullish · arXiv · CS AI · Feb 27 · 5/10
🧠 Researchers developed MomentMix and Length-Aware DETR to improve video moment retrieval, addressing challenges in localizing short video segments based on natural language queries. The method achieves significant performance gains on benchmark datasets, with up to 16.9% improvement in average mAP on QVHighlights.
AI · Bullish · Hugging Face Blog · Feb 20 · 6/10
🧠 SmolVLM2 represents an advancement in multimodal AI technology, bringing video understanding capabilities to smaller devices. This development suggests progress in making AI models more accessible and efficient for edge computing applications.
AI · Neutral · arXiv · CS AI · Mar 11 · 5/10
🧠 Researchers introduce MA-EgoQA, a benchmark for evaluating AI models' ability to understand multiple egocentric video streams from embodied agents simultaneously. The benchmark includes 1.7k questions across five categories and reveals that current approaches struggle with multi-agent, system-level understanding.