AI · Neutral · arXiv – CS AI · Mar 17 · 7/10
🧠Research comparing 200 humans and 95 AI detectors found humans significantly outperform AI at detecting deepfakes, especially in low-quality mobile phone videos where AI accuracy drops to near chance levels. The study reveals human-AI hybrid systems are most effective, as humans and AI make complementary errors in deepfake detection.
AI · Neutral · arXiv – CS AI · Jun 11 · 6/10
🧠Researchers developed a multimodal machine learning approach using frozen pretrained encoders (CLIP, Whisper, RoBERTa) to predict personality traits and cognitive ability from asynchronous video interviews, achieving 19.1% improvement over baseline on personality assessment but revealing potential dataset shortcuts in cognitive ability evaluation.
AI · Neutral · arXiv – CS AI · Jun 11 · 6/10
🧠RelayFormer is a new deep learning framework that unifies image and video manipulation detection through a flexible attention mechanism called Global Local Relay (GLR) tokens. The approach handles variable resolutions without distortion and processes both static and temporal data with a single architecture, addressing key limitations in current visual forensics methods.
AI · Neutral · arXiv – CS AI · Jun 9 · 6/10
🧠Researchers introduce Spatio-Temporal Bound Propagation (STBP), a verification framework for neural networks processing video and volumetric data that provides formal robustness guarantees under realistic adversarial constraints. The method achieves 1.7x higher certified robust accuracy compared to existing approaches while maintaining computational scalability, addressing a critical gap in AI safety for applications like autonomous driving and medical imaging.
AI · Neutral · arXiv – CS AI · Jun 8 · 6/10
🧠Researchers introduce MotionEnhancer, a novel technique that combines Video Diffusion Models with Vision-Language Models to improve fine-grained motion understanding in video analysis. The parameter-free approach uses attention alignment to extract motion priors without requiring additional training or architectural modifications, achieving consistent improvements on motion-understanding benchmarks.
AI · Neutral · arXiv – CS AI · Jun 5 · 6/10
🧠Researchers introduce UNIVID, a unified vision-language model designed for large-scale video moderation that generates interpretable policy-aware captions instead of opaque classification outputs. The system reduces violation detection errors by 42.7% and false positives by 37.0% while consolidating over 1,000 specialized models into a single backbone, demonstrating practical AI efficiency gains in content moderation infrastructure.
AI · Neutral · arXiv – CS AI · Jun 2 · 5/10
🧠Researchers introduce UE-MCM, a dual-model AI system that combines small and large models to detect mistakes in egocentric instructional videos, particularly excelling at identifying rare errors through adaptive fusion and long-tailed distribution handling. The approach balances computational efficiency with accuracy for practical deployment in video analysis tasks.
AI · Neutral · arXiv – CS AI · May 28 · 5/10
🧠Researchers introduce the Video Important Person (VIP) identification task and Temporal-VIP dataset to automatically identify key individuals in video scenes while addressing the Temporal Importance Shift phenomenon. The VIP-Net framework achieves 67.3% accuracy, significantly outperforming existing methods (37.5%-53.9%), with applications in automated video editing and intelligent surveillance.
🏢 Hugging Face
AI · Neutral · arXiv – CS AI · May 27 · 6/10
🧠Researchers have developed an interpretable AI framework for assessing suicide risk in metro stations using surveillance video analysis, achieving 83.2% ROC-AUC by combining person tracking, activity recognition, and trajectory analysis. This work addresses a critical public health challenge by enabling early identification of high-risk situations that could facilitate timely intervention.
AI · Bullish · arXiv – CS AI · Mar 26 · 6/10
🧠Researchers introduced LensWalk, an agentic AI framework that enables Large Language Models to actively control their visual observation of videos through dynamic temporal sampling. The system uses a reason-plan-observe loop to progressively gather evidence, achieving 5% accuracy improvements on challenging video benchmarks without requiring model fine-tuning.
AI · Neutral · arXiv – CS AI · Mar 17 · 6/10
🧠Research finds that visual and audio distortions lead viewers to question the credibility of deepfake videos. Across three experiments, both technical artifacts and distortions in synthetic media reduced perceived credibility, though human perception of deepfakes remains poorly understood.
AI · Bullish · arXiv – CS AI · Mar 17 · 5/10
🧠Researchers developed a question-aware keyframe selection framework for video question answering that trains on pseudo labels generated by large multimodal models, together with coverage regularization. The method significantly improves accuracy on temporal and causal questions in the NExT-QA dataset and reduces inference costs, making video analysis more efficient.
AI · Bullish · arXiv – CS AI · Mar 17 · 5/10
🧠Researchers have developed a Video-Guided Post-ASR Correction (VPC) framework that uses Video-Large Multimodal Models to improve speech recognition accuracy in complex environments like TV series. The system addresses challenges with multiple speakers, overlapping speech, and domain-specific terminology by leveraging video context to refine ASR outputs.
AI · Neutral · arXiv – CS AI · Mar 16 · 4/10
🧠Team LEYA developed a multimodal AI approach for recognizing ambivalence and hesitancy in videos for the 10th ABAW Competition, combining scene, facial, audio, and text analysis. Their fusion model achieved 83.25% accuracy compared to 70.02% for single-modality approaches, demonstrating significant improvements in behavioral recognition technology.