y0news

#computer-vision News & Analysis

507 articles tagged with #computer-vision. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI | Bullish | arXiv – CS AI · 2d ago | 7/10

Adapting 2D Multi-Modal Large Language Model for 3D CT Image Analysis

Researchers propose a method to adapt 2D multimodal large language models for 3D medical imaging analysis, introducing a Text-Guided Hierarchical Mixture of Experts framework that enables task-specific feature extraction. The approach demonstrates improved performance on medical report generation and visual question answering tasks while reusing pre-trained parameters from 2D models.

AI | Bullish | arXiv – CS AI · 2d ago | 7/10

SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence

Researchers introduce SpatialScore, a comprehensive benchmark with 5K samples across 30 tasks to evaluate multimodal language models' spatial reasoning capabilities. The work includes SpatialCorpus, a 331K-sample training dataset, and SpatialAgent, a multi-agent system with 12 specialized tools, demonstrating significant improvements in spatial intelligence without additional model training.

AI | Bullish | arXiv – CS AI · 3d ago | 7/10

Evidential Transformation Network: Turning Pretrained Models into Evidential Models for Post-hoc Uncertainty Estimation

Researchers propose Evidential Transformation Network (ETN), a lightweight post-hoc module that converts pretrained models into evidential models for uncertainty estimation without retraining. ETN operates in logit space using sample-dependent affine transformations and Dirichlet distributions, demonstrating improved uncertainty quantification across vision and language benchmarks with minimal computational overhead.

AI | Bearish | arXiv – CS AI · 6d ago | 7/10

Physical Adversarial Attacks on AI Surveillance Systems: Detection, Tracking, and Visible-Infrared Evasion

This research paper examines physical adversarial attacks on AI surveillance systems through a surveillance-oriented lens, emphasizing that robustness cannot be assessed from isolated image benchmarks alone. The study highlights critical gaps in current evaluation practices, including temporal persistence across frames, multi-modal sensing (visible and infrared), realistic attack carriers, and system-level objectives that must be tested under actual deployment constraints.

AI | Bullish | arXiv – CS AI · Apr 7 | 7/10

V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators

Researchers introduce V-Reflection, a new framework that transforms Multimodal Large Language Models (MLLMs) from passive observers to active interrogators through a 'think-then-look' mechanism. The approach addresses perception-related hallucinations in fine-grained tasks by allowing models to dynamically re-examine visual details during reasoning, showing significant improvements across six perception-intensive benchmarks.

AI | Bullish | arXiv – CS AI · Apr 7 | 7/10

Stabilizing Unsupervised Self-Evolution of MLLMs via Continuous Softened Retracing reSampling

Researchers propose Continuous Softened Retracing reSampling (CSRS) to improve the self-evolution of Multimodal Large Language Models by addressing biases in feedback mechanisms. The method uses continuous reward signals instead of binary rewards and achieves state-of-the-art results on mathematical reasoning benchmarks like MathVision using Qwen2.5-VL-7B.

AI | Bullish | arXiv – CS AI · Apr 7 | 7/10

StableTTA: Training-Free Test-Time Adaptation that Improves Model Accuracy on ImageNet1K to 96%

Researchers developed StableTTA, a training-free method that significantly improves AI model accuracy on ImageNet-1K, with 33 models achieving over 95% accuracy and several surpassing 96%. The method allows lightweight architectures to outperform Vision Transformers while using 95% fewer parameters and 89% less computational cost.

AI | Neutral | arXiv – CS AI · Apr 7 | 7/10

Preserving Forgery Artifacts: AI-Generated Video Detection at Native Scale

Researchers developed a new AI-generated video detection framework using a large-scale dataset of 140K videos from 15 generators and the Qwen2.5-VL Vision Transformer. The method operates at native resolution to preserve high-frequency forgery artifacts typically lost in preprocessing, achieving superior performance in detecting synthetic media.

AI | Bullish | arXiv – CS AI · Apr 6 | 7/10

Training Multi-Image Vision Agents via End2End Reinforcement Learning

Researchers introduce IMAgent, an open-source visual AI agent trained with reinforcement learning to handle multi-image reasoning tasks. The system addresses limitations of current VLM-based agents that only process single images, using specialized tools for visual reflection and verification to maintain attention on image content throughout inference.

🏢 OpenAI · 🧠 o1 · 🧠 o3

AI | Neutral | arXiv – CS AI · Apr 6 | 7/10

SAGA: Source Attribution of Generative AI Videos

Researchers introduce SAGA, a comprehensive framework for identifying the specific AI models used to generate synthetic videos, moving beyond simple real/fake detection. The system provides multi-level attribution across authenticity, generation method, model version, and development team using only 0.5% of labeled training data.

AI | Bullish | arXiv – CS AI · Mar 27 | 7/10

GoldiCLIP: The Goldilocks Approach for Balancing Explicit Supervision for Language-Image Pretraining

Researchers developed GoldiCLIP, a data-efficient vision-language model that achieves state-of-the-art performance using only 30 million images, 300x less data than leading methods. The framework combines three key innovations (text-conditioned self-distillation, VQA-integrated encoding, and uncertainty-based loss weighting) to significantly improve image-text retrieval.

AI | Bullish | arXiv – CS AI · Mar 27 | 7/10

Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation

Ming-Flash-Omni is a new 100-billion-parameter multimodal AI model with a Mixture-of-Experts architecture that activates only 6.1 billion parameters per token. The model demonstrates unified capabilities across vision, speech, and language tasks, achieving performance comparable to Gemini 2.5 Pro on vision-language benchmarks.

🧠 Gemini

AI | Bullish | arXiv – CS AI · Mar 27 | 7/10

LLM4AD: Large Language Models for Autonomous Driving - Concept, Review, Benchmark, Experiments, and Future Trends

Researchers have published a comprehensive review of Large Language Models for Autonomous Driving (LLM4AD), introducing new benchmarks and conducting real-world experiments on autonomous vehicle platforms. The paper explores how LLMs can enhance perception, decision-making, and motion control in self-driving cars, while identifying key challenges including latency, security, and safety concerns.

AI | Bearish | arXiv – CS AI · Mar 27 | 7/10

The LLM Bottleneck: Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition

Research reveals that open-source large language models (LLMs) lack hierarchical knowledge of visual taxonomies, creating a bottleneck for vision LLMs in hierarchical visual recognition tasks. The study used one million visual question answering tasks across six taxonomies to demonstrate this limitation, finding that even fine-tuning cannot overcome the underlying LLM knowledge gaps.

AI | Bullish | arXiv – CS AI · Mar 26 | 7/10

E0: Enhancing Generalization and Fine-Grained Control in VLA Models via Tweedie Discrete Diffusion

Researchers introduce E0, a new AI framework that uses Tweedie discrete diffusion to improve Vision-Language-Action (VLA) models for robotic manipulation. The system addresses key limitations in existing VLA models by generating more precise actions through iterative denoising over quantized action tokens, achieving 10.7% better performance on average across 14 diverse robotic environments.

AI | Bullish | arXiv – CS AI · Mar 26 | 7/10

Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification

Researchers developed Attention Imbalance Rectification (AIR), a method to reduce object hallucinations in Large Vision-Language Models by correcting imbalanced attention allocation between vision and language modalities. The technique achieves up to 35.1% reduction in hallucination rates while improving general AI capabilities by up to 15.9%.

AI | Bearish | arXiv – CS AI · Mar 17 | 7/10

AI Evasion and Impersonation Attacks on Facial Re-Identification with Activation Map Explanations

Researchers developed a novel framework for generating adversarial patches that can fool facial recognition systems through both evasion and impersonation attacks. The method reduces facial recognition accuracy from 90% to 0.4% in white-box settings and demonstrates strong cross-model generalization, highlighting critical vulnerabilities in surveillance systems.

AI | Bullish | arXiv – CS AI · Mar 17 | 7/10

Masked Auto-Regressive Variational Acceleration: Fast Inference Makes Practical Reinforcement Learning

Researchers introduce MARVAL, a distillation framework that accelerates masked auto-regressive diffusion models by compressing inference into a single step while enabling practical reinforcement learning applications. The method achieves 30x speedup on ImageNet with comparable quality, making RL post-training feasible for the first time with these models.

AI | Bearish | arXiv – CS AI · Mar 17 | 7/10

Cheating Stereo Matching in Full-scale: Physical Adversarial Attack against Binocular Depth Estimation in Autonomous Driving

Researchers have developed the first physical adversarial attack targeting stereo-based depth estimation in autonomous vehicles, using 3D camouflaged objects that can fool binocular vision systems. The attack employs global texture patterns and a novel merging technique to create nearly invisible threats that cause stereo matching models to produce incorrect depth information.

AI | Bullish | arXiv – CS AI · Mar 17 | 7/10

What Matters for Scalable and Robust Learning in End-to-End Driving Planners?

Researchers introduce BevAD, a new lightweight end-to-end autonomous driving architecture that achieves a 72.7% success rate on the Bench2Drive benchmark. The study systematically analyzes how architectural patterns affect closed-loop driving performance, revealing limitations of open-loop dataset approaches and demonstrating strong data-scaling behavior through pure imitation learning.

AI | Neutral | arXiv – CS AI · Mar 17 | 7/10

How Do Medical MLLMs Fail? A Study on Visual Grounding in Medical Images

Researchers found that medical multimodal large language models (MLLMs) fail primarily because of inadequate visual grounding when analyzing medical images, in contrast to their success on natural scenes. They developed the VGMED evaluation dataset and proposed the VGRefine method, achieving state-of-the-art performance across 6 medical visual question-answering benchmarks without additional training.

AI | Bullish | arXiv – CS AI · Mar 17 | 7/10

From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation

Researchers introduce PRIMO R1, a 7B-parameter AI framework that transforms video MLLMs from passive observers into active critics for robotic manipulation tasks. The system uses reinforcement learning to achieve 50% better accuracy than specialized baselines and outperforms 72B-scale models, establishing state-of-the-art performance on the RoboFail benchmark.

🏢 OpenAI · 🧠 o1

AI | Bullish | arXiv – CS AI · Mar 17 | 7/10

RieMind: Geometry-Grounded Spatial Agent for Scene Understanding

Researchers developed RieMind, a new AI framework that improves spatial reasoning in indoor scenes by 16-50% by separating visual perception from logical reasoning using explicit 3D scene graphs. The system grounds language models in structured geometric representations rather than processing videos end-to-end, achieving significantly better performance on spatial understanding benchmarks.

AI | Bullish | arXiv – CS AI · Mar 17 | 7/10

LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration

Researchers propose LESA, a new framework that accelerates Diffusion Transformers (DiTs) by up to 6.25x using learnable predictors and Kolmogorov-Arnold Networks. The method achieves significant speedups while maintaining or improving generation quality in text-to-image and text-to-video synthesis tasks.
