y0news
#computer-vision
39 articles
AI · Bullish · arXiv – CS AI · 6h ago · 5

DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer

Researchers introduce DiffusionHarmonizer, an AI framework that enhances neural reconstruction simulations for autonomous robots by converting multi-step image diffusion models into single-step enhancers. The system addresses artifacts in NeRF and 3D Gaussian Splatting methods while improving realism for applications like self-driving vehicle simulation.

AI · Bullish · arXiv – CS AI · 6h ago · 4

Radiologist Copilot: An Agentic Framework Orchestrating Specialized Tools for Reliable Radiology Reporting

Researchers have developed Radiologist Copilot, an agentic AI framework that orchestrates specialized tools to cover the entire radiology reporting workflow rather than report generation alone. The system integrates image localization, interpretation, template selection, report composition, and quality control to support radiologists throughout the reporting process.

AI · Neutral · arXiv – CS AI · 6h ago · 3

DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model

Researchers introduce DLEBench, the first benchmark specifically designed to evaluate instruction-based image editing models' ability to edit small-scale objects that occupy only 1%-10% of image area. Testing on 10 models revealed significant performance gaps in small object editing, highlighting a critical limitation in current AI image editing capabilities.

AI · Bullish · arXiv – CS AI · 6h ago · 6

DesignSense: A Human Preference Dataset and Reward Modeling Framework for Graphic Layout Generation

Researchers introduce DesignSense-10k, a dataset of 10,235 human-annotated preference pairs for evaluating graphic layout generation, along with DesignSense, a specialized AI model that outperforms existing models by 54.6% in layout quality assessment. The framework addresses the gap between AI-generated layouts and human aesthetic preferences, showing practical improvements in layout generation through reinforcement learning.

AI · Bullish · arXiv – CS AI · 6h ago · 2

See, Act, Adapt: Active Perception for Unsupervised Cross-Domain Visual Adaptation via Personalized VLM-Guided Agent

Researchers introduce Sea² (See, Act, Adapt), a novel approach that improves AI perception models in new environments by using an intelligent pose-control agent rather than retraining the models themselves. The method keeps perception modules frozen and uses a vision-language model as a controller, achieving significant performance improvements of 13-27% across visual tasks without requiring additional training data.

AI · Bullish · arXiv – CS AI · 6h ago · 4

Hyperdimensional Cross-Modal Alignment of Frozen Language and Image Models for Efficient Image Captioning

Researchers introduce HDFLIM, a framework that aligns vision and language models without computationally expensive fine-tuning: it uses hyperdimensional computing to build cross-modal mappings while keeping the foundation models frozen. The approach achieves performance comparable to traditional training methods while being significantly more resource-efficient.

AI · Bullish · arXiv – CS AI · 6h ago · 3

A Mixed Diet Makes DINO An Omnivorous Vision Encoder

Researchers have developed an 'Omnivorous Vision Encoder' that creates consistent feature representations across different visual modalities (RGB, depth, segmentation) of the same scene. The framework addresses the poor cross-modal alignment in existing vision encoders like DINOv2 by training with dual objectives to maximize feature alignment while preserving discriminative semantics.

AI · Bullish · arXiv – CS AI · 6h ago · 6

SceneTok: A Compressed, Diffusable Token Space for 3D Scenes

SceneTok introduces a novel 3D scene tokenizer that compresses view sets into permutation-invariant tokens, achieving 1-3 orders of magnitude better compression than existing methods while maintaining state-of-the-art reconstruction quality. The system enables efficient 3D scene generation in 5 seconds using a lightweight decoder that can render novel viewpoints.

AI · Bullish · arXiv – CS AI · 6h ago · 6

Less is More: Lean yet Powerful Vision-Language Model for Autonomous Driving

Researchers introduce Max-V1, a vision-language model framework that treats autonomous driving as a language problem, predicting trajectories from camera input. The model achieved a performance improvement of over 30% on the nuScenes dataset and demonstrates strong cross-vehicle adaptability.

AI · Neutral · arXiv – CS AI · 6h ago · 4

Veritas: Generalizable Deepfake Detection via Pattern-Aware Reasoning

Researchers introduce Veritas, a multi-modal large language model designed for deepfake detection that uses pattern-aware reasoning to mimic human forensic processes. The system addresses real-world challenges through the HydraFake dataset and achieves significant improvements in detecting unseen forgeries across different domains.

AI · Bullish · arXiv – CS AI · 6h ago · 4

Interpretable Debiasing of Vision-Language Models for Social Fairness

Researchers have developed DeBiasLens, a framework that uses sparse autoencoders to identify and deactivate social bias neurons in vision-language models (VLMs) without degrading their performance. The model-agnostic approach addresses concerns about unintended social bias in VLMs by making the debiasing process interpretable and targeting internal model dynamics rather than surface-level fixes.

AI · Bullish · arXiv – CS AI · 6h ago · 2

Multimodal Optimal Transport for Unsupervised Temporal Segmentation in Surgical Robotics

Researchers developed TASOT, an unsupervised AI method for surgical phase recognition that combines visual and textual information without requiring expensive large-scale pre-training. The approach showed significant improvements over existing zero-shot methods across multiple surgical datasets, demonstrating that effective surgical AI can be achieved with more efficient training methods.

AI · Bullish · arXiv – CS AI · 6h ago · 3

Pseudo Contrastive Learning for Diagram Comprehension in Multimodal Models

Researchers propose a new training method called pseudo contrastive learning to improve diagram comprehension in multimodal AI models like CLIP. The approach uses synthetic diagram samples to help models better understand fine-grained structural differences in diagrams, showing significant improvements in flowchart understanding tasks.

AI · Bullish · arXiv – CS AI · 6h ago · 15

Reallocating Attention Across Layers to Reduce Multimodal Hallucination

Researchers propose a training-free solution to reduce hallucinations in multimodal AI models by rebalancing attention between perception and reasoning layers. The method achieves a 4.2% improvement in reasoning accuracy with minimal computational overhead.

AI · Bullish · arXiv – CS AI · 6h ago · 4

PointCoT: A Multi-modal Benchmark for Explicit 3D Geometric Reasoning

Researchers introduce PointCoT, a new AI framework that enables multimodal large language models to perform explicit geometric reasoning on 3D point cloud data using Chain-of-Thought methodology. The framework addresses current limitations where AI models suffer from geometric hallucinations by implementing a 'Look, Think, then Answer' paradigm with 86k instruction-tuning samples.

AI · Bullish · arXiv – CS AI · 6h ago · 5

SALIENT: Frequency-Aware Paired Diffusion for Controllable Long-Tail CT Detection

Researchers introduce SALIENT, a frequency-aware diffusion model framework that improves detection of rare lesions in CT scans by generating synthetic training data in the wavelet domain rather than in pixel space. The approach addresses extreme class imbalance in medical imaging through controllable augmentation, achieving significant improvements in detection performance for low-prevalence conditions.

AI · Bullish · arXiv – CS AI · 6h ago · 8

Reasoning-Driven Multimodal LLM for Domain Generalization

Researchers developed RD-MLDG, a new framework that uses multimodal large language models with reasoning chains to improve domain generalization in deep learning. The approach addresses challenges in cross-domain visual recognition by leveraging reasoning capabilities rather than just visual feature invariance, achieving state-of-the-art performance on standard benchmarks.

AI · Bullish · arXiv – CS AI · 6h ago · 7

LiteReality: Graphics-Ready 3D Scene Reconstruction from RGB-D Scans

Researchers have developed LiteReality, a novel pipeline that converts RGB-D scans of indoor environments into compact, realistic 3D virtual replicas suitable for AR/VR, gaming, robotics, and digital twins. The system features scene understanding, object retrieval, material painting, and physics integration to create graphics-ready environments that support object individuality and physically-based rendering.

AI · Bullish · arXiv – CS AI · 6h ago · 4

Less is More: AMBER-AFNO -- a New Benchmark for Lightweight 3D Medical Image Segmentation

Researchers developed AMBER-AFNO, a new lightweight architecture for 3D medical image segmentation that replaces traditional attention mechanisms with Adaptive Fourier Neural Operators. The model achieves state-of-the-art results on medical datasets while maintaining linear memory scaling and quasi-linear computational complexity.

AI · Neutral · arXiv – CS AI · 6h ago · 5

Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks

Researchers introduce Ref-Adv, a new benchmark for testing multimodal large language models' visual reasoning capabilities in referring expression tasks. The benchmark reveals that current MLLMs, despite performing well on standard datasets like RefCOCO, rely heavily on shortcuts and show significant gaps in genuine visual reasoning and grounding abilities.
