507 articles tagged with #computer-vision. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Neutral · arXiv - CS AI · Mar 17 · 7/10
🧠 Researchers introduced VideoSafetyEval, a benchmark revealing that video-based large language models have 34.2% worse safety performance than image-based models. They developed VideoSafety-R1, a dual-stage framework that achieves 71.1% improvement in safety through alarm token-guided fine-tuning and safety-guided reinforcement learning.
AI · Bullish · arXiv - CS AI · Mar 17 · 7/10
🧠 Researchers developed RieMind, a new AI framework that improves spatial reasoning in indoor scenes by 16-50% by separating visual perception from logical reasoning using explicit 3D scene graphs. The system grounds language models in structured geometric representations rather than processing videos end-to-end, achieving significantly better performance on spatial understanding benchmarks.
AI · Bearish · arXiv - CS AI · Mar 17 · 7/10
🧠 Researchers developed a novel framework for generating adversarial patches that can fool facial recognition systems through both evasion and impersonation attacks. The method reduces facial recognition accuracy from 90% to 0.4% in white-box settings and demonstrates strong cross-model generalization, highlighting critical vulnerabilities in surveillance systems.
AI · Bullish · arXiv - CS AI · Mar 17 · 7/10
🧠 Researchers introduce BevAD, a new lightweight end-to-end autonomous driving architecture that achieves a 72.7% success rate on the Bench2Drive benchmark. The study systematically analyzes architectural patterns in closed-loop driving performance, revealing limitations of open-loop dataset approaches and demonstrating strong data-scaling behavior through pure imitation learning.
AI · Bullish · arXiv - CS AI · Mar 17 · 7/10
🧠 Researchers introduce PRIMO R1, a 7B-parameter AI framework that transforms video MLLMs from passive observers into active critics for robotic manipulation tasks. The system uses reinforcement learning to achieve 50% better accuracy than specialized baselines and outperforms 72B-scale models, establishing state-of-the-art performance on the RoboFail benchmark.
🏢 OpenAI 🧠 o1
AI · Bullish · arXiv - CS AI · Mar 16 · 7/10
🧠 DriveMind introduces a new AI framework combining vision-language models with reinforcement learning for autonomous driving, achieving significant performance improvements in safety and route completion. The system demonstrates strong cross-domain generalization from simulation to real-world dash-cam data, suggesting practical deployment potential.
AI · Bullish · arXiv - CS AI · Mar 16 · 7/10
🧠 Researchers introduce improved methods for stitching Vision Foundation Models (VFMs) like CLIP and DINOv2, enabling integration of different models' strengths. The study proposes the VFM Stitch Tree (VST) technique, which allows controllable accuracy-latency trade-offs for multimodal applications.
AI · Bullish · arXiv - CS AI · Mar 16 · 7/10
🧠 Researchers propose AIM, a novel AI model modulation paradigm that allows a single model to exhibit diverse behaviors without maintaining multiple specialized versions. The approach uses logits redistribution to enable dynamic control over output quality and input feature focus without requiring retraining or additional training data.
🧠 Llama
AI · Bullish · arXiv - CS AI · Mar 12 · 7/10
🧠 Researchers propose ROVA, a new training framework that improves vision-language models' robustness in real-world conditions, with accuracy gains of up to 24%. The framework addresses performance degradation from weather, occlusion, and camera motion, which can cause accuracy drops of up to 35% in current models.
AI · Bullish · arXiv - CS AI · Mar 12 · 7/10
🧠 Researchers developed HyMEM, a brain-inspired hybrid memory system that significantly improves GUI agents' ability to interact with computers. The system uses graph-based structured memory combining symbolic nodes with trajectory embeddings, enabling smaller 7B/8B models to match or exceed performance of larger closed-source models like GPT-4o.
🧠 GPT-4
AI · Bullish · arXiv - CS AI · Mar 11 · 7/10
🧠 Researchers introduce BiCLIP, a new framework that improves vision-language models' ability to adapt to specialized domains through geometric transformations. The approach achieves state-of-the-art results across 11 benchmarks while maintaining simplicity and low computational requirements.
AI · Bullish · arXiv - CS AI · Mar 11 · 7/10
🧠 Researchers introduce World2Mind, a training-free spatial intelligence toolkit that enhances foundation models' 3D spatial reasoning capabilities by up to 18%. The system uses 3D reconstruction and cognitive mapping to create structured spatial representations, enabling text-only models to perform complex spatial reasoning tasks.
🧠 GPT-5
AI · Bullish · arXiv - CS AI · Mar 11 · 7/10
🧠 Researchers introduce FCDM, a fully convolutional diffusion model based on the ConvNeXt architecture that achieves competitive performance with DiT-XL/2 using only 50% of the computational resources. The model demonstrates exceptional training efficiency: it requires 7x fewer training steps and can be trained on just 4 GPUs, reviving convolutional networks as an efficient alternative to Transformer-based diffusion models.
AI · Bullish · arXiv - CS AI · Mar 11 · 7/10
🧠 Researchers developed EyExIn, a new AI framework that addresses critical gaps in large vision language models for medical diagnosis by anchoring them with domain-specific expert knowledge. The system uses dual-stream encoding and deep expert injection to improve accuracy in ophthalmic diagnosis, outperforming existing proprietary systems across four benchmarks.
AI · Bullish · arXiv - CS AI · Mar 9 · 7/10
🧠 Researchers introduce RAG-Driver, a retrieval-augmented multi-modal large language model designed for autonomous driving that can provide explainable decisions and control predictions. The system addresses data scarcity and generalization challenges in AI-driven autonomous vehicles by using in-context learning and expert demonstration retrieval.
AI · Bullish · arXiv - CS AI · Mar 9 · 7/10
🧠 Researchers introduce PSIVG, a framework that integrates physical simulators into AI video generation to ensure generated videos obey real-world physics like gravity and collision. The system reconstructs 4D scenes from template videos and uses physical simulations to guide video generators toward more realistic motion while maintaining visual quality.
AI · Bullish · arXiv - CS AI · Mar 9 · 7/10
🧠 Researchers introduce BEVLM, a framework that integrates Large Language Models with Bird's-Eye View representations for autonomous driving. The approach improves LLM reasoning accuracy in cross-view driving scenarios by 46% and enhances end-to-end driving performance by 29% in safety-critical situations.
AI · Bullish · arXiv - CS AI · Mar 9 · 7/10
🧠 Researchers introduced TADPO, a novel reinforcement learning approach that extends PPO for autonomous off-road driving. The system achieved successful zero-shot sim-to-real transfer on a full-scale off-road vehicle, marking the first RL-based policy deployment on such a platform.
AI · Bullish · arXiv - CS AI · Mar 9 · 7/10
🧠 Researchers introduced SPARC, a framework that creates unified latent spaces across different AI models and modalities, enabling direct comparison of how various architectures represent identical concepts. The method achieves 0.80 Jaccard similarity on Open Images, tripling alignment compared to previous methods, and enables practical applications like text-guided spatial localization in vision-only models.
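For readers unfamiliar with the metric cited above, here is a minimal sketch of the generic set-based Jaccard similarity, |A ∩ B| / |A ∪ B|. The summary does not say how SPARC derives the sets it compares, so the function name `jaccard` and the two concept sets below are purely hypothetical illustrations, not the paper's procedure.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; treated as 1.0 when both sets are empty."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


# Hypothetical example: concepts activated by two different models for one image.
vision_concepts = {"dog", "grass", "ball", "sky"}
text_concepts = {"dog", "grass", "ball", "sky", "tree"}

# 4 shared concepts out of 5 distinct concepts overall.
score = jaccard(vision_concepts, text_concepts)
print(f"Jaccard similarity: {score:.2f}")
```

A score of 1.0 means the two sets are identical and 0.0 means they share nothing, so a value of 0.80 corresponds to four shared items for every five distinct ones.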
AI · Bullish · arXiv - CS AI · Mar 9 · 7/10
🧠 Researchers have developed CanvasMAR, a new masked autoregressive video prediction model that generates high-quality videos with fewer sampling steps by using a "canvas" approach that provides global structure early in the generation process. The model demonstrates superior performance on major benchmarks including BAIR, UCF-101, and Kinetics-600, rivaling advanced diffusion-based methods.
AI · Neutral · arXiv - CS AI · Mar 5 · 6/10
🧠 Researchers have identified Order-to-Space Bias (OTS) in modern image generation models, where the order in which entities are mentioned in a text prompt incorrectly determines spatial layout and role assignments. The study introduces OTS-Bench to measure this bias and demonstrates that targeted fine-tuning and early-stage interventions can reduce the problem while maintaining generation quality.
AI · Bullish · arXiv - CS AI · Mar 5 · 7/10
🧠 Researchers developed MPFlow, a new zero-shot MRI reconstruction framework that uses multi-modal data and rectified flow to improve medical imaging quality. The system reduces tumor hallucinations by 15% while using 80% fewer sampling steps compared to existing diffusion methods, potentially advancing AI applications in medical diagnostics.
AI · Bullish · arXiv - CS AI · Mar 5 · 7/10
🧠 Researchers have developed Sim2Sea, a comprehensive framework that successfully bridges the simulation-to-reality gap for autonomous maritime vessel navigation in congested waters. The system uses GPU-accelerated parallel simulation, dual-stream spatiotemporal policy, and targeted domain randomization to achieve zero-shot transfer from simulation to real-world deployment on a 17-ton unmanned vessel.
AI · Bullish · arXiv - CS AI · Mar 5 · 6/10
🧠 Researchers introduce GeoSeg, a zero-shot, training-free framework for AI-driven segmentation of remote sensing imagery that uses multimodal language models for reasoning without requiring specialized training data. The system addresses domain-specific challenges in satellite and aerial image analysis through bias-aware coordinate refinement and dual-route prompting mechanisms.
AI · Bearish · arXiv - CS AI · Mar 5 · 7/10
🧠 Researchers have developed Image-based Prompt Injection (IPI), a black-box attack that embeds adversarial instructions into natural images to manipulate multimodal AI models. In tests on GPT-4-turbo, the attack achieved a success rate of up to 64%, demonstrating a significant security vulnerability in vision-language AI systems.
🧠 GPT-4