224 articles tagged with #multimodal-ai. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI Bullish · arXiv · CS AI · Mar 5 · 6/10
🧠 PRAM-R introduces a new AI framework for autonomous driving that uses LLM-guided modality routing to adaptively select sensors based on environmental conditions. The system achieves 6.22% modality reduction while maintaining trajectory accuracy, demonstrating efficient resource management in multimodal perception systems.
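The routing idea can be sketched as a function from environment conditions to an active sensor subset. This is a minimal illustration only: the summary gives no implementation details, so the LLM call is stubbed with a rule table, and the sensor names and rules are assumptions, not PRAM-R's actual design.

```python
# Hypothetical sketch of LLM-guided modality routing: a router maps an
# environment description to the subset of sensors worth running. The LLM
# is stubbed with a rule table here; names and rules are illustrative.
ALL_SENSORS = ["camera", "lidar", "radar", "gps"]

def route_modalities(conditions: str) -> list[str]:
    """Return the sensors to activate for the described conditions."""
    active = set(ALL_SENSORS)
    text = conditions.lower()
    if "fog" in text or "rain" in text:
        active.discard("camera")   # vision degrades in bad weather
    if "clear" in text and "highway" in text:
        active.discard("lidar")    # cheaper sensors suffice on clear highways
    return sorted(active)
```

Dropping modalities this way is where the reported reduction in sensing cost would come from, at the price of trusting the router's read of the scene.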
AI Neutral · arXiv · CS AI · Mar 5 · 7/10
🧠 Researchers introduced InEdit-Bench, the first evaluation benchmark specifically designed to test image editing models' ability to reason through intermediate logical pathways in multi-step visual transformations. Testing 14 representative models revealed significant shortcomings in handling complex scenarios requiring dynamic reasoning and procedural understanding.
AI Bullish · arXiv · CS AI · Mar 5 · 6/10
🧠 Researchers developed EvoPrune, a new method that prunes visual tokens during the encoding stage of Multimodal Large Language Models (MLLMs) rather than after encoding. The technique achieves 2x inference speedup with less than 1% performance loss on video datasets, addressing efficiency bottlenecks in AI models processing high-resolution images and videos.
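The core operation, keeping only the most salient visual tokens at an encoder stage, can be sketched as follows. The scoring rule and keep ratio are assumptions for illustration; the summary does not describe EvoPrune's actual criteria.

```python
# Illustrative sketch of in-encoder visual token pruning: after an encoder
# stage, keep only the highest-saliency fraction of tokens. Saliency scores
# and the keep ratio are placeholders, not EvoPrune's actual mechanism.

def prune_tokens(tokens, saliency, keep_ratio=0.5):
    """Keep the top `keep_ratio` fraction of tokens by saliency score."""
    k = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: saliency[i], reverse=True)
    kept = sorted(ranked[:k])              # preserve original token order
    return [tokens[i] for i in kept]

tokens = ["t0", "t1", "t2", "t3", "t4", "t5"]
saliency = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3]
print(prune_tokens(tokens, saliency))      # keeps t0, t2, t4
```

Pruning inside the encoder, rather than after it, means later encoder layers also run on fewer tokens, which is where the claimed speedup would compound.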
AI Bullish · arXiv · CS AI · Mar 5 · 7/10
🧠 Researchers released Phi-4-reasoning-vision-15B, a compact open-weight multimodal AI model that combines vision and language capabilities with strong performance in scientific and mathematical reasoning. The model demonstrates that careful architecture design and high-quality data curation can enable smaller models to achieve competitive performance with less compute.
AI Bullish · arXiv · CS AI · Mar 4 · 6/10
🧠 PlayWrite is a new mixed-reality AI system that allows users to create stories by directly manipulating virtual characters and props in XR, rather than through traditional text prompts. The system uses multi-agent AI to interpret user actions into structured narrative elements and generates final stories via large language models, demonstrating a novel approach to AI-human creative collaboration.
AI Bullish · arXiv · CS AI · Mar 4 · 7/10
🧠 Researchers introduce MIRAGE, a novel AI framework that uses knowledge graphs and electronic health records to predict Alzheimer's disease when MRI scans are unavailable. The system improves AD classification rates by 13% compared to single-modality approaches by creating synthetic representations without expensive 3D brain scan reconstruction.
AI Bullish · arXiv · CS AI · Mar 4 · 7/10
🧠 Researchers have released MedXIAOHE, a new medical vision-language AI foundation model that achieves state-of-the-art performance across medical benchmarks and surpasses leading closed-source systems. The model incorporates advanced features like entity-aware pretraining, reinforcement learning for medical reasoning, and evidence-grounded report generation to improve reliability in clinical applications.
AI Bearish · arXiv · CS AI · Mar 4 · 7/10
🧠 Researchers discovered a critical security vulnerability in AI-powered GUI agents on Android, where malicious apps can hijack agent actions without requiring dangerous permissions. The 'Action Rebinding' attack exploits timing gaps between AI observation and action, achieving 100% success rates in tests across six popular Android GUI agents.
AI Bullish · arXiv · CS AI · Mar 4 · 6/10
🧠 Researchers developed a new training-free decoding strategy for Large Vision-Language Models that reduces hallucinations by using query-adaptive visual augmentation and entropy-based token selection. The method showed significant improvements in factual consistency across four LVLMs and seven benchmarks compared to existing approaches.
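The entropy-based piece can be sketched as a gate on the next-token distribution. The threshold and the specific fallback rule below are assumptions for illustration; the summary does not give the paper's exact gating criterion.

```python
import math

# Sketch of entropy-gated decoding: when the model's next-token distribution
# is high-entropy (uncertain), prefer the distribution conditioned on the
# augmented visual input. Threshold and mixing rule are illustrative.

def entropy(p):
    """Shannon entropy (nats) of a probability distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def select_distribution(p_plain, p_augmented, threshold=1.0):
    """Use the visually augmented distribution when the plain one is uncertain."""
    return p_augmented if entropy(p_plain) > threshold else p_plain
```

The intuition is that hallucinations tend to surface as uncertain (high-entropy) predictions, which is exactly when grounding in the augmented visual evidence should help most.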
AI Bearish · arXiv · CS AI · Mar 4 · 7/10
🧠 Researchers have identified a critical privacy vulnerability in multi-modal large reasoning models (MLRMs) where adversaries can infer users' sensitive location information from images, including home addresses from selfies. The study introduces the DoxBench dataset and demonstrates that 11 advanced MLRMs consistently outperform humans in geolocation inference, significantly lowering barriers for privacy attacks.
AI Neutral · arXiv · CS AI · Mar 4 · 6/10
🧠 Researchers introduce CFE-Bench, a new multimodal benchmark for evaluating AI reasoning across 20+ STEM domains using authentic university exam problems. The best performing model, Gemini-3.1-pro-preview, achieved only 59.69% accuracy, highlighting significant gaps in AI reasoning capabilities, particularly in maintaining correct intermediate states through multi-step solutions.
AI Bullish · arXiv · CS AI · Mar 4 · 6/10
🧠 Researchers introduce Perception-R1, a new approach to enhance multimodal reasoning in large language models by improving visual perception capabilities through reinforcement learning with visual perception rewards. The method achieves state-of-the-art performance on multimodal reasoning benchmarks using only 1,442 training samples.
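One plausible shape for such a reward, combining answer correctness with how well the model's stated visual observations match annotations, is sketched below. The Jaccard scoring and the weighting are assumptions; the summary does not specify Perception-R1's actual reward.

```python
# Hypothetical sketch of an RL reward shaped by visual perception quality:
# overlap between the visual elements the model claims to see and the
# annotated ones, blended with answer correctness. All details are assumed.

def perception_reward(pred_objects, gold_objects):
    """Jaccard overlap between predicted and annotated visual elements."""
    pred, gold = set(pred_objects), set(gold_objects)
    return len(pred & gold) / len(pred | gold) if pred | gold else 1.0

def total_reward(answer_correct, pred_objects, gold_objects, alpha=0.5):
    """Blend answer correctness with perception quality."""
    return alpha * float(answer_correct) + (1 - alpha) * perception_reward(
        pred_objects, gold_objects)
```

Rewarding perception directly, rather than only final answers, gives the policy a denser training signal, which is one way a method could get by with so few samples.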
AI Neutral · arXiv · CS AI · Mar 4 · 6/10
🧠 Researchers introduce UniG2U-Bench, a comprehensive benchmark testing whether unified multimodal AI models that can generate content actually understand better than traditional vision-language models. The study of over 30 models reveals that unified models generally underperform their base counterparts, though they show improvements in spatial intelligence and visual reasoning tasks.
AI Bullish · arXiv · CS AI · Mar 4 · 6/10
🧠 Researchers propose iGVLM, a new framework that addresses limitations in Large Vision-Language Models by introducing dynamic instruction-guided visual encoding. The system uses a dual-branch architecture to enable task-specific visual reasoning while preserving pre-trained visual knowledge.
AI Bullish · arXiv · CS AI · Mar 3 · 7/10
🧠 Researchers have published a comprehensive survey exploring the integration of Large Language Models (LLMs) with Uncrewed Aerial Vehicles (UAVs), proposing a unified framework for intelligent drone operations. The study examines how LLMs can enhance UAV capabilities including swarm coordination, navigation, mission planning, and human-drone interaction through advanced reasoning and multimodal processing.
AI Bullish · arXiv · CS AI · Mar 3 · 7/10
🧠 Researchers introduce UME-R1, a breakthrough multimodal embedding framework that combines discriminative and generative approaches using reasoning-driven AI. The system demonstrates significant performance improvements across 78 benchmark tasks by leveraging the generative reasoning capabilities of multimodal large language models.
AI Neutral · arXiv · CS AI · Mar 3 · 7/10
🧠 Researchers have introduced WorldSense, the first benchmark for evaluating multimodal AI systems that process visual, audio, and text inputs simultaneously. The benchmark contains 1,662 synchronized audio-visual videos across 67 subcategories and 3,172 QA pairs, revealing that current state-of-the-art models achieve only 65.1% accuracy on real-world understanding tasks.
AI Bullish · arXiv · CS AI · Mar 3 · 7/10
🧠 Researchers developed NANOMIND, a software-hardware framework that optimizes Large Multimodal Models for battery-powered devices by breaking them into modular components and mapping each to optimal accelerators. The system achieves 42.3% energy reduction and enables 20.8 hours of operation running LLaVA-OneVision on a compact device without network connectivity.
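The component-to-accelerator mapping can be pictured as a simple placement table. The component and accelerator names below are illustrative assumptions in the spirit of the summary, not NANOMIND's actual schedule.

```python
# Illustrative component-to-accelerator placement for a modularized
# multimodal model on an embedded board. Names are hypothetical.
PLACEMENT = {
    "vision_encoder": "npu",   # dense attention/conv -> neural accelerator
    "projector":      "dsp",   # small matmuls -> low-power DSP
    "llm_decoder":    "npu",
    "audio_frontend": "dsp",
    "tokenizer":      "cpu",   # branchy, latency-insensitive work
}

def device_for(component: str) -> str:
    """Look up the target accelerator, defaulting to the CPU."""
    return PLACEMENT.get(component, "cpu")
```

Energy savings in such a design come from each block running on the accelerator whose power profile suits it, rather than pushing the whole model through one general-purpose processor.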
AI Neutral · arXiv · CS AI · Mar 3 · 7/10
🧠 Researchers identify a 'safety mirage' problem in vision language models where supervised fine-tuning creates spurious correlations that make models vulnerable to simple attacks and overly cautious with benign queries. They propose machine unlearning as an alternative that reduces attack success rates by up to 60.27% and unnecessary rejections by over 84.20%.
AI Bullish · arXiv · CS AI · Mar 3 · 7/10
🧠 Researchers introduce Uni-X, a novel architecture for unified multimodal AI models that addresses gradient conflicts between vision and text processing. The X-shaped design uses modality-specific processing at input/output layers while sharing middle layers, achieving superior efficiency and matching 7B parameter models with only 3B parameters.
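The X-shape described, separate per-modality pathways at the input and output ends around one shared trunk, can be sketched at the dataflow level. The stub "layers" below are string-building placeholders standing in for real network blocks; layer counts and names are not from the paper.

```python
# Shape-level sketch of an X-shaped unified model: modality-specific
# input/output pathways around a shared middle trunk. The lambdas are
# placeholders for real encoder/decoder blocks.

class UniXSketch:
    def __init__(self):
        self.vision_in  = lambda x: f"venc({x})"
        self.text_in    = lambda x: f"tenc({x})"
        self.trunk      = lambda h: f"trunk({h})"   # shared middle layers
        self.vision_out = lambda h: f"vdec({h})"
        self.text_out   = lambda h: f"tdec({h})"

    def forward(self, x, modality: str):
        h = self.vision_in(x) if modality == "vision" else self.text_in(x)
        h = self.trunk(h)     # both modalities' gradients meet only here
        return self.vision_out(h) if modality == "vision" else self.text_out(h)
```

Keeping the modality-specific layers at the ends is what confines vision-vs-text gradient conflicts to the shared trunk, where representations are already more abstract.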
AI Neutral · arXiv · CS AI · Mar 3 · 7/10
🧠 Researchers introduced MMR-Life, a comprehensive benchmark with 2,646 questions and 19,108 real-world images to evaluate multimodal reasoning capabilities of AI models. Even top models like GPT-5 achieved only 58% accuracy, highlighting significant challenges in real-world multimodal reasoning across seven different reasoning types.
AI Bullish · arXiv · CS AI · Mar 3 · 7/10
🧠 Researchers introduce UniWeTok, a unified binary tokenizer with a massive 2^128 codebook for multimodal large language models. The system achieves state-of-the-art image generation performance on ImageNet while requiring significantly less training compute than existing solutions.
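A 2^128 codebook is too large to store explicitly, which is the point of a binary code: each of 128 feature dimensions contributes one bit, so every 128-bit string is a valid code and no embedding table is enumerated. The sign-based rule below is an assumption for illustration; the summary does not describe UniWeTok's actual quantizer.

```python
# Sketch of a binary tokenizer with an implicit 2^128 codebook: sign-binarize
# a 128-dim feature vector into a 128-bit integer code. The sign rule is an
# assumption, not UniWeTok's actual quantization scheme.

def binarize(features):
    """Map a 128-dim real vector to an integer code in [0, 2^128)."""
    assert len(features) == 128
    code = 0
    for bit, value in enumerate(features):
        if value > 0:
            code |= 1 << bit
    return code
```

Because the codebook is implicit, the quantizer needs no nearest-neighbor search over stored entries, one plausible source of the reported training-compute savings.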
AI Bullish · arXiv · CS AI · Feb 27 · 7/10
🧠 Researchers developed PathVis, a mixed-reality platform for Apple Vision Pro that revolutionizes digital pathology by allowing pathologists to examine gigapixel cancer diagnostic images through immersive visualization and multimodal AI assistance. The system replaces traditional 2D monitor limitations with natural interactions using eye gaze, hand gestures, and voice commands, integrated with AI agents for computer-aided diagnosis.
AI Bullish · arXiv · CS AI · Feb 27 · 7/10
🧠 Researchers introduce OmniGAIA, a comprehensive benchmark for evaluating omni-modal AI agents that can process video, audio, and image data simultaneously with complex reasoning capabilities. They also propose OmniAtlas, a foundation agent that enhances existing open-source models' ability to use tools across multiple modalities, marking progress toward more capable AI assistants.
AI Bullish · arXiv · CS AI · Feb 27 · 7/10
🧠 Researchers propose a 'Trinity of Consistency' framework for developing General World Models in AI, consisting of Modal, Spatial, and Temporal consistency principles. They introduce CoW-Bench, a new benchmark for evaluating video generation models and unified multimodal models, aiming to establish a principled pathway toward AGI-capable world simulation systems.