AIBullisharXiv – CS AI · 3d ago7/10
🧠Researchers introduce ESRT, a privacy-preserving edge-cloud framework for multilingual speech-to-text translation that processes voice data locally while transmitting only compressed features to the cloud. The system achieves state-of-the-art performance across 45 languages while reducing bandwidth requirements by 10x and preventing voiceprint leakage.
AIBearisharXiv – CS AI · 4d ago7/10
🧠Researchers introduce VisualNeedle, a benchmark that exposes limitations in multimodal large language models' ability to perform genuine fine-grained visual search in information-dense scenes. Despite frontier MLLMs reporting over 90% accuracy on existing benchmarks, VisualNeedle reveals that these models struggle significantly when critical evidence is spatially constrained to minute regions, with the best model achieving only 56% accuracy versus 63% human performance.
AIBearisharXiv – CS AI · 4d ago7/10
🧠Researchers introduced CAIT, a benchmark testing multimodal large language models' ability to understand counter-intuitive visual scenes that contradict common sense. The study reveals that open-source MLLMs fail dramatically at these tasks due to language bias, automatically overriding visual evidence with statistically common text patterns, while proprietary models like Claude and Gemini demonstrate robust performance.
🧠 Claude🧠 Gemini
AIBullisharXiv – CS AI · May 127/10
🧠Researchers introduce MAGIC-Video, a training-free framework that enables multimodal AI systems to process and reason about ultra-long videos spanning days or weeks by combining a structured memory graph with narrative chains. The system outperforms existing baselines on multiple benchmarks, addressing a critical limitation where current LLMs can only handle tens of minutes of video despite having million-token context windows.
AIBullisharXiv – CS AI · May 127/10
🧠Flame3D introduces a training-free framework that enables large language models to reason about 3D scenes compositionally without requiring 3D-specific training data. The system represents scenes as editable visual-textual memories and allows agents to synthesize custom spatial programs at inference time, achieving competitive results on existing benchmarks while opening new possibilities for multi-hop spatial reasoning.
AIBullisharXiv – CS AI · May 77/10
🧠RetentiveKV introduces an entropy-driven optimization method for multimodal large language models that achieves 5x KV cache compression and 1.5x decoding acceleration by reformulating token eviction as continuous memory evolution rather than discrete pruning. The approach addresses limitations of existing compression methods by accounting for visual tokens that gain importance later in decoding and preserving spatial continuity of visual information.
AIBullisharXiv – CS AI · Apr 157/10
🧠Researchers introduce DocSeeker, a multimodal AI system designed to improve long document understanding by implementing structured analysis, localization, and reasoning workflows. The breakthrough addresses critical limitations in existing large language models that struggle with lengthy documents due to high noise levels and weak training signals, achieving superior performance on both short and ultra-long documents.
AIBullisharXiv – CS AI · Apr 147/10
🧠MM-LIMA demonstrates that multimodal large language models can achieve superior performance using only 200 high-quality instruction examples—6% of the data used in comparable systems. Researchers developed quality metrics and an automated data selector to filter vision-language datasets, showing that strategic data curation outweighs raw dataset size in model alignment.
AIBullisharXiv – CS AI · Apr 147/10
🧠Researchers introduce LAST, a framework that enhances multimodal large language models' spatial reasoning by integrating specialized vision tools through an interactive sandbox interface. The approach achieves ~20% performance improvements over baseline models and outperforms proprietary closed-source LLMs on spatial reasoning tasks by converting complex tool outputs into consumable hints for language models.
AIBullisharXiv – CS AI · Apr 147/10
🧠Researchers propose a method to adapt 2D multimodal large language models for 3D medical imaging analysis, introducing a Text-Guided Hierarchical Mixture of Experts framework that enables task-specific feature extraction. The approach demonstrates improved performance on medical report generation and visual question answering tasks while reusing pre-trained parameters from 2D models.
AIBullisharXiv – CS AI · Apr 107/10
🧠Q-Zoom is a new framework that improves the efficiency of multimodal large language models by intelligently processing high-resolution visual inputs. Using adaptive query-aware perception, the system achieves 2.5-4.4x faster inference speeds on document and high-resolution tasks while maintaining or exceeding baseline accuracy across multiple MLLM architectures.
AIBullisharXiv – CS AI · Apr 107/10
🧠Researchers propose Faithful-First RPA, a framework that improves multimodal AI reasoning by prioritizing faithfulness to visual evidence. The method uses FaithEvi for supervision and FaithAct for execution, achieving up to 24% improvement in perceptual faithfulness without sacrificing task accuracy.
AIBullisharXiv – CS AI · Apr 77/10
🧠Researchers propose Continuous Softened Retracing reSampling (CSRS) to improve the self-evolution of Multimodal Large Language Models by addressing biases in feedback mechanisms. The method uses continuous reward signals instead of binary rewards and achieves state-of-the-art results on mathematical reasoning benchmarks like MathVision using Qwen2.5-VL-7B.
AIBullisharXiv – CS AI · Mar 177/10
🧠Researchers developed SToRM, a new framework that reduces computational costs for autonomous driving systems using multi-modal large language models by up to 30x while maintaining performance. The system uses supervised token reduction techniques to enable real-time end-to-end driving on standard GPUs without sacrificing safety or accuracy.
AINeutralarXiv – CS AI · Mar 177/10
🧠Researchers identified that medical multimodal large language models (MLLMs) fail primarily due to inadequate visual grounding capabilities when analyzing medical images, unlike their success with natural scenes. They developed VGMED evaluation dataset and proposed VGRefine method, achieving state-of-the-art performance across 6 medical visual question-answering benchmarks without additional training.
AIBullisharXiv – CS AI · Mar 167/10
🧠Researchers developed HeteroServe, a system that optimizes multimodal large language model inference by partitioning vision encoding and language generation across different GPU tiers. The approach reduces data transfer requirements and achieves 31-40% cost savings while improving throughput by up to 54% compared to existing systems.
AIBullisharXiv – CS AI · Mar 117/10
🧠Researchers have developed Meissa, a lightweight 4B-parameter medical AI model that brings advanced agentic capabilities offline for healthcare applications. The system matches frontier models like GPT in medical benchmarks while operating with 25x fewer parameters and 22x lower latency, addressing privacy and cost concerns in clinical settings.
🧠 Gemini
AINeutralarXiv – CS AI · Mar 57/10
🧠Researchers introduce SpatialBench, a comprehensive benchmark for evaluating spatial cognition in multimodal large language models (MLLMs). The framework reveals that while MLLMs excel at perceptual grounding, they struggle with symbolic reasoning, causal inference, and planning compared to humans who demonstrate more goal-directed spatial abstraction.
AIBullisharXiv – CS AI · Mar 47/103
🧠Researchers introduce OptMerge, a new benchmark and method for combining multiple expert Multimodal Large Language Models (MLLMs) into single, more capable models without requiring additional training data. The approach achieves 2.48% average performance gains while reducing storage and serving costs by merging models across different modalities like vision, audio, and video.
AIBullisharXiv – CS AI · Mar 37/105
🧠Researchers propose Vid-LLM, a new video-based 3D multimodal large language model that processes video inputs without requiring external 3D data for scene understanding. The model uses a Cross-Task Adapter module and Metric Depth Model to integrate geometric cues and maintain consistency across 3D tasks like question answering and visual grounding.
AIBullisharXiv – CS AI · Mar 37/104
🧠Researchers developed SpiroLLM, the first multimodal large language model capable of understanding spirogram time series data for COPD diagnosis. Using data from 234,028 UK Biobank individuals, the model achieved 0.8977 diagnostic AUROC and maintained 100% valid response rate even with missing data, far outperforming text-only models.
AINeutralarXiv – CS AI · Feb 277/106
🧠Researchers introduce ProactiveMobile, a new benchmark for developing AI agents that can proactively anticipate user needs on mobile devices rather than just responding to commands. The benchmark includes over 3,600 test instances across 14 scenarios, with current models achieving low success rates, indicating significant room for improvement in proactive AI capabilities.
AINeutralarXiv – CS AI · Feb 277/106
🧠Researchers identified a fundamental limitation in multimodal LLMs where decoders trained on text cannot effectively utilize non-text information like speaker identity or visual textures, despite this information being preserved through all model layers. The study demonstrates this 'modality collapse' is due to decoder design rather than encoding failures, with experiments showing targeted training can improve specific modality accessibility.
AIBullisharXiv – CS AI · 3d ago6/10
🧠Researchers introduce MOV-Bench, a benchmark for evaluating multi-hop audio-visual reasoning in large language models, and propose AOP-Agent, an agentic framework that enables open-source multimodal LLMs to perform active perception across temporally dispersed audio and visual evidence without additional training.
AINeutralarXiv – CS AI · 4d ago6/10
🧠Researchers introduced PhyWorldBench, a comprehensive benchmark that evaluates text-to-video generation models on their ability to simulate real-world physics accurately. Testing 12 state-of-the-art models across 1,050 prompts, the study reveals significant gaps in how current AI video generators handle physical phenomena, from basic object motion to complex interactions, while also introducing novel evaluation methods using multimodal language models.