AIBullisharXiv – CS AI · Apr 66/10
🧠Researchers developed QAPruner, a new framework that simultaneously optimizes vision token pruning and post-training quantization for Multimodal Large Language Models (MLLMs). The method addresses the problem where traditional token pruning can discard important activation outliers needed for quantization stability, achieving 2.24% accuracy improvement over baselines while retaining only 12.5% of visual tokens.
AINeutralarXiv – CS AI · Mar 266/10
🧠Researchers introduce GeoSketch, a neural-symbolic AI framework that solves geometric problems through dynamic visual manipulation, including drawing auxiliary lines and applying transformations. The system combines perception, symbolic reasoning, and interactive sketch actions, achieving superior performance on geometric problem-solving benchmarks compared to static image processing methods.
AINeutralarXiv – CS AI · Mar 176/10
🧠Researchers introduce VTC-Bench, a comprehensive benchmark for evaluating multimodal AI models' ability to use visual tools for complex tasks. The benchmark reveals significant limitations in current models, with leading model Gemini-3.0-Pro achieving only 51% accuracy on multi-tool visual reasoning tasks.
🧠 Gemini
AIBullisharXiv – CS AI · Mar 176/10
🧠Researchers propose ES-Merging, a new framework for combining specialized biological multimodal large language models (MLLMs) by using embedding space signals rather than traditional parameter-based methods. The approach estimates merging coefficients at both layer-wise and element-wise granularities, outperforming existing merging techniques and even task-specific fine-tuned models on cross-modal scientific problems.
AIBullisharXiv – CS AI · Mar 166/10
🧠Researchers developed UNIFIER, a continual learning framework for multimodal large language models (MLLMs) to adapt to changing visual scenarios without catastrophic forgetting. The framework addresses visual discrepancies across different environments like high-altitude, underwater, low-altitude, and indoor scenarios, showing significant improvements over existing methods.
🏢 Hugging Face
AINeutralarXiv – CS AI · Mar 116/10
🧠Researchers introduce EgoCross, a new benchmark to evaluate multimodal AI models on egocentric video understanding across diverse domains like surgery, extreme sports, and industrial settings. The study reveals that current AI models, including specialized egocentric models, struggle with cross-domain generalization beyond common daily activities.
AINeutralarXiv – CS AI · Mar 45/104
🧠Researchers introduce HSSBench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on Humanities and Social Sciences tasks across multiple languages. The benchmark contains over 13,000 samples and reveals significant challenges for current state-of-the-art models in cross-disciplinary reasoning.
AINeutralarXiv – CS AI · Mar 36/1011
🧠Researchers introduce LifeEval, a new multimodal benchmark designed to evaluate how well AI assistants can help humans in real-time daily life tasks from a first-person perspective. The benchmark reveals significant challenges for current AI models in providing timely and adaptive assistance in dynamic environments.
AINeutralarXiv – CS AI · Mar 36/107
🧠Researchers introduce MC-Search, the first benchmark for evaluating agentic multimodal retrieval-augmented generation (MM-RAG) systems with long, structured reasoning chains. The benchmark reveals systematic issues in current multimodal large language models and introduces Search-Align, a training framework that improves planning and retrieval accuracy.
AIBullisharXiv – CS AI · Mar 36/108
🧠FlowPortrait is a new reinforcement learning framework that uses Multimodal Large Language Models for evaluation to generate more realistic talking-head videos with better lip synchronization. The system combines human-aligned assessment with policy optimization techniques to address persistent issues in audio-driven portrait animation.
AIBullisharXiv – CS AI · Mar 37/107
🧠Researchers developed EmbedLens, a tool to analyze how multimodal large language models process visual information, finding that only 60% of visual tokens carry meaningful image-specific information. The study reveals significant inefficiencies in current MLLM architectures and proposes optimizations through selective token pruning and mid-layer injection.
AIBearisharXiv – CS AI · Mar 37/108
🧠Researchers have developed MIDAS, a new jailbreaking framework that successfully bypasses safety mechanisms in Multimodal Large Language Models by dispersing harmful content across multiple images. The technique achieved an 81.46% average attack success rate against four closed-source MLLMs by extending reasoning chains and reducing security attention.
$LINK
AIBullisharXiv – CS AI · Mar 36/102
🧠Researchers introduce SemHiTok, a unified image tokenizer that uses semantic-guided hierarchical codebooks to balance multimodal understanding and generation tasks. The system decouples semantic and pixel features through a novel architecture that builds pixel sub-codebooks on pretrained semantic codebooks, achieving superior performance in both image reconstruction and multimodal understanding.
AIBullisharXiv – CS AI · Mar 36/104
🧠Researchers propose MOON, the first generative multimodal large language model designed specifically for e-commerce product understanding. The model addresses key challenges in product representation learning through guided Mixture-of-Experts modules and semantic region detection, while introducing a new benchmark dataset for evaluation.
AINeutralarXiv – CS AI · Mar 35/103
🧠Researchers introduce C³B (Comics Cross-Cultural Benchmark), a new benchmark to test cultural awareness capabilities in Multimodal Large Language Models using over 2000 comic images and 18000 QA pairs. Testing revealed significant performance gaps between current MLLMs and human performance, highlighting the need for improved cultural understanding in AI systems.
AINeutralarXiv – CS AI · Mar 36/104
🧠Researchers introduce Vision-DeepResearch Benchmark (VDR-Bench) with 2,000 VQA instances to better evaluate multimodal AI systems' visual and textual search capabilities. The benchmark addresses limitations in existing evaluations where answers could be inferred without proper visual search, and proposes a multi-round cropped-search workflow to improve model performance.
$NEAR
AIBullisharXiv – CS AI · Mar 26/1019
🧠Researchers have developed EMO-R3, a new framework that enhances emotional reasoning capabilities in Multimodal Large Language Models through reflective reinforcement learning. The approach introduces structured emotional thinking and reflective rewards to improve interpretability and emotional intelligence in visual understanding tasks.
AIBullisharXiv – CS AI · Mar 26/1010
🧠Researchers introduce UMPIRE, a new training-free framework for quantifying uncertainty in Multimodal Large Language Models (MLLMs) across various input and output modalities. The system measures incoherence-adjusted semantic volume of model responses to better detect errors and improve reliability without requiring external tools or additional computational overhead.
AIBullisharXiv – CS AI · Feb 276/105
🧠Researchers introduce AOT (Adversarial Opponent Training), a self-play framework that improves Multimodal Large Language Models' robustness by having an AI attacker generate adversarial image manipulations to train a defender model. The method addresses perceptual fragility in MLLMs when processing visually complex scenes, reducing hallucinations through dynamic adversarial training.
AINeutralarXiv – CS AI · Mar 54/10
🧠Researchers evaluated five Multimodal Large Language Models (MLLMs) on their ability to reason about social norms in both text and image scenarios. GPT-4o performed best overall, while all models showed superior performance with text-based norm reasoning compared to image-based scenarios.
🧠 GPT-4
AINeutralarXiv – CS AI · Mar 34/105
🧠Researchers developed TAR-FAS, a new AI framework that uses external visual tools to improve face anti-spoofing detection across different domains. The system employs a Chain-of-Thought approach with visual tools to detect subtle spoofing patterns that traditional methods miss, achieving state-of-the-art performance.