AIBullisharXiv – CS AI · 3d ago7/10
🧠Researchers have mapped how Audio-Visual Large Language Models (AVLLMs) process and integrate audio and visual information internally, revealing distinct information flow patterns depending on input configuration. The study demonstrates that multimodal tokens can be pruned after information transfer with minimal performance impact, enabling more efficient inference across different model scales.
AIBullisharXiv – CS AI · 5d ago7/10
🧠Researchers introduce SPpruner, a new vision-language model optimization technique that reduces computational costs by intelligently filtering visual tokens while maintaining accuracy. The method achieves up to 2.53x speedup with minimal performance loss by prioritizing semantically relevant subjects and their contextual relationships, addressing a major bottleneck in VLM inference.
AIBullisharXiv – CS AI · May 297/10
🧠Researchers introduce OccamToken, a training-free method for compressing vision-language models by pruning unnecessary visual tokens while maintaining accuracy. The approach reduces visual token sequences by 98.6% (from 2,880 to 40 tokens) on LLaVA-NeXT while preserving over 93% accuracy, addressing computational bottlenecks in VLM inference.
AIBullisharXiv – CS AI · May 277/10
🧠Researchers propose VLA-Pruner, a novel token pruning method that accelerates Vision-Language-Action models for embodied AI by addressing the mismatch between semantic and action-critical visual processing. The method achieves up to 1.99x speedup while maintaining manipulation performance by considering both semantic context and temporal action relevance, unlike existing VLM pruning approaches.
AIBullisharXiv – CS AI · Apr 147/10
🧠SVD-Prune introduces a training-free token pruning method for Vision-Language Models using Singular Value Decomposition to reduce computational overhead. The approach maintains model performance while drastically reducing vision tokens to 16-32, addressing efficiency challenges in multimodal AI systems without requiring retraining.
AIBullisharXiv – CS AI · Mar 56/10
🧠Researchers developed EvoPrune, a new method that prunes visual tokens during the encoding stage of Multimodal Large Language Models (MLLMs) rather than after encoding. The technique achieves 2x inference speedup with less than 1% performance loss on video datasets, addressing efficiency bottlenecks in AI models processing high-resolution images and videos.
AIBullisharXiv – CS AI · Mar 56/10
🧠Researchers have developed a lightweight token pruning framework that reduces computational costs for vision-language models in document understanding tasks by filtering out non-informative background regions before processing. The approach uses a binary patch-level classifier and max-pooling refinement to maintain accuracy while substantially lowering compute demands.
AINeutralarXiv – CS AI · 2d ago6/10
🧠Researchers introduce AVIS, a lightweight adaptive policy that optimizes inference efficiency in Vision-Language Models by jointly scaling visual context and reasoning computation. The method uses token pruning and difficulty prediction to reduce computational costs while maintaining or improving accuracy across image and video reasoning tasks.
AIBullisharXiv – CS AI · May 286/10
🧠Researchers introduce OC-VTP, a lightweight vision token pruning method for Vision Language Models that reduces computational overhead by selectively retaining the most representative visual tokens without requiring model fine-tuning. The approach maintains inference accuracy across all pruning ratios while providing computational efficiency gains and interpretability benefits.
AIBullisharXiv – CS AI · May 126/10
🧠Researchers introduce COAST, a novel pruning framework for vision-language models that reduces visual tokens by 77.8% while maintaining 98.64% performance and achieving 2.15x speedup. Unlike existing methods that discard low-attention tokens, COAST uses adaptive semantic routing to preserve contextually essential information, preventing 'Visual Aphasia'—a failure mode where models lose visual grounding.
AIBullisharXiv – CS AI · Apr 156/10
🧠Researchers introduce CLASP, a token reduction framework that optimizes Multimodal Large Language Models by intelligently pruning visual tokens through class-adaptive layer fusion and dual-stage pruning. The approach addresses computational inefficiency in MLLMs while maintaining performance across diverse benchmarks and architectures.
AIBullisharXiv – CS AI · Apr 66/10
🧠Researchers have developed Efficient3D, a framework that accelerates 3D Multimodal Large Language Models (MLLMs) while maintaining accuracy through adaptive token pruning. The system uses a Debiased Visual Token Importance Estimator and Adaptive Token Rebalancing to reduce computational overhead without sacrificing performance, showing +2.57% CIDEr improvement on benchmarks.
AIBullisharXiv – CS AI · Apr 66/10
🧠Researchers developed QAPruner, a new framework that simultaneously optimizes vision token pruning and post-training quantization for Multimodal Large Language Models (MLLMs). The method addresses the problem where traditional token pruning can discard important activation outliers needed for quantization stability, achieving 2.24% accuracy improvement over baselines while retaining only 12.5% of visual tokens.
AIBullisharXiv – CS AI · Mar 96/10
🧠Researchers developed E-AdaPrune, an energy-driven adaptive pruning framework that optimizes Vision-Language Models by dynamically allocating visual tokens based on image information density. The method shows up to 0.6% average improvement across benchmarks, with a notable 5.1% boost on reasoning tasks, while adding only 8ms latency per image.
AIBullisharXiv – CS AI · Mar 37/107
🧠Researchers developed EmbedLens, a tool to analyze how multimodal large language models process visual information, finding that only 60% of visual tokens carry meaningful image-specific information. The study reveals significant inefficiencies in current MLLM architectures and proposes optimizations through selective token pruning and mid-layer injection.