AIBullisharXiv โ CS AI ยท 6h ago1
๐ง
What Do Visual Tokens Really Encode? Uncovering Sparsity and Redundancy in Multimodal Large Language Models
Researchers developed EmbedLens, a tool to analyze how multimodal large language models process visual information, finding that only 60% of visual tokens carry meaningful image-specific information. The study reveals significant inefficiencies in current MLLM architectures and proposes optimizations through selective token pruning and mid-layer injection.