#visual-tokens News & Analysis

5 articles tagged with #visual-tokens. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · May 7 · 7/10

RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction

RetentiveKV introduces an entropy-driven optimization method for multimodal large language models that achieves 5x KV cache compression and 1.5x decoding acceleration by reformulating token eviction as continuous memory evolution rather than discrete pruning. The approach addresses limitations of existing compression methods by accounting for visual tokens that gain importance later in decoding and preserving spatial continuity of visual information.

AI · Bullish · arXiv – CS AI · Jun 11 · 6/10

MultiToP: Learning to Patch Visual Tokens to Mitigate Hallucinations in Video Large Multimodal Models

Researchers introduce MultiToP, a framework that reduces hallucinations in video language models by selectively replacing unreliable visual tokens before text generation. The method achieves a 50.60% improvement in F1 score on hallucination benchmarks while maintaining general video understanding performance, showing that targeted token refinement can improve multimodal AI reliability without modifying the base models.

AI · Bullish · arXiv – CS AI · May 11 · 6/10

TTF: Temporal Token Fusion for Efficient Video-Language Model

Researchers introduce Temporal Token Fusion (TTF), a training-free compression technique that reduces visual tokens in video-language models by 67% while maintaining 99.5% accuracy. The method addresses the critical bottleneck of LLM prefill costs in video understanding by identifying and fusing redundant tokens across video frames using local similarity matching.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 7

What Do Visual Tokens Really Encode? Uncovering Sparsity and Redundancy in Multimodal Large Language Models

Researchers developed EmbedLens, a tool to analyze how multimodal large language models process visual information, finding that only 60% of visual tokens carry meaningful image-specific information. The study reveals significant inefficiencies in current MLLM architectures and proposes optimizations through selective token pruning and mid-layer injection.

AI · Neutral · arXiv – CS AI · Mar 4 · 4/10 · 2

Hot-Start from Pixels: Low-Resolution Visual Tokens for Chinese Language Modeling

Researchers developed an approach to Chinese language modeling that uses low-resolution images of characters instead of traditional text tokens. The method achieved accuracy (39.2%) comparable to index-based models while learning faster initially, demonstrating that visual structure can effectively represent logographic scripts.