AI · Bullish · arXiv – CS AI · Jun 2 · 7/10
🧠AdaCodec introduces a predictive visual coding approach for video multimodal large language models that adaptively allocates visual tokens based on scene complexity. Rather than encoding each frame independently as an RGB image, the system sends full reference frames only when scenes are unpredictable and uses compact tokens for inter-frame changes, achieving superior performance at one-seventh of the token budget while significantly reducing latency.
AI · Bullish · arXiv – CS AI · May 28 · 7/10
🧠Researchers introduce CIVIC, a framework that optimizes Vision-Language Models by maintaining compact visual token sequences throughout the entire inference pipeline, reducing KV-cache memory to one-third while achieving measurable hardware acceleration without accuracy loss.
AI · Neutral · arXiv – CS AI · Mar 27 · 7/10
🧠Research reveals that sparse autoencoder (SAE) features in vision-language models often fail to compose modularly for reasoning tasks. The study finds that combining task-selective feature sets frequently causes output drift and accuracy degradation, challenging assumptions used in AI model steering methods.
AI · Bullish · arXiv – CS AI · Jun 11 · 6/10
🧠Researchers introduce ASRU, a machine unlearning framework for multimodal large language models that balances removing sensitive information with maintaining generation quality. The approach uses activation steering and reinforcement learning to achieve superior unlearning effectiveness while preserving model utility, demonstrating significant improvements on Qwen3-VL.
AI · Neutral · arXiv – CS AI · Jun 9 · 5/10
🧠Researchers propose a two-stage vision-language framework using Qwen3-VL with LoRA fine-tuning to detect semiconductor lithography defects, then employ a refinement module trained on first-stage failures to improve accuracy beyond standard single-stage approaches.
AI · Neutral · arXiv – CS AI · Jun 5 · 6/10
🧠Researchers introduce ViCuR, a visual-grounded distillation framework that improves multimodal AI reasoning by using recoverable visual cues instead of answer-dependent privileges. The approach achieves consistent performance gains across seven benchmarks with Qwen3-VL models by eliminating train-test mismatches that encourage shortcut learning rather than genuine visual understanding.
AI · Bullish · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce Temporal Token Fusion (TTF), a training-free compression technique that reduces visual tokens in video-language models by 67% while maintaining 99.5% accuracy. The method addresses the critical bottleneck of LLM prefill costs in video understanding by identifying and fusing redundant tokens across video frames using local similarity matching.
AI · Bullish · arXiv – CS AI · May 4 · 6/10
🧠Researchers propose Persistent Visual Memory (PVM), a lightweight module that addresses visual signal degradation in Large Vision-Language Models by maintaining consistent visual perception during long text generation. Integrated into Qwen3-VL models, PVM demonstrates measurable accuracy improvements with minimal computational overhead, particularly benefiting complex reasoning tasks.
AI · Bullish · arXiv – CS AI · Mar 16 · 6/10
🧠Researchers introduce Visual-ERM, a multimodal reward model that improves vision-to-code tasks by evaluating visual equivalence in rendered outputs rather than relying on text-based rules. The system achieves significant performance gains on chart-to-code tasks (+8.4) and shows consistent improvements across table and SVG parsing applications.