Mitigating Cross-Image Information Leakage in Multi-Image Understanding with Large Vision-Language Models
Researchers introduce FOCUS, a training-free method that improves Large Vision-Language Models' ability to process multiple images by masking irrelevant images with noise, preventing visual information from different images from becoming entangled in the model's representations.
Large Vision-Language Models (LVLMs) demonstrate strong capabilities on single-image tasks but suffer significant performance degradation when processing multiple images at once. Researchers have identified a phenomenon called cross-image information leakage, in which visual elements from different images become entangled in the model's internal representations, leading to confused outputs and reduced accuracy. The finding helps explain a limitation that has constrained practical applications of LVLMs in multi-image scenarios.
The FOCUS method represents an elegant solution to this problem without requiring model retraining or architectural changes. By masking all but one image with random noise during inference, the approach forces the model to concentrate on individual images sequentially. The logits generated from each masked context are then aggregated and refined using a noise-only reference input that suppresses leakage artifacts. This technique demonstrates consistent improvements across diverse multi-image benchmarks and extends to video understanding, suggesting broad applicability to temporal visual data.
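The inference procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `logit_fn` is a hypothetical stand-in for a model call that returns next-token logits, images are toy lists of floats, per-image logits are combined with a plain mean, and the noise-only reference is subtracted with a hypothetical weight `alpha`. The actual aggregation and refinement rules used by FOCUS may differ.

```python
import random
from typing import Callable, List, Sequence

Image = List[float]  # toy stand-in for an image: flattened pixel values
LogitFn = Callable[[Sequence[Image], str], List[float]]  # hypothetical model call


def noise_like(image: Image, rng: random.Random) -> Image:
    """Return Gaussian noise with the same length as the given toy image."""
    return [rng.gauss(0.0, 1.0) for _ in image]


def focus_logits(
    images: Sequence[Image],
    question: str,
    logit_fn: LogitFn,
    alpha: float = 1.0,  # assumed weight for the noise-only reference
    seed: int = 0,
) -> List[float]:
    """Sketch of masked per-image inference with a noise-only reference."""
    rng = random.Random(seed)
    noise = [noise_like(img, rng) for img in images]
    n = len(images)

    # One pass per image: keep image i, replace every other image with noise.
    per_image = []
    for i in range(n):
        masked = [images[j] if j == i else noise[j] for j in range(n)]
        per_image.append(logit_fn(masked, question))

    # Assumed aggregation: element-wise mean over the per-image logits.
    vocab = len(per_image[0])
    aggregated = [sum(logits[v] for logits in per_image) / n for v in range(vocab)]

    # Reference pass in which every image is noise; subtracting it is one
    # simple way to discount what the model predicts without visual evidence.
    reference = logit_fn(noise, question)
    return [a - alpha * r for a, r in zip(aggregated, reference)]
```

In this sketch, a query over n images calls `logit_fn` n + 1 times: once per masked context and once for the noise-only reference.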
For the AI development community, this work has significant implications. Because the method is training-free, it can be applied to existing deployed models without retraining or architectural modification, which lowers implementation barriers. It is not free at inference time, however: running a separate masked context for each image, plus a noise-only reference, implies more forward passes than a single joint pass over all images. The ability to handle multi-image inputs more effectively opens pathways for improved visual reasoning systems, with applications in document analysis, comparative image understanding, and sequential visual reasoning tasks.
Looking forward, researchers should investigate whether FOCUS principles apply to other multi-modal challenges and whether the method's effectiveness scales to models handling increasingly complex visual scenarios. Understanding whether similar leakage occurs in other architectural designs and exploring more sophisticated aggregation strategies could further enhance performance.
- Cross-image information leakage causes Large Vision-Language Models to confuse visual elements across multiple inputs, degrading multi-image understanding performance.
- FOCUS uses masking and noise-based refinement to isolate individual images during inference without requiring model retraining or architecture changes.
- The method consistently improves performance on multi-image benchmarks and generalizes to video understanding tasks.
- Training-free solutions can be deployed immediately on existing models across various applications, with no retraining cost, although per-image masked inference adds forward passes at inference time.
- The technique reveals fundamental limitations in how current LVLMs process sequential or parallel visual inputs.