Attention-Spectrum Regularization for Replay-Free Continual Multimodal LLMs
Researchers propose Attention-Spectrum Regularization (ASR), a continual learning framework for multimodal large language models (MLLMs) that mitigates catastrophic forgetting when a model adapts to new visual domains and tasks without replaying past data. ASR preserves cross-modal attention patterns by storing compact spectral statistics rather than actual training examples, and the authors report improved performance on vision-language benchmarks.
This research addresses a fundamental challenge in deploying multimodal AI systems: they must keep adapting to new tasks without degrading previously learned capabilities. Traditional continual learning approaches either replay past data, which is memory-intensive and raises privacy concerns, or rely on regularization techniques that offer limited control over attention pattern drift. ASR instead treats cross-attention maps as spectral signals and compresses them into lightweight prototype distributions that capture the essential structure supporting old skills.
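To make the idea of compressing attention maps into spectral statistics concrete, here is a minimal Python sketch. It is not the authors' implementation: the radial binning, the bin count, the function names (`power_spectrum`, `radial_profile`, `build_prototype`), and the use of a per-band mean and variance as the "prototype distribution" are all assumptions made for illustration, and random softmax maps stand in for real cross-attention.

```python
import numpy as np

def power_spectrum(attn_map):
    """Normalized 2D Fourier power spectrum of one attention map (H x W)."""
    spec = np.abs(np.fft.fft2(attn_map)) ** 2
    return spec / spec.sum()

def radial_profile(spec, n_bins=8):
    """Sum the power inside concentric frequency bands (hypothetical compression)."""
    h, w = spec.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)
    edges = np.linspace(0.0, radius.max() + 1e-9, n_bins + 1)
    band = np.digitize(radius, edges) - 1  # band index in 0..n_bins-1
    return np.array([spec[band == b].sum() for b in range(n_bins)])

def build_prototype(attn_maps, n_bins=8):
    """Per-band mean and variance over a set of maps; only these are stored."""
    profiles = np.stack(
        [radial_profile(power_spectrum(a), n_bins) for a in attn_maps]
    )
    return profiles.mean(axis=0), profiles.var(axis=0)

# Stand-in data: 32 random softmax-normalized 16x16 "attention maps".
rng = np.random.default_rng(0)
logits = rng.normal(size=(32, 16, 16))
maps = np.exp(logits)
maps /= maps.sum(axis=(1, 2), keepdims=True)

mean, var = build_prototype(maps)
```

Under these assumptions, 32 maps of 16x16 values are summarized by two 8-element vectors, and no image or question needs to be retained.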
The technical innovation lies in using Fourier analysis to maintain skill-level attention coherence while permitting instance-specific adaptation. This balances plasticity (the ability to learn new tasks) with stability of previously acquired knowledge. The theoretical analysis shows that spectral drift controls forgetting under spectral sufficiency assumptions, and that Fourier power spectra are robust to spatial transformations and bounded perturbations.
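The robustness claim can be illustrated with a standard property of the discrete Fourier transform: a circular shift changes only the phase of each coefficient, so the power spectrum is unchanged. The sketch below checks this numerically and contrasts it with a map whose structure has actually changed. The `spectral_drift` penalty (squared L2 distance between power spectra) is a hypothetical stand-in, not the loss defined in the paper, and the example covers only circular translation rather than the full class of transformations the authors analyze.

```python
import numpy as np

def power_spectrum(attn_map):
    """2D Fourier power spectrum of an attention map (H x W)."""
    return np.abs(np.fft.fft2(attn_map)) ** 2

def spectral_drift(spec_new, spec_old):
    """Squared L2 distance between two power spectra (hypothetical penalty)."""
    return float(np.sum((spec_new - spec_old) ** 2))

# Stand-in attention map: random, non-negative, normalized to sum to 1.
rng = np.random.default_rng(0)
attn = rng.random((16, 16))
attn /= attn.sum()

# Same pattern, circularly translated: the spectrum should not move.
shifted = np.roll(attn, shift=(3, 5), axis=(0, 1))

# Different structure: attention spread uniformly over all positions.
uniform = np.full((16, 16), 1.0 / 256)

reference = power_spectrum(attn)
drift_shift = spectral_drift(power_spectrum(shifted), reference)
drift_uniform = spectral_drift(power_spectrum(uniform), reference)

print(f"drift after circular shift:  {drift_shift:.3e}")
print(f"drift after structure change: {drift_uniform:.3e}")
```

The drift for the shifted map is zero up to floating-point error, while the uniform map yields a clearly nonzero value. A penalty of this kind would therefore leave instance-level repositioning of attention free while penalizing changes in overall attention structure.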
For the AI industry, this work has practical implications for deploying MLLMs in non-stationary environments, where models encounter diverse visual domains, question types, and user instructions sequentially. The replay-free property addresses privacy and computational efficiency concerns, making continual learning more feasible for real-world applications. Benchmarking across VQA and instruction-tuning datasets supports the approach's effectiveness relative to replay-based, regularization-based, and adapter-based baselines.
Looking forward, this research opens pathways for more efficient lifelong learning in multimodal systems. The lightweight storage requirements and theoretical grounding suggest potential applications in edge deployment and resource-constrained scenarios. Future work may explore extensions to other modalities and integration with larger-scale foundation models.
- ASR preserves cross-modal attention structure through spectral statistics rather than data replay, enabling memory-efficient continual learning
- Fourier-based regularization prevents harmful attention drift while allowing task-specific adaptation and instance-level flexibility
- Theoretical analysis proves spectral drift controls forgetting under spectral sufficiency assumptions with robustness guarantees
- Comprehensive experiments show consistent improvements over replay-based, regularization-based, and adapter-based baselines across multiple benchmarks
- Privacy-preserving approach eliminates need for storing or replaying past image-question pairs or pseudo-examples