arXiv – CS AI · 9h ago
🧠
Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought
Researchers propose SFFL (Separate First, Fuse Later), a framework that mitigates cross-modal interference in audio-visual LLMs by enforcing a separate, modality-specific reasoning chain for each modality before fusion. The approach uses modality-preference labels and reinforcement learning to reduce hallucinations, yielding 5–11% improvements on benchmarks.
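The "separate first, fuse later" idea can be sketched at a very high level: run an independent chain-of-thought pass per modality, then combine the chains in a fusion step. This is a minimal illustrative sketch only; the function names, prompts, and fusion logic are assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of modality-specific chain-of-thought followed by fusion.
# All names here are illustrative stand-ins, not from the SFFL paper.

def reason_over_modality(modality: str, evidence: str) -> str:
    """Stand-in for a single-modality chain-of-thought pass
    (in practice, an LLM call conditioned on one modality only)."""
    return f"[{modality} CoT] observations from: {evidence}"

def fuse(chains: dict, question: str) -> str:
    """Stand-in for the fusion step that answers the question
    from the already-completed per-modality chains."""
    joined = " | ".join(chains.values())
    return f"Answer to '{question}' based on: {joined}"

# Separate first: each modality reasons in isolation, so one stream
# cannot interfere with the other's intermediate reasoning.
audio_chain = reason_over_modality("audio", "speech transcript, sound events")
visual_chain = reason_over_modality("visual", "frames, detected objects")

# Fuse later: combine the finished chains to produce the final answer.
answer = fuse({"audio": audio_chain, "visual": visual_chain},
              "What is happening in the clip?")
```

The point of the structure is that cross-modal hallucination (e.g. "seeing" something that was only heard) is harder when each chain is grounded in a single modality before any mixing occurs.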