AINeutralarXiv – CS AI · 9h ago5/10
🧠
To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection
Researchers propose a query-adaptive audio-visual person retrieval system that intelligently detects which modalities (voice or face) are actually present in broadcast video archives, avoiding noise from absent modalities. By analyzing cross-modal score consistency, the system achieves 94.2% precision on BBC Rewind's 12,000+ videos, significantly outperforming both unimodal and fixed fusion approaches.