AIBullisharXiv – CS AI · 14h ago7/10
🧠
COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings
Researchers introduce COMET, a PLS-SVD framework that analyzes the modality gap in Contrastive Language-Audio Pretraining (CLAP) models by decomposing embeddings into interpretable concepts. The study reveals that only a small subset of shared conceptual axes drives similarity computation, and proposes a training-free spectral truncation method that improves zero-shot audio captioning performance while reducing dimensionality.