Researchers introduce FIGMA, a new multi-view contrastive learning architecture that significantly improves music retrieval based on fine-grained musical attributes like tempo, key, and chord progression. The work addresses a fundamental limitation of existing CLAP-based models, which fail to make use of detailed musical descriptions. It achieves up to 73.3% relative improvement over those baselines and contributes a new 380K music-caption dataset (FGMCaps) to the field.
The FIGMA research identifies and solves a critical problem in audio-text retrieval systems: existing contrastive learning models like CLAP ignore most of the information in detailed prompts, effectively reading only the first few tokens regardless of caption length. This finding reveals a fundamental mismatch between training data richness and actual model utilization, a pattern that likely extends beyond music retrieval to other multimodal domains.
The proposed solution employs a dual-optimization strategy combining global audio-text alignment with frame-level, token-wise alignment. This architectural innovation enables models to simultaneously capture semantic context and specific musical characteristics—a critical capability for applications requiring precision, such as music production tools, DJ software, and music recommendation systems.
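To make the idea of joint global and local alignment concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes mean pooling for the global view, a max-over-frames then mean-over-tokens aggregation for the frame-level, token-wise view (a late-interaction scheme chosen here for illustration), a symmetric InfoNCE loss for both, and a hypothetical `local_weight` parameter to balance them. None of these specifics are stated in the summary above.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(x, axis):
    """Numerically stable log-softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_info_nce(sim, temperature=0.07):
    """Contrastive loss on a (B, B) similarity matrix; matched pairs lie on the diagonal."""
    logits = sim / temperature
    audio_to_text = -np.diag(log_softmax(logits, axis=1)).mean()
    text_to_audio = -np.diag(log_softmax(logits, axis=0)).mean()
    return 0.5 * (audio_to_text + text_to_audio)

def global_similarity(frames, tokens):
    """Global view: mean-pool frames and tokens, then compare whole clips to whole captions."""
    audio = l2_normalize(frames.mean(axis=1))   # (B, D)
    text = l2_normalize(tokens.mean(axis=1))    # (B, D)
    return audio @ text.T                       # (B, B)

def local_similarity(frames, tokens):
    """Local view: each token is matched to its best audio frame (assumed aggregation)."""
    f = l2_normalize(frames)                    # (B, F, D)
    t = l2_normalize(tokens)                    # (B, T, D)
    # pairwise[i, j, f, t]: cosine between frame f of clip i and token t of caption j
    pairwise = np.einsum("ifd,jtd->ijft", f, t)
    return pairwise.max(axis=2).mean(axis=2)    # (B, B)

def combined_loss(frames, tokens, local_weight=1.0):
    """Sum of the global and local contrastive losses; `local_weight` is hypothetical."""
    global_loss = symmetric_info_nce(global_similarity(frames, tokens))
    local_loss = symmetric_info_nce(local_similarity(frames, tokens))
    return global_loss + local_weight * local_loss

# Toy batch: 4 clips with 50 frames each, 4 captions with 12 tokens each, 16-dim embeddings.
rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 50, 16))
tokens = rng.normal(size=(4, 12, 16))
print(combined_loss(frames, tokens))
```

In this sketch the local term gives every caption token its own chance to find supporting audio frames, so a token such as a tempo or key descriptor late in a long caption still contributes to the similarity score rather than being averaged away.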
The FGMCaps dataset, which contains 380K music-caption pairs annotated with structured musical attributes, is a valuable research resource that moves the field toward standardized benchmarking in fine-grained audio understanding. The relative improvement of up to 73.3% over CLAP-based baselines, including on out-of-domain data, suggests the approach generalizes beyond its training distribution.
This work signals growing maturity in multimodal AI, where researchers now optimize for specific use cases rather than generic semantic alignment. For developers building music applications, these advances enable more natural, attribute-specific search interfaces. The methodology—joint optimization of global and local alignment—may prove applicable to other domains requiring detailed attribute understanding, from medical imaging to product search.
- FIGMA uses multi-view contrastive learning to capture both semantic and fine-grained musical attributes, outperforming CLAP-based models by up to 73.3%.
- Existing audio-text models like CLAP utilize only initial tokens from detailed prompts, effectively ignoring most caption information despite being trained on long descriptions.
- The FGMCaps dataset of 380K annotated music-caption pairs provides new infrastructure for fine-grained audio retrieval research and benchmarking.
- Frame-level, token-wise alignment enables models to process specific musical properties like tempo, key, and chord progression alongside semantic understanding.
- Strong out-of-domain performance suggests the approach generalizes beyond training data, improving robustness for real-world music retrieval applications.