🧠 AI | ⚪ Neutral | Importance 6/10

Conan-embedding-v3: Fusing Modality-Specific Models for Omni-Modal Embedding

arXiv – CS AI | Shiyu Li, Zhiyuan Hu, Yifan Wang, Peiming Li, Zheng Wei, Yang Tang | June 9, 2026 at 04:00 AM

🤖AI Summary

Researchers introduce Conan-embedding-v3, a framework that builds a unified embedding space across multiple data modalities (text, image, video, audio, documents) by training specialized models independently and fusing them into a single backbone. The work identifies and addresses a failure mode the authors call 'Projector Drift', which degrades audio retrieval performance when external encoders are attached to the fused backbone.

Analysis

Conan-embedding-v3 addresses a fundamental challenge in machine learning: creating universal embedding spaces that handle radically different input types without sacrificing performance. Single-model approaches struggle because text, images, video, and audio have distinct statistical properties and call for different training strategies. The researchers' decoupled-fuse-recover framework sidesteps this by letting each modality develop specialized representations before merging them, which keeps the optimization efficiency of focused training while still enabling cross-modal retrieval.
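
This summary does not say how the specialists are fused. As a purely illustrative sketch, assume the specialists share one backbone architecture and are merged by weighted parameter averaging, a common model-merging technique that may differ from the authors' actual method. The names `fuse_backbones`, `specialists`, and `mix`, and all the numbers, are hypothetical.

```python
from typing import Dict, List

# Parameter name -> flattened parameter values (a toy stand-in for tensors).
Weights = Dict[str, List[float]]


def fuse_backbones(specialists: Dict[str, Weights], mix: Dict[str, float]) -> Weights:
    """Weighted average of same-shaped parameter sets.

    ASSUMPTION: weight averaging is used here only for illustration; the
    source does not state which fusion rule Conan-embedding-v3 uses.
    """
    total = sum(mix.values())
    if total <= 0:
        raise ValueError("mixing coefficients must sum to a positive value")

    names = list(specialists)
    reference = specialists[names[0]]
    fused: Weights = {}
    for param, ref_values in reference.items():
        acc = [0.0] * len(ref_values)
        for name in names:
            values = specialists[name][param]
            if len(values) != len(ref_values):
                raise ValueError(f"shape mismatch for {param!r} in {name!r}")
            share = mix[name] / total  # normalized contribution of this specialist
            for i, value in enumerate(values):
                acc[i] += share * value
        fused[param] = acc
    return fused


# Two hypothetical specialists trained separately from the same architecture.
specialists = {
    "text": {"layer0.weight": [1.0, 3.0]},
    "image": {"layer0.weight": [3.0, 5.0]},
}
fused = fuse_backbones(specialists, mix={"text": 1.0, "image": 1.0})
```

With equal mixing coefficients, each fused parameter is the plain mean of the specialists' values. Uneven coefficients let one modality dominate the merged backbone.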

The discovery of Projector Drift represents an important technical insight for the broader AI community. When audio modules connect through external encoders and projection layers, simply fusing the backbone leaves these components calibrated to outdated feature representations, creating a structural mismatch. This failure mode likely applies beyond audio to any projector-based auxiliary modality, which makes the recovery procedure (targeted fine-tuning of the affected components while the backbone stays frozen) a reusable pattern for future multi-modal systems.
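
The mismatch and its recovery can be illustrated with a deliberately tiny scalar model. This is a toy sketch, not the paper's procedure: the "backbone" and "projector" are single hypothetical weights, the data are made up, and recovery is plain gradient descent on the projector while the backbone value is held fixed.

```python
from typing import List, Tuple

Pair = Tuple[float, float]  # (audio encoder feature, target embedding)


def embed(x: float, projector: float, backbone: float) -> float:
    """Toy pipeline: encoder feature -> projector -> backbone."""
    return backbone * projector * x


def loss(pairs: List[Pair], projector: float, backbone: float) -> float:
    """Mean squared error between produced and target embeddings."""
    return sum((embed(x, projector, backbone) - t) ** 2 for x, t in pairs) / len(pairs)


def recover_projector(pairs: List[Pair], projector: float, backbone: float,
                      lr: float = 0.05, steps: int = 100) -> float:
    """Fine-tune only the projector; the backbone is frozen (never updated)."""
    for _ in range(steps):
        grad = sum(
            2.0 * (embed(x, projector, backbone) - t) * backbone * x
            for x, t in pairs
        ) / len(pairs)
        projector -= lr * grad
    return projector


# Hypothetical data: the target embedding equals the input feature.
pairs = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]

OLD_BACKBONE = 2.0     # backbone the projector was originally trained against
FUSED_BACKBONE = 1.25  # backbone after fusion (value chosen arbitrarily)
projector_before = 0.5  # calibrated to OLD_BACKBONE: 2.0 * 0.5 == 1.0

loss_aligned = loss(pairs, projector_before, OLD_BACKBONE)
loss_drifted = loss(pairs, projector_before, FUSED_BACKBONE)  # stale projector
projector_after = recover_projector(pairs, projector_before, FUSED_BACKBONE)
loss_recovered = loss(pairs, projector_after, FUSED_BACKBONE)
```

In this toy, the stale projector produces a non-zero error as soon as the backbone changes, and updating the projector alone moves it toward `1 / FUSED_BACKBONE` and brings the error back down without touching the fused backbone.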

The benchmark results suggest practical viability: a score of 74.9 on MMEB and 55.61 on the 30-task MAEB audio suite indicate that the approach holds up across different evaluation settings. This matters for developers building retrieval systems, because a unified embedding model can reduce infrastructure complexity and inference cost compared with maintaining separate models. Organizations can deploy a single inference engine rather than orchestrating several specialized systems, which helps latency-sensitive applications such as search and recommendation.
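
To make the deployment point concrete, here is a minimal sketch of retrieval over one shared index, assuming every item has already been embedded into the same vector space regardless of modality. The vectors, item names, and the `search` helper are invented for illustration and do not come from the model.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query: List[float], index: Dict[str, List[float]], top_k: int = 2) -> List[str]:
    """Rank every item in the shared index by similarity to the query."""
    scored = [(cosine(query, vec), item_id) for item_id, vec in index.items()]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:top_k]]


# Hypothetical embeddings: one index holds items from several modalities.
index = {
    "image:cat.jpg": [0.9, 0.1, 0.0],
    "audio:meow.wav": [0.8, 0.2, 0.1],
    "doc:invoice.pdf": [0.0, 0.1, 0.9],
}

# A made-up embedding for a text query such as "a cat".
text_query = [1.0, 0.1, 0.0]
results = search(text_query, index)
```

Because all items live in one space, a single nearest-neighbor lookup can return an image and an audio clip for the same text query, with no per-modality routing step.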

Key Takeaways

→Conan-embedding-v3 enables text, image, video, audio, and document retrieval in a single embedding backbone through modality-specialist fusion
→Projector Drift: when backbones are fused, externally attached encoders and projectors stay calibrated to pre-fusion features, so a targeted fine-tuning recovery step is required
→The decoupled training approach maintains individual modality optimization quality while achieving unified cross-modal retrieval
→Benchmark performance (74.9 MMEB, 55.61 MAEB) validates the framework's ability to handle diverse multi-modal evaluation tasks
→This architecture pattern reduces deployment complexity by eliminating the need to maintain multiple specialized models in production systems