
Respecting Modality Gap in Post-hoc Out-of-distribution Detection with Pre-trained Vision-Language Models

arXiv – CS AI | Yuanwei Hu, Bo Peng, Yadan Luo, Zhen Fang, Ling Chen, Jie Lu
🤖AI Summary

Researchers challenge the standard approach of using text embeddings as class prototypes in out-of-distribution detection with vision-language models, demonstrating a fundamental misalignment between text and visual feature spaces. They propose an online pseudo-supervised framework that learns visual prototypes directly from unlabeled test data, achieving state-of-the-art OOD detection performance.

Analysis

This research examines how pre-trained vision-language models are used for post-hoc out-of-distribution (OOD) detection. Existing methods use text embeddings of class names as class prototypes. The paper argues that this approach inherently suffers from a modality gap: the visual and textual feature spaces are fundamentally misaligned, and prompt engineering alone cannot resolve this structural limitation.
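To make the text-prototype baseline concrete, here is a minimal sketch of prototype-based OOD scoring. The summary does not name a scoring function, so this assumes a common choice: the maximum softmax probability over temperature-scaled cosine similarities. The function names, the temperature value, and the 3-dimensional toy vectors are all hypothetical stand-ins for real encoder outputs.

```python
import math

def l2_normalize(v):
    # Assumes v is not the zero vector.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))

def prototype_score(image_feat, prototypes, temperature=0.1):
    """Max softmax probability over cosine similarities to class prototypes.

    Higher values mean the feature looks more in-distribution.
    """
    logits = [cosine(image_feat, p) / temperature for p in prototypes]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    return max(exps) / sum(exps)

# Toy stand-ins for text embeddings of two class names (hypothetical values).
text_prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

near_class_0 = [0.9, 0.1, 0.0]  # close to the first prototype
ambiguous = [0.5, 0.5, 0.7]     # equally far from both prototypes

print(prototype_score(near_class_0, text_prototypes))
print(prototype_score(ambiguous, text_prototypes))
```

In this toy setup an input is flagged as OOD when its score falls below a chosen threshold. If the prototypes sit in the wrong place relative to the image features, which is the concern the modality gap raises, every score computed this way is affected.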

The work builds on the recent success of vision-language models such as CLIP, which enable zero-shot classification without access to in-distribution training data. However, the authors argue that this convenience comes at a cost: text-derived prototypes are suboptimal for visual classification tasks. Their theoretical analysis shows that the gap is intrinsic, which shifts the focus from engineering better prompts to rethinking how prototypes should be constructed.

The proposed solution uses the unlabeled test-time data stream in an online pseudo-supervised learning framework. It learns visual prototypes directly in the visual feature space while respecting the post-hoc constraint, meaning it works without modifying the pre-trained model itself. The method comes with theoretical convergence guarantees.
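The general idea of adapting prototypes from an unlabeled stream can be sketched as follows. The summary does not give the paper's update rule, so this substitutes a simple confidence-gated exponential moving average. It is not the authors' algorithm and carries none of their convergence guarantees. Initializing the prototypes from text embeddings, along with the threshold, momentum, and toy vectors, is an assumption made purely for illustration.

```python
import math

def normalize(v):
    # Assumes v is not the zero vector.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def softmax_probs(feat, prototypes, temperature):
    feat = normalize(feat)
    logits = [
        sum(a * b for a, b in zip(feat, normalize(p))) / temperature
        for p in prototypes
    ]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def online_update(prototypes, feat, temperature=0.1, threshold=0.9, momentum=0.9):
    """Score one test feature, then move the matching prototype if confident.

    Returns (score, prototypes). The encoder is never touched; only the
    prototype vectors change, which keeps the procedure post-hoc.
    """
    probs = softmax_probs(feat, prototypes, temperature)
    score = max(probs)
    label = probs.index(score)  # pseudo-label: most similar prototype
    if score < threshold:
        return score, prototypes  # ambiguous or likely OOD: no update
    feat = normalize(feat)
    updated = [list(p) for p in prototypes]
    updated[label] = normalize(
        [momentum * p + (1 - momentum) * f for p, f in zip(prototypes[label], feat)]
    )
    return score, updated

# Hypothetical starting prototypes and a short stream of image features.
prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
stream = [[0.8, 0.1, 0.6], [0.7, 0.0, 0.7], [0.5, 0.5, 0.7], [0.9, 0.1, 0.5]]

for feat in stream:
    score, prototypes = online_update(prototypes, feat)
    print(round(score, 3), [round(x, 3) for x in prototypes[0]])
```

In this toy stream the first prototype drifts toward the image features that are confidently assigned to it, while the ambiguous third sample is scored but skipped. Gating updates on confidence is one simple way to limit contamination from OOD samples. The paper's own mechanism may differ.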

For the machine learning community, this work has implications for deploying AI systems in production environments where reliability is paramount. OOD detection prevents models from making confident predictions on unfamiliar inputs, a critical safety feature. The state-of-the-art results across multiple benchmarks suggest this approach generalizes well. The research particularly impacts computer vision applications where unexpected inputs pose significant risks, including autonomous systems and medical imaging.

Key Takeaways
  • Text embeddings create an inherent modality gap with visual features that cannot be fixed through prompt engineering alone
  • Online pseudo-supervised learning can adapt visual prototypes using unlabeled test data while maintaining theoretical convergence guarantees
  • The method achieves state-of-the-art OOD detection without requiring access to in-distribution training data
  • Post-hoc prototype learning preserves pre-trained model integrity while improving detection reliability
  • The approach addresses a critical safety requirement for deploying vision models in production, where inputs may be unreliable or novel