Respecting Modality Gap in Post-hoc Out-of-distribution Detection with Pre-trained Vision-Language Models
Researchers challenge the standard approach of using text embeddings as class prototypes in out-of-distribution detection with vision-language models, demonstrating a fundamental misalignment between text and visual feature spaces. They propose an online pseudo-supervised framework that learns visual prototypes directly from unlabeled test data, achieving state-of-the-art OOD detection performance.
This research addresses a weakness in how pre-trained vision-language models are used for out-of-distribution detection. The paper identifies that existing methods rely on text embeddings of class names as prototypes, and that this approach inherently suffers from a modality gap: the visual and textual feature spaces are fundamentally misaligned, and prompt engineering alone cannot resolve this structural limitation.
The advancement builds on the recent success of vision-language models like CLIP in enabling zero-shot learning without access to training data. However, the authors demonstrate that this convenience comes at a cost: text-derived prototypes are suboptimal for visual classification tasks. The theoretical contribution proves that this gap exists intrinsically, shifting focus from engineering better prompts to fundamentally reconceptualizing how prototypes should be constructed.
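To make the text-prototype baseline concrete, the sketch below scores an image feature against class prototypes by taking the maximum softmax probability over cosine similarities. It is only an illustration of the general recipe, not the paper's or any library's implementation: the function names, the `temperature` parameter, and the toy 2-D vectors are assumptions, and real prototypes would come from a model's text encoder. Inputs are assumed to be non-zero vectors.

```python
import math

def l2_normalize(v):
    # Scale a non-zero vector to unit length.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    # Cosine similarity between two non-zero vectors.
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def max_softmax_score(image_feat, prototypes, temperature=1.0):
    # Hypothetical helper: higher score means "looks more in-distribution".
    sims = [cosine(image_feat, p) / temperature for p in prototypes]
    m = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in sims]
    return max(exps) / sum(exps)

# Toy prototypes standing in for text embeddings of two class names.
toy_prototypes = [[1.0, 0.0], [0.0, 1.0]]
aligned = max_softmax_score([1.0, 0.0], toy_prototypes)    # close to class 0
ambiguous = max_softmax_score([1.0, 1.0], toy_prototypes)  # equally far from both
```

An input is flagged as out-of-distribution when its score falls below a chosen threshold. If the prototypes sit in a region of the feature space that is offset from where image features actually fall, which is the modality gap the paper describes, these similarities become less informative.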
The proposed solution leverages unlabeled test-time data streams through an online pseudo-supervised learning framework. This approach learns visual prototypes directly in the visual feature space while respecting the post-hoc constraint, meaning it works without modifying the pre-trained model itself. The authors also provide theoretical convergence guarantees for the method.
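One way to picture online pseudo-supervised prototype learning is the sketch below. It is a generic illustration under stated assumptions, not the authors' algorithm: this summary does not give the paper's update rule, loss, or hyperparameters, so the exponential-moving-average update, the confidence threshold, the initialization, and the class name `OnlinePrototypes` are all hypothetical. The pre-trained encoder is never touched; only the prototype vectors change.

```python
import math

def l2_normalize(v):
    # Scale a non-zero vector to unit length.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

class OnlinePrototypes:
    """Hypothetical EMA-based prototype tracker (assumption, not the paper's rule)."""

    def __init__(self, init_prototypes, momentum=0.9, threshold=0.7, temperature=0.1):
        self.prototypes = [l2_normalize(p) for p in init_prototypes]
        self.momentum = momentum
        self.threshold = threshold
        self.temperature = temperature

    def score(self, feat):
        # Return (pseudo-label, softmax confidence) for one image feature.
        f = l2_normalize(feat)
        sims = [dot(f, p) / self.temperature for p in self.prototypes]
        m = max(sims)
        exps = [math.exp(s - m) for s in sims]
        total = sum(exps)
        probs = [e / total for e in exps]
        k = max(range(len(probs)), key=probs.__getitem__)
        return k, probs[k]

    def observe(self, feat):
        # Update a prototype only when the pseudo-label is confident enough.
        k, conf = self.score(feat)
        if conf >= self.threshold:
            f = l2_normalize(feat)
            mixed = [self.momentum * p + (1.0 - self.momentum) * x
                     for p, x in zip(self.prototypes[k], f)]
            self.prototypes[k] = l2_normalize(mixed)
        return k, conf

# Toy stream: one confident sample, one ambiguous sample.
tracker = OnlinePrototypes([[1.0, 0.0], [0.0, 1.0]])
label, confidence = tracker.observe([0.8, 0.6])

fresh = OnlinePrototypes([[1.0, 0.0], [0.0, 1.0]])
before = [list(p) for p in fresh.prototypes]
fresh.observe([1.0, 1.0])
```

In this toy run the confident sample pulls prototype 0 toward the observed image feature, while the ambiguous sample is skipped. Low-confidence samples are the ones most likely to be out-of-distribution, so excluding them keeps the prototypes from drifting toward unfamiliar inputs.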
For the machine learning community, this work has implications for deploying AI systems in production environments where reliability is paramount. OOD detection prevents models from making confident predictions on unfamiliar inputs, a critical safety feature. The state-of-the-art results across multiple benchmarks suggest this approach generalizes well. The research particularly impacts computer vision applications where unexpected inputs pose significant risks, including autonomous systems and medical imaging.
- Text embeddings create an inherent modality gap with visual features that cannot be fixed through prompt engineering alone
- Online pseudo-supervised learning can adapt visual prototypes using unlabeled test data while maintaining theoretical convergence guarantees
- The method achieves state-of-the-art OOD detection without requiring access to in-distribution training data
- Post-hoc prototype learning preserves pre-trained model integrity while improving detection reliability
- This addresses a critical safety requirement for deploying vision models in production with unreliable or novel inputs