Never Seen Before: Benchmarking Genuine Zero-Shot Composed Image Retrieval with Consistent Video-Sourced Datasets
Researchers introduce ZeroSight, a new benchmark for Zero-Shot Composed Image Retrieval (ZS-CIR) that addresses critical flaws in existing datasets by using video-sourced data published after CLIP's training cutoff. They also propose SC4CIR, a training-free method, and their benchmark results indicate that current ZS-CIR performance metrics significantly overestimate actual model capabilities.
The ZeroSight benchmark addresses fundamental problems in how zero-shot composed image retrieval systems are evaluated and trained. Current datasets suffer from two critical issues: noisy source images create irrelevant reference-target pairs, and models are evaluated on data they've already encountered during pre-training, creating an illusion of generalization. ZeroSight answers each issue in turn. Sourcing frames from single videos keeps reference-target pairs visually and semantically consistent, and using only content published after March 31, 2022 is meant to ensure genuine zero-shot evaluation that reflects real-world performance expectations.
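The two selection rules can be sketched in a few lines of Python. This is a minimal illustration, assuming each video is a record with a publication date and a list of frame identifiers. The record layout and the names `eligible_videos` and `same_video_pairs` are hypothetical and are not taken from the ZeroSight release. The summary also does not say how frames are sampled or paired, so the sketch simply pairs every two frames of a video.

```python
from datetime import date
from itertools import combinations

# Cutoff date quoted in the summary; only content published after it is kept.
CUTOFF = date(2022, 3, 31)


def eligible_videos(videos, cutoff=CUTOFF):
    """Keep only videos published strictly after the cutoff date."""
    return [video for video in videos if video["published"] > cutoff]


def same_video_pairs(video):
    """Build (reference, target) frame pairs drawn from one video only."""
    return list(combinations(video["frames"], 2))


# Toy records, invented for illustration. Frame ids are prefixed with a video tag.
videos = [
    {"id": "vid_a", "published": date(2021, 11, 5), "frames": ["a_000", "a_120"]},
    {"id": "vid_b", "published": date(2022, 3, 31), "frames": ["b_000", "b_090"]},
    {"id": "vid_c", "published": date(2023, 6, 18), "frames": ["c_000", "c_060", "c_180"]},
]

kept = eligible_videos(videos)
kept_ids = [video["id"] for video in kept]
pairs = [pair for video in kept for pair in same_video_pairs(video)]

print(kept_ids)
print(pairs)
```

In this toy input only `vid_c` survives the filter, because a video published on the cutoff date itself is not "after" it, and every resulting pair stays inside one video. A real composed-retrieval query would also need modification text for each pair, which this summary does not describe.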
The research emerged from growing concerns about overstated AI capabilities in computer vision tasks. When evaluation datasets overlap with training data (a practice common in computer vision benchmarks), models appear more capable than they truly are. This creates problems for downstream applications relying on these systems for production use. The ZeroSight authors tested 27 existing methods and found that current benchmarks inflate performance metrics, suggesting the field has overestimated progress in composed image retrieval.
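The inflation argument can be illustrated with simple arithmetic. The sketch below uses Recall@K, a common retrieval metric chosen here as an assumption, since this summary does not name the metrics ZeroSight reports. The four query records and their `seen_in_pretraining` flags are invented toy values, not results from the paper. The example shows only how an aggregate score can look better than the score on genuinely unseen data.

```python
def recall_at_k(ranking, target, k):
    """Return 1.0 if the target id is among the top-k ranked ids, else 0.0."""
    return 1.0 if target in ranking[:k] else 0.0


def mean_recall_at_k(results, k):
    """Average Recall@K over a list of query results."""
    hits = sum(recall_at_k(r["ranking"], r["target"], k) for r in results)
    return hits / len(results)


# Toy query results, invented for illustration only.
results = [
    {"target": "img_1", "ranking": ["img_1", "img_7"], "seen_in_pretraining": True},
    {"target": "img_2", "ranking": ["img_2", "img_9"], "seen_in_pretraining": True},
    {"target": "img_3", "ranking": ["img_3", "img_4"], "seen_in_pretraining": False},
    {"target": "img_5", "ranking": ["img_6", "img_5"], "seen_in_pretraining": False},
]

# Restrict the evaluation to queries whose data the model has not seen.
unseen_only = [r for r in results if not r["seen_in_pretraining"]]

recall_all = mean_recall_at_k(results, k=1)
recall_unseen = mean_recall_at_k(unseen_only, k=1)

print(recall_all, recall_unseen)
```

With these made-up numbers the mixed set scores 0.75 at K=1 while the unseen-only subset scores 0.5. The gap illustrates the kind of overstatement the paragraph above describes. It is not a measurement of it.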
SC4CIR's practical value lies in its use of symmetric consistency checks to identify hard negative examples without requiring additional training. This plug-and-play approach addresses a real limitation: most current methods struggle with challenging negative cases that are visually similar to targets but don't match the query composition. For developers implementing image retrieval systems, this work signals that published benchmarks may not accurately predict production performance, necessitating more rigorous internal testing.
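This summary does not describe how SC4CIR actually computes its symmetric consistency checks, so the code below is not the authors' method. It swaps in a different, generic technique, a reciprocal nearest-neighbour check, to show what a training-free, plug-and-play consistency filter can look like. A candidate counts as consistent with a query only if the query is also the candidate's best match in the reverse direction. Candidates that score well in the forward direction but fail the reverse check are flagged and moved down. All names, the hand-written 2-D vectors, and the assumption that queries and candidates share one embedding space are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(vector, pool):
    """Return the ids in `pool`, most similar to `vector` first."""
    return sorted(pool, key=lambda item_id: cosine(vector, pool[item_id]), reverse=True)


def reciprocal_check(query_id, queries, candidates, k=1):
    """Re-rank candidates for one query with a forward and a reverse check.

    Forward: rank candidates by similarity to the query.
    Reverse: for each candidate, rank all queries by similarity to it and
    test whether this query is in the candidate's top-k.
    Returns (re-ranked candidate ids, flagged candidate ids).
    """
    forward = rank_by_similarity(queries[query_id], candidates)
    consistent, flagged = [], []
    for candidate_id in forward:
        reverse = rank_by_similarity(candidates[candidate_id], queries)
        if query_id in reverse[:k]:
            consistent.append(candidate_id)
        else:
            flagged.append(candidate_id)
    # Consistent candidates keep their forward order; flagged ones go last.
    return consistent + flagged, flagged


# Toy embeddings, invented for illustration only.
QUERIES = {
    "q1": [1.0, 0.0],
    "q2": [0.8, 0.6],
}
CANDIDATES = {
    "target": [0.9, -0.45],
    "hard_negative": [0.9, 0.4],   # close to q1, but even closer to q2
    "other_frame": [0.7, -0.7],
}

forward_only = rank_by_similarity(QUERIES["q1"], CANDIDATES)
reranked, flagged = reciprocal_check("q1", QUERIES, CANDIDATES)

print(forward_only)
print(reranked)
print(flagged)
```

In this toy case the forward ranking alone puts `hard_negative` first for `q1`. The reverse check notices that `hard_negative` matches `q2` better than `q1`, flags it, and `target` moves to the top. No model weights are updated at any point, which is the sense in which such a filter is training-free.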
The broader implication extends beyond image retrieval to how AI systems are evaluated generally. This research contributes to a necessary reckoning about evaluation standards in computer vision and multimodal AI, potentially influencing how future benchmarks are constructed and validated across the field.
- →Existing zero-shot composed image retrieval datasets overestimate model performance because they include data CLIP encountered during pre-training
- →ZeroSight uses video-sourced frames from after CLIP's training cutoff to ensure genuine zero-shot evaluation scenarios
- →The SC4CIR method uses symmetric consistency checks to identify hard negatives and improve performance without requiring model retraining
- →Benchmarking analysis of 27 methods reveals current metrics significantly exaggerate composed image retrieval capabilities
- →Video-based data construction ensures visually and semantically consistent reference-target pairs unavailable in public datasets