AI · Bullish · arXiv – CS AI · 6h ago · 7/10
DINORANKCLIP: DINOv3 Distillation and Injection for Vision-Language Pretraining with High-Order Ranking Consistency
Researchers introduce DINORANKCLIP, a vision-language pretraining framework that extends CLIP with DINOv3 distillation and injection and with high-order ranking consistency. It addresses limitations of contrastive learning by preserving fine-grained visual detail and by applying a third-order Plackett-Luce ranking model. The authors report consistent improvements across benchmarks at modest computational cost.
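For readers unfamiliar with the Plackett-Luce model named above, here is a minimal sketch of the standard Plackett-Luce log-likelihood of a ranking. It is background only: the summary does not say how the paper builds its third-order variant or ties it into CLIP training, so this does not reproduce the authors' objective. The function name `plackett_luce_log_likelihood` and the idea of using similarity scores as inputs are assumptions made for illustration.

```python
import math


def plackett_luce_log_likelihood(scores, ranking):
    """Log-probability of `ranking` under the standard Plackett-Luce model.

    scores:  one real-valued score per item (assumed here to be something
             like image-text similarities; this is an illustrative choice).
    ranking: item indices ordered from most to least preferred.
    """
    remaining = list(ranking)
    log_prob = 0.0
    for item in ranking:
        # Softmax normaliser over the items not yet placed,
        # computed as a max-shifted log-sum-exp for numerical stability.
        m = max(scores[j] for j in remaining)
        log_z = m + math.log(sum(math.exp(scores[j] - m) for j in remaining))
        # Probability of picking `item` next among the remaining items.
        log_prob += scores[item] - log_z
        remaining.remove(item)
    return log_prob


# Hypothetical scores for three candidates.
scores = [2.0, 1.0, 0.0]
agree = plackett_luce_log_likelihood(scores, [0, 1, 2])     # follows the scores
disagree = plackett_luce_log_likelihood(scores, [2, 1, 0])  # reverses them
```

A ranking that follows the score order gets a higher log-likelihood than one that reverses it, and with equal scores every ordering of three items has probability 1/6. A ranking-consistency loss would typically minimise the negative of this quantity for a target ordering.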