AI · Bullish · arXiv – CS AI · 6h ago
A Mixed Diet Makes DINO An Omnivorous Vision Encoder
Researchers have developed an 'Omnivorous Vision Encoder' that produces consistent feature representations across different visual modalities (RGB, depth, segmentation) of the same scene. The framework addresses the poor cross-modal alignment of existing vision encoders such as DINOv2 by training with two objectives: one maximizes feature alignment across modalities, and the other preserves discriminative semantics.
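To make the idea of a two-part training objective concrete, here is a minimal sketch in plain Python. It is not the paper's loss. The function names, the cosine-based terms, the `reference_feat` input (standing in for a feature from some frozen reference encoder), and the `weight` parameter are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dual_objective(feats_by_modality, reference_feat, weight=1.0):
    """Hypothetical loss: cross-modal alignment plus semantic preservation.

    feats_by_modality: dict of modality name -> feature vector for ONE scene
        (needs at least two modalities, e.g. "rgb" and "depth").
    reference_feat: feature from a frozen reference encoder (an assumption).
    weight: hypothetical trade-off between the two terms.
    """
    names = sorted(feats_by_modality)
    if len(names) < 2:
        raise ValueError("need at least two modalities to align")

    # Alignment term: mean (1 - cosine) over every pair of modalities,
    # which is zero when all modalities map to the same direction.
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    align = sum(
        1.0 - cosine(feats_by_modality[a], feats_by_modality[b])
        for a, b in pairs
    ) / len(pairs)

    # Preservation term: mean (1 - cosine) between each modality's feature
    # and the reference, so features cannot collapse to something arbitrary.
    preserve = sum(
        1.0 - cosine(feats_by_modality[m], reference_feat) for m in names
    ) / len(names)

    return align + weight * preserve


# Toy usage with 2-D features (values are made up for illustration).
aligned = {"rgb": [1.0, 0.0], "depth": [2.0, 0.0], "segmentation": [0.5, 0.0]}
misaligned = {"rgb": [1.0, 0.0], "depth": [0.0, 1.0]}
loss_aligned = dual_objective(aligned, reference_feat=[1.0, 0.0])
loss_misaligned = dual_objective(misaligned, reference_feat=[1.0, 0.0])
```

In this toy setup the aligned scene scores lower than the misaligned one. The second term is what keeps the encoder from satisfying alignment trivially, for example by mapping every input to the same vector.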