Human-like Object Grouping in Self-supervised Vision Transformers
Researchers developed a behavioral benchmark showing that self-supervised vision transformers, particularly those trained with DINO objectives, align closely with human object perception and segmentation behavior. Models with stronger object-centric representations better predict human visual judgments, and the structure of the Gram matrix (the matrix of pairwise inner products between a model's features) plays a key role in this perceptual alignment.
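
The summary does not say how the Gram matrix is computed, so the following is only a minimal sketch under stated assumptions: that it refers to pairwise cosine similarities between a model's patch embeddings, and that patches on the same object should score higher than patches on different objects. The function name `patch_gram_matrix`, the toy features, and the object labels are all hypothetical stand-ins, not the study's data or method.

```python
import numpy as np


def patch_gram_matrix(patch_features: np.ndarray) -> np.ndarray:
    """Return the cosine-similarity Gram matrix for features of shape (num_patches, dim)."""
    norms = np.linalg.norm(patch_features, axis=1, keepdims=True)
    unit = patch_features / np.clip(norms, 1e-12, None)  # guard against zero-norm rows
    return unit @ unit.T


# Toy stand-in for ViT patch embeddings (assumption): six patches, three per
# object, each a noisy copy of its object's prototype vector.
rng = np.random.default_rng(0)
prototypes = np.eye(8)[:2]              # two orthogonal object prototypes
labels = np.array([0, 0, 0, 1, 1, 1])   # hypothetical object id of each patch
features = prototypes[labels] + 0.05 * rng.standard_normal((6, 8))

gram = patch_gram_matrix(features)

# Compare mean similarity for same-object patch pairs against different-object pairs.
same_object = labels[:, None] == labels[None, :]
within = gram[same_object].mean()
across = gram[~same_object].mean()
print(f"within-object similarity: {within:.3f}")
print(f"across-object similarity: {across:.3f}")
```

With real models, `features` would be replaced by the patch tokens of a vision transformer for one image; a Gram matrix with a clear within-object versus across-object gap is one simple way to read "object-centric" structure off a representation.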