Do vision-language models search like humans? Reasoning tokens as a reaction-time analog in classic visual-search paradigms
Researchers test whether vision-language models (VLMs) exhibit human-like visual search behavior, using reasoning tokens as a proxy for cognitive effort. The study finds that VLMs reproduce some human signatures, such as increased effort in conjunction search, but diverge significantly on others, suggesting that reasoning tokens offer a new lens on machine visual cognition.
This research introduces a methodological innovation by treating reasoning tokens as a cognitive effort metric comparable to human reaction time in visual search tasks. The comparison across four classic psychophysical paradigms reveals that frontier VLMs demonstrate surprisingly human-like search characteristics in feature-detection tasks, suggesting that underlying architectural principles may converge on similar solutions for parallel processing. However, meaningful divergences emerge: VLMs show reversed effort slopes between target-present and target-absent conditions, maintain enumeration accuracy beyond human limits, and exhibit unpredictable deliberation patterns across model tiers.
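To make the slope comparison concrete, here is a minimal sketch of how an effort slope could be computed when reasoning tokens stand in for reaction time. It assumes the slope is a least-squares fit of mean token count against set size, as is conventional for human reaction-time data; the study's own fitting procedure is not described here. The function names and all numbers are hypothetical, chosen only for illustration. In human serial search, target-absent slopes are typically steeper than target-present slopes, often by about a factor of two, so a ratio below 1 would be one way to read a "reversed" pattern.

```python
from statistics import mean


def search_slope(set_sizes, effort):
    """Least-squares slope of effort (e.g. reasoning tokens) per added display item."""
    if len(set_sizes) != len(effort) or len(set_sizes) < 2:
        raise ValueError("need at least two paired observations")
    mx, my = mean(set_sizes), mean(effort)
    sxx = sum((x - mx) ** 2 for x in set_sizes)
    if sxx == 0:
        raise ValueError("set sizes must not all be identical")
    sxy = sum((x - mx) * (y - my) for x, y in zip(set_sizes, effort))
    return sxy / sxx


def absent_to_present_ratio(set_sizes, present, absent):
    """Ratio of the target-absent slope to the target-present slope."""
    return search_slope(set_sizes, absent) / search_slope(set_sizes, present)


# Hypothetical mean reasoning-token counts per set size (made-up values,
# not taken from the study).
set_sizes = [4, 8, 16, 32]
tokens_present = [120, 160, 240, 400]
tokens_absent = [100, 120, 160, 240]

ratio = absent_to_present_ratio(set_sizes, tokens_present, tokens_absent)
print(f"present slope: {search_slope(set_sizes, tokens_present):.1f} tokens/item")
print(f"absent slope:  {search_slope(set_sizes, tokens_absent):.1f} tokens/item")
print(f"absent/present ratio: {ratio:.2f}")
```

Because token counts are only comparable within a single model, a ratio like this is best compared across conditions for one model rather than across models.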
The findings highlight a critical gap between human visual cognition and current machine implementations. While VLMs succeed at simpler search tasks, their failures on harder ones and their inconsistent allocation of reasoning suggest that vision-language processing operates through different mechanisms than human attention. The observation that adaptive reasoning models sometimes abandon deliberation entirely on detection tasks reveals a brittleness in how current systems allocate computational resources.
For AI development, this work establishes psychophysical paradigms as a cost-effective benchmarking tool for visual cognition, one that requires neither extensive training data nor specialized evaluation frameworks. Simple controls that separate genuine search processes from resolution difficulty give researchers a fine-grained diagnostic. The research suggests that alignment with human-like visual attention may be neither necessary nor desirable for all tasks, but understanding these divergences is essential when deploying VLMs in applications that require human-compatible reasoning.
- Vision-language models reproduce human feature-search signatures but diverge on conjunction search effort slopes
- Reasoning tokens serve as a viable within-model proxy for cognitive effort and search difficulty
- Frontier models maintain accuracy on visual search tasks where mid-tier models collapse to chance performance
- Adaptive reasoning models sometimes decline deliberation on detection tasks entirely, creating unpredictable behavior
- Psychophysical paradigms offer an inexpensive, generalizable methodology for probing machine visual cognition