AI · Neutral · arXiv – CS AI · 6h ago · 6/10
ROSE: Benchmarking the Perception-to-Action Gap in Multimodal Models
Researchers introduced ROSE, a benchmark that evaluates how well multimodal large language models (MLLMs) convert visual information into context-specific actions. Across nine MLLMs, performance dropped by up to 44.5 percentage points when the task shifted from counting to region-conditioned actions, despite near-perfect human performance. The results point to a fundamental gap in how these models translate perception into actionable outputs.
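A gap reported in percentage points is a plain difference between two accuracies, not a relative change. The minimal sketch below shows that arithmetic. The function names, model names, and accuracy values are hypothetical, chosen for illustration only. They are not taken from the ROSE paper or its evaluation code.

```python
def perception_to_action_gap(counting_acc: float, action_acc: float) -> float:
    """Drop in percentage points between two accuracies given in percent."""
    return counting_acc - action_acc


def relative_drop(counting_acc: float, action_acc: float) -> float:
    """The same drop as a percentage of the counting accuracy."""
    return (counting_acc - action_acc) / counting_acc * 100.0


def max_gap(results: dict[str, tuple[float, float]]) -> tuple[str, float]:
    """Return the model with the largest percentage-point gap, and that gap."""
    gaps = {name: perception_to_action_gap(c, a) for name, (c, a) in results.items()}
    worst = max(gaps, key=gaps.get)
    return worst, gaps[worst]


# Hypothetical (counting, region-conditioned action) accuracies in percent.
# These are illustrative placeholders, NOT results from the benchmark.
hypothetical_results = {
    "model_a": (90.0, 60.0),
    "model_b": (80.0, 70.0),
}

worst_model, worst_gap = max_gap(hypothetical_results)
print(f"{worst_model}: {worst_gap:.1f} percentage points")
```

With these placeholder numbers, a 30-point gap from a 90% baseline is a relative drop of about 33%. This is why the unit matters when reading the "up to 44.5 percentage points" figure.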