AI · Neutral · arXiv – CS AI · 7h ago · 6/10
Benchmarks for Vision-Language Models in Urban Perception Should Be Reliability-Aware and Negotiated
Researchers argue that benchmarking vision-language models for urban perception tasks must account for human disagreement and measurement reliability rather than treating consensus as ground truth. A study of seven VLMs evaluated on 100 Montreal street scenes reveals that model performance correlates with inter-annotator reliability, highlighting the need for transparent uncertainty reporting in AI evaluation frameworks.