Benchmarks for Vision-Language Models in Urban Perception Should Be Reliability-Aware and Negotiated
Researchers argue that benchmarking vision-language models for urban perception tasks must account for human disagreement and measurement reliability rather than treating consensus as ground truth. A study of seven VLMs evaluated on 100 Montreal street scenes reveals that model performance correlates with inter-annotator reliability, highlighting the need for transparent uncertainty reporting in AI evaluation frameworks.
Vision-language models are increasingly deployed in real-world applications that affect urban planning and policy decisions, yet current benchmarking practices often ignore a fundamental measurement problem: human annotators frequently disagree on urban perception tasks. This research exposes a critical gap between how AI systems are typically evaluated and how they should be assessed when outputs inform governance decisions.
The Montreal street scene study demonstrates that when 12 annotators from different community organizations independently label 100 urban scenes across 30 dimensions, both explicit disagreement and non-response rates vary significantly by dimension. Importantly, model alignment with 'human consensus' tracks closely with inter-annotator reliability: dimensions where humans agree more yield better model-human agreement, while low human reliability correlates with weaker model performance.
This finding reframes the benchmarking problem entirely. Rather than pursuing a false consensus, the research advocates treating disagreement as valid measurement data that should be reported alongside model outputs. For appraisal dimensions like 'Overall Impression,' models and human annotators show distributional mismatches, including different rates of abstention, suggesting fundamental differences in how they approach subjective judgment.
The implications extend beyond academic rigor. Urban governance decisions increasingly rely on data-driven insights, and VLM-generated descriptions influence streetscape auditing, mapping initiatives, and public consultation processes. Deploying models without acknowledging measurement uncertainty risks legitimizing particular viewpoints while obscuring legitimate disagreement. The research calls for benchmark creators to make assumptions visible, model developers to report uncertainty explicitly, and institutions to negotiate label spaces and scoring policies with affected communities rather than treating them as technical artifacts.
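To make the idea of reliability-aware reporting concrete, the following is a minimal sketch of how per-dimension human reliability, non-response, and model-human agreement could be reported side by side. It is not the study's method: the pairwise-agreement measure, the plurality-vote "consensus", the function names, and the toy labels are all assumptions chosen for illustration, and the numbers it produces are not results from the Montreal data.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def pairwise_agreement(labels):
    """Share of annotator pairs giving the same label for one scene.

    None marks a non-response and is excluded. Returns None when fewer
    than two annotators responded, since agreement is then undefined.
    """
    answered = [x for x in labels if x is not None]
    pairs = list(combinations(answered, 2))
    if not pairs:
        return None
    return sum(a == b for a, b in pairs) / len(pairs)


def non_response_rate(scenes):
    """Fraction of all annotator slots left unanswered in one dimension."""
    total = sum(len(labels) for labels in scenes)
    missing = sum(x is None for labels in scenes for x in labels)
    return missing / total


def reliability(scenes):
    """Mean pairwise agreement over scenes where it is defined."""
    scores = [pairwise_agreement(labels) for labels in scenes]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores)


def model_agreement(scenes, predictions):
    """Fraction of scenes where the model matches the plurality human label.

    Ties are broken by first occurrence, which is an arbitrary choice.
    """
    hits, counted = 0, 0
    for labels, pred in zip(scenes, predictions):
        answered = [x for x in labels if x is not None]
        if not answered:
            continue
        plurality = Counter(answered).most_common(1)[0][0]
        counted += 1
        hits += pred == plurality
    return hits / counted


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical toy data: 3 dimensions x 3 scenes x 5 annotators.
human = {
    "greenery": [["yes"] * 5,
                 ["no", "no", "no", "no", "yes"],
                 ["yes", "yes", "yes", "yes", None]],
    "safety": [["yes", "yes", "yes", "no", "no"],
               ["no", "no", "yes", None, None],
               ["yes", "no", "no", "no", "yes"]],
    "overall_impression": [["good", "good", "bad", "neutral", None],
                           ["bad", "bad", "good", "neutral", None],
                           ["neutral", "neutral", "good", "bad", None]],
}
# Hypothetical model outputs, one label per scene.
model = {
    "greenery": ["yes", "no", "yes"],
    "safety": ["yes", "yes", "no"],
    "overall_impression": ["good", "good", "good"],
}

# Report reliability and non-response next to model agreement.
report = {
    dim: {
        "reliability": reliability(scenes),
        "non_response": non_response_rate(scenes),
        "model_agreement": model_agreement(scenes, model[dim]),
    }
    for dim, scenes in human.items()
}
for dim, row in report.items():
    print(dim, {k: round(v, 3) for k, v in row.items()})

r = pearson([row["reliability"] for row in report.values()],
            [row["model_agreement"] for row in report.values()])
print("correlation across dimensions:", round(r, 3))
```

Reporting the three columns together keeps a low model score on a low-reliability dimension from being read as a pure model failure. A real analysis would likely use a chance-corrected statistic such as Krippendorff's alpha and far more than three dimensions before interpreting any correlation.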
- Vision-language models for urban perception should be evaluated against human reliability metrics, not just consensus accuracy.
- Inter-annotator disagreement and non-response are valid measurement outcomes that reveal important limitations in both human and model judgment.
- Model-human agreement on urban perception tasks tracks inter-annotator reliability on the same dimensions.
- Current VLM benchmarking practices obscure uncertainty and measurement assumptions, which is problematic when outputs inform policy decisions.
- Benchmark design for governance-relevant AI requires negotiation with community stakeholders, not solely technical optimization.