AI · Bearish · arXiv – CS AI · 10h ago · 7/10
Jury Duty: Calibration and Orientation Failures in MLLM-as-a-Judge Under Cultural Ambiguity
Researchers reveal that multimodal language models used as judges fail to fairly evaluate culturally ambiguous content, exhibiting calibration and orientation biases when assessed against diverse human annotators. The study demonstrates these models systematically favor one cultural perspective while compressing their scoring scales, with implications for any AI system deployed across cultural contexts.