y0news
#ai-assessment · 4 articles
AI · Bullish · arXiv – CS AI · 4h ago
🧠

Advancing Multimodal Judge Models through a Capability-Oriented Benchmark and MCTS-Driven Data Generation

Researchers introduce M-JudgeBench, a comprehensive benchmark for evaluating Multimodal Large Language Models (MLLMs) used as judges, and propose the Judge-MCTS framework for improving judge-model training. The work addresses systematic weaknesses in existing MLLM judge systems through capability-oriented evaluation and enhanced data-generation methods.

AI · Neutral · arXiv – CS AI · 4h ago
🧠

A Unified Framework to Quantify Cultural Intelligence of AI

Researchers have developed a unified framework to systematically measure the cultural intelligence of AI systems as generative AI technologies expand globally. The framework addresses the need to assess AI's ability to operate across diverse cultural contexts, replacing fragmented evaluation approaches with a systematic methodology for measuring cultural competence.

AI · Neutral · arXiv – CS AI · 4h ago
🧠

MOSAIC: Unveiling the Moral, Social and Individual Dimensions of Large Language Models

Researchers introduce MOSAIC, the first comprehensive benchmark to evaluate moral, social, and individual characteristics of Large Language Models beyond traditional Moral Foundations Theory. The benchmark includes over 600 curated questions and scenarios drawn from nine validated questionnaires and four platform-based games, providing empirical evidence that current evaluation methods are insufficient for comprehensively assessing AI ethics.

AI · Neutral · arXiv – CS AI · 4h ago
🧠

Measuring What AI Systems Might Do: Towards A Measurement Science in AI

Researchers argue that current AI evaluation methods fail to properly measure true AI capabilities and propensities, which should be treated as dispositional properties. The paper proposes a more scientific framework for AI evaluation that maps causal relationships between contextual conditions and behavioral outputs, moving beyond simple benchmark averages.