#benchmark-datasets News & Analysis

3 articles tagged with #benchmark-datasets. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

3 articles

AINeutralarXiv – CS AI · 3d ago6/10

🧠

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

Researchers develop strategies for extending large language models as evaluation tools to multilingual settings, addressing challenges in low-resource languages. The study reveals that fine-tuned smaller models match proprietary performance when in-domain data exists, while larger zero-shot models excel in out-of-domain scenarios, providing practical guidance for building multilingual evaluation systems.

AINeutralarXiv – CS AI · May 96/10

🧠

SCRuB: Social Concept Reasoning under Rubric-Based Evaluation

Researchers introduce SCRuB, a novel evaluation framework for measuring how well large language models reason about social concepts—abstract ideas underlying norms, culture, and institutions. Testing frontier models against PhD-level experts on 4,711 prompts, the study finds AI models outperform human experts across all dimensions, with models preferred in 74.4% of comparative judgments, suggesting evaluation saturation in single-turn reasoning tasks.

AINeutralarXiv – CS AI · Apr 136/10

🧠

Mitigating Extrinsic Gender Bias for Bangla Classification Tasks

Researchers have developed RandSymKL, a debiasing technique for Bangla language models that mitigates gender bias in classification tasks like sentiment analysis and hate speech detection. The study introduces four manually annotated benchmark datasets with gender-perturbation testing and demonstrates that the approach effectively reduces bias while maintaining competitive accuracy compared to existing methods.