AI · Neutral · arXiv – CS AI · 3h ago · 6/10
Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study
Researchers develop strategies for extending LLM-as-a-judge evaluation to multilingual settings, addressing the challenges posed by low-resource languages. The study finds that smaller fine-tuned models match the performance of proprietary models when in-domain data is available, while larger zero-shot models perform better in out-of-domain scenarios. These findings offer practical guidance for building multilingual evaluation systems.