AINeutralarXiv โ CS AI ยท 5h ago6/10
๐ง
Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation
A research study reveals that AI model performance rankings change dramatically based on the evaluation language used, with GPT-4o performing best in English while Gemini leads in Arabic and Hindi. The study tested 55 development tasks across five languages and six AI models, showing no single model dominates across all languages.
๐ง GPT-4๐ง Gemini