🧠 AI · Neutral · Importance: 6/10
Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation
🤖 AI Summary
A research study finds that AI model performance rankings shift dramatically with the evaluation language: GPT-4o ranks highest in English, while Gemini leads in Arabic and Hindi. Across 55 development tasks, five languages, and six AI models, no single model dominates in all languages.
Key Takeaways
- AI model rankings can completely invert depending on the evaluation language used, challenging English-centric benchmarking.
- GPT-4o achieved the highest satisfaction rate in English (44.72%), while Gemini led in Arabic (51.72%) and Hindi (53.22%).
- Inter-model agreement on individual requirement judgments remains modest across all tested languages.
- Localizing judge-side instructions proved crucial: Hindi satisfaction dropped from 42.8% to 23.2% under partial localization.
- The study comprised 4,950 judge runs across five typologically diverse languages and six major AI models.
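The two metrics above can be made concrete with a small sketch. This is an illustrative reconstruction, not the paper's actual code: the judge names and verdict data below are hypothetical, and "satisfaction" is modeled as the fraction of binary requirement-level verdicts marked satisfied, with inter-model agreement as the mean pairwise fraction of requirements on which two judges give the same verdict.

```python
from itertools import combinations

def satisfaction_rate(judgments):
    """Fraction of requirement-level verdicts marked satisfied (1)."""
    return sum(judgments) / len(judgments)

def pairwise_agreement(model_judgments):
    """Mean fraction of requirements on which each pair of judge
    models returns the same satisfied/unsatisfied verdict."""
    rates = []
    for a, b in combinations(model_judgments, 2):
        ja, jb = model_judgments[a], model_judgments[b]
        same = sum(x == y for x, y in zip(ja, jb))
        rates.append(same / len(ja))
    return sum(rates) / len(rates)

# Hypothetical per-requirement verdicts (1 = satisfied) for three judges
verdicts = {
    "judge_a": [1, 0, 1, 1, 0],
    "judge_b": [1, 1, 0, 1, 0],
    "judge_c": [0, 0, 1, 1, 1],
}

print(satisfaction_rate(verdicts["judge_a"]))  # per-judge satisfaction
print(pairwise_agreement(verdicts))            # cross-judge agreement
```

Even when individual judges report similar satisfaction rates, their per-requirement verdicts can diverge, which is how agreement stays modest while aggregate scores look comparable.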
Models Mentioned
- GPT-4 (OpenAI)
- Gemini (Google)
Read Original via arXiv (cs.AI)