arXiv – CS AI · 10h ago
Magis-Bench: Evaluating LLMs on Magistrate-Level Legal Tasks
Researchers introduced Magis-Bench, a new benchmark for evaluating large language models on magistrate-level judicial tasks, drawn from Brazil's competitive exams for judges. Testing 23 state-of-the-art LLMs revealed that even top performers such as Google's Gemini-3-Pro-Preview score below 70% on complex legal reasoning and judicial writing tasks, indicating significant gaps in AI legal capabilities.