🧠 AI · ⚪ Neutral · Importance 6/10

Magis-Bench: Evaluating LLMs on Magistrate-Level Legal Tasks

arXiv – CS AI | Ramon Pires, Thales Sales Almeida, Celio Larcher Junior, Giovana Bonás, Hugo Abonizio, Marcos Piau, Roseval Malaquias Junior, Thiago Laitz, Rodrigo Nogueira
🤖 AI Summary

Researchers introduced Magis-Bench, a new benchmark for evaluating large language models on magistrate-level judicial tasks based on Brazilian competitive exams. Testing 23 state-of-the-art LLMs revealed that even top performers like Google's Gemini-3-Pro-Preview score below 70% on complex legal reasoning and judicial writing tasks, indicating significant gaps in AI legal capabilities.

Analysis

Magis-Bench addresses a critical gap in legal AI evaluation by focusing on judicial reasoning rather than legal advocacy. Existing benchmarks typically assess whether LLMs can produce legal arguments or documents, but this new benchmark evaluates a fundamentally different capability: the ability to adjudicate disputes by weighing competing claims, applying legal doctrine, and rendering reasoned decisions. This distinction matters because judging requires synthesis, discretion, and contextual understanding that differ substantially from document generation.

The benchmark's design reflects a sophisticated understanding of legal practice. Derived from Brazilian judicial competitive examinations administered between 2023 and 2025, its 74 questions include multi-turn discursive analysis and practical exercises that require drafting complete civil and criminal sentences (full written judicial decisions). The evaluation methodology employs four independent frontier models as judges and achieves remarkable inter-rater agreement (Kendall's W = 0.984), which validates the assessment approach and strengthens confidence in the results.
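For readers who want to see what that agreement statistic measures, below is a minimal sketch of Kendall's coefficient of concordance (W) computed over per-item scores from four judges. The judge scores here are illustrative placeholders, not values from the paper, and the basic formula without a tie correction is used.

```python
import numpy as np
from scipy.stats import rankdata

def kendalls_w(scores: np.ndarray) -> float:
    """Kendall's coefficient of concordance W (no tie correction).

    scores: (m_raters, n_items) matrix of scores; higher = better.
    Returns W in [0, 1], where 1 means all raters rank the items identically.
    """
    m, n = scores.shape
    # Rank the items separately for each rater (ties get average ranks).
    ranks = np.apply_along_axis(rankdata, 1, scores)
    # Total rank each item received across raters.
    rank_sums = ranks.sum(axis=0)
    # S: squared deviation of the rank sums from their mean.
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

# Illustrative only: 4 judge models grading 5 candidate answers on a 0-10 scale.
judge_scores = np.array([
    [6.9, 6.4, 5.8, 4.2, 3.1],
    [7.1, 6.5, 5.6, 4.0, 3.3],
    [6.8, 6.3, 5.9, 4.4, 3.0],
    [7.0, 6.6, 5.7, 4.1, 3.2],
])
print(f"Kendall's W = {kendalls_w(judge_scores):.3f}")  # 1.000: identical rankings
```

A W of 1.0 means the judges rank every answer identically, so the paper's reported 0.984 indicates near-perfect concordance among the four judge models; values near 0 would mean their rankings were essentially uncorrelated.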

Results reveal both promise and persistent limitations. Google's Gemini-3-Pro-Preview leads at 6.97/10, with Claude-4.5-Opus at 6.46/10, demonstrating that current frontier models possess meaningful judicial reasoning capacity. However, the sub-70% performance ceiling indicates that judicial-level legal tasks remain substantively difficult for current systems. This has immediate implications for legal technology development, suggesting that AI-assisted judicial decision-making remains premature without significant capability improvements.

The benchmark's public release with complete model outputs and evaluation code supports reproducible research in legal AI. Developers can now rigorously assess judicial reasoning capabilities, potentially driving focused improvements in this specialized domain. The work establishes both a measurement baseline and clear performance targets for future model development.
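As a rough illustration of how such an LLM-as-judge evaluation can be organized, the sketch below averages 0-10 grades from several judge models over a JSONL file of benchmark items. The field names (question, answer, rubric) and the judge interface are assumptions for illustration, not the schema of the released code.

```python
import json
import statistics
from typing import Callable

def score_with_judges(
    records_path: str,
    judges: dict[str, Callable[[str], float]],
) -> dict[str, float]:
    """Average each judge's 0-10 grades over all benchmark items.

    `judges` maps a judge-model name to a callable that takes the grading
    prompt and returns a numeric score (assumed interface, for illustration).
    """
    per_judge: dict[str, list[float]] = {name: [] for name in judges}
    with open(records_path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            # Hypothetical record fields: question, answer, rubric.
            prompt = (
                f"Question: {item['question']}\n"
                f"Candidate answer: {item['answer']}\n"
                f"Grading rubric: {item['rubric']}\n"
                "Return a single grade from 0 to 10."
            )
            for name, judge in judges.items():
                per_judge[name].append(judge(prompt))
    # Mean grade per judge; keeping the per-item scores also allows a
    # concordance check such as Kendall's W, as sketched earlier.
    return {name: statistics.mean(vals) for name, vals in per_judge.items()}
```

Retaining the per-item grades, rather than only the averages, is what makes the inter-judge agreement analysis possible.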

Key Takeaways
  • Magis-Bench introduces the first benchmark specifically designed to evaluate LLMs on judicial reasoning tasks rather than legal advocacy, using 74 questions from Brazilian judicial exams.
  • Top-performing models like Gemini-3-Pro-Preview achieve only 6.97/10, indicating judicial-level legal reasoning remains substantially challenging for current LLMs.
  • The evaluation methodology achieved exceptionally high inter-rater agreement (Kendall's W = 0.984) using four frontier models, validating the assessment approach.
  • Current performance levels suggest AI-assisted judicial decision-making would be premature without significant capability improvements in legal reasoning.
  • Public benchmark release with code and outputs enables reproducible research for advancing legal AI capabilities in specialized domains.
Models mentioned: Claude (Anthropic), Gemini (Google)
Read Original → via arXiv – CS AI