AI · Neutral · arXiv – CS AI · 3h ago · 6/10
BenGER: Benchmarking LLM Systems on Subsumption-Based Legal Reasoning in German Law
Researchers introduce BenGER, a benchmark dataset for evaluating large language models on German legal reasoning, comprising 596 exam-style cases and 531 doctrinal reasoning problems. The study reports that LLM-as-a-Judge frameworks can reach near-human consistency in legal assessment, and that human-AI collaboration substantially outperforms humans working unaided.