UA-Legal-Bench: A Benchmark for Evaluating Large Language Models on Ukrainian Legal Reasoning
Researchers introduced UA-Legal-Bench, a five-task benchmark for evaluating large language models on Ukrainian legal reasoning, drawing on a register of 99.5 million court decisions. The study reveals critical gaps in LLM evaluation for morphologically rich, non-Latin-script languages and shows that standard accuracy metrics mask poor performance on imbalanced legal tasks.
UA-Legal-Bench addresses a significant blind spot in AI evaluation: the overwhelming focus on English-centric benchmarks hides how poorly LLMs perform on morphologically complex, non-Latin-script languages like Ukrainian. Drawing on the 99.5 million court decisions in the EDRSR (Ukraine's Unified State Register of Court Decisions), the researchers built a five-task evaluation framework whose tasks include case classification, outcome prediction, and legal norm extraction. The result is a concrete measure of LLM limitations in an underrepresented language and legal domain.
The benchmark's findings carry important methodological implications. The research shows how accuracy misleads on imbalanced datasets: one model reached 62% accuracy on case-outcome prediction largely by predicting the majority classes, yet posted only 23% macro-F1, whereas genuinely capable models scored 44% macro-F1. Because macro-F1 averages per-class F1 with equal weight, it penalizes a model that ignores minority outcomes, while accuracy does not. Practitioners evaluating LLMs for legal applications should therefore report class-balanced metrics alongside raw accuracy. Few-shot effects were also sharply task-dependent, with improvements ranging from negligible to +38.6 percentage points, which suggests that the payoff from prompt engineering varies unpredictably across legal reasoning tasks.
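The gap between accuracy and macro-F1 can be illustrated with a minimal sketch. The label names and the 62/25/13 class split below are hypothetical and chosen only for illustration; they are not taken from the benchmark, and the resulting macro-F1 is not meant to reproduce the 23% figure reported in the study. The sketch scores a trivial predictor that always outputs the most frequent class.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in gold or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class that is never predicted correctly scores 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical, imbalanced outcome distribution (assumption, not benchmark data).
y_true = ["granted"] * 62 + ["denied"] * 25 + ["partially_granted"] * 13

# A degenerate "model" that always predicts the majority class.
majority_label = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_label] * len(y_true)

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy = {acc:.3f}, macro-F1 = {mf1:.3f}")
```

On this toy data the majority-class predictor gets 62% accuracy, but its macro-F1 is far lower because two of the three classes contribute an F1 of zero. In practice a library implementation such as scikit-learn's `f1_score` with `average="macro"` would be the usual choice; the hand-written version here only keeps the example dependency-free.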
The scaling analysis reveals fragmented performance patterns across model families. Some 8B-parameter models matched frontier-model performance on surface-level tasks, but scaling thresholds varied dramatically between families, indicating that larger is not universally better. For organizations deploying LLMs in legal domains across non-English markets, this work signals the inadequacy of existing evaluation frameworks and the risk of relying on models validated solely on English benchmarks. The open release of data, prompts, and predictions enables reproducible research and sets a precedent for developing language-specific legal benchmarks.
- English-centric LLM benchmarks systematically fail to detect failure modes in morphologically rich, non-Latin-script languages
- Standard accuracy metrics are misleading on imbalanced legal tasks; macro-F1 and task-specific metrics are essential
- Few-shot prompting effects vary dramatically by task, with improvements ranging from negligible to 38.6 percentage points
- Smaller models (8B parameters) can match frontier performance on surface-level legal tasks, but scaling patterns differ across model families
- Evaluation frameworks for legal LLM applications must incorporate language-specific benchmarks and balanced datasets