From Benchmarking to Reasoning: A Dual-Aspect, Large-Scale Evaluation of LLMs on Vietnamese Legal Text
Researchers evaluated four major LLMs (GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, Grok-1) on Vietnamese legal text simplification using a dual-aspect framework that combines benchmarking metrics with expert-validated error analysis. The study reveals a critical trade-off: models that excel at readability tend to sacrifice legal accuracy, while high accuracy scores often mask subtle but serious reasoning errors that matter in legal contexts.