The Cases LJP Never Sees: Prosecution Decision Prediction for More Complete Criminal Liability Assessment
Researchers introduce Prosecution Decision Prediction (PDP), a new legal AI benchmark that evaluates criminal liability assessment at the prosecutorial review stage rather than post-indictment. The study reveals that state-of-the-art large language models perform substantially worse on PDP tasks than traditional Legal Judgment Prediction, exposing significant gaps in AI's ability to evaluate evidence and apply legal discretion.
This research addresses a critical blind spot in legal AI evaluation by shifting focus upstream in the criminal justice process. Legal Judgment Prediction (LJP) has become the standard benchmark for assessing AI in criminal law, but it examines only cases that have already passed prosecutorial filtering. Cases that end in dismissed charges, insufficient evidence, or exempted liability remain invisible to it. The new PDP framework captures these previously unmeasured decisions through a dataset of 4,630 real Chinese prosecutorial decisions across 190 charges.
The findings reveal clear limitations in current LLM capabilities. State-of-the-art models that perform well on post-indictment judgment tasks perform substantially worse on PDP, which suggests they struggle to evaluate evidentiary sufficiency and to apply nuanced legal discretion. Mainstream enhancement techniques, including reinforcement learning from outcome rewards, fail to close this performance gap, suggesting the problem runs deeper than fine-tuning or prompt engineering alone can solve.
For the legal AI industry, these results highlight that benchmark selection directly shapes how we measure progress. A task that only examines prosecuted cases inherently overestimates AI readiness because it ignores the harder problem: deciding which cases merit prosecution. This has practical implications for jurisdictions considering AI-assisted prosecutorial review systems, as existing evaluations may not predict real-world performance.
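The overestimation argument above is essentially a selection-bias calculation, and a toy example may make it concrete. The sketch below uses entirely hypothetical counts (they are not figures from the study) and an illustrative `Case` record to compare accuracy measured only on prosecuted cases against accuracy measured on all reviewed cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    prosecuted: bool      # True if the case passed prosecutorial review
    model_correct: bool   # True if the model's prediction matched the real outcome


def accuracy(cases):
    """Fraction of cases where the model's prediction was correct."""
    return sum(c.model_correct for c in cases) / len(cases)


# Hypothetical toy counts, chosen only for illustration (NOT from the study):
# the model does well on prosecuted cases and poorly on non-prosecuted ones.
cases = (
    [Case(prosecuted=True, model_correct=True)] * 80
    + [Case(prosecuted=True, model_correct=False)] * 20
    + [Case(prosecuted=False, model_correct=True)] * 20
    + [Case(prosecuted=False, model_correct=False)] * 30
)

# An LJP-style evaluation sees only cases that were prosecuted.
prosecuted_only = [c for c in cases if c.prosecuted]
acc_filtered = accuracy(prosecuted_only)

# A PDP-style evaluation also includes cases that were not prosecuted.
acc_full = accuracy(cases)

print(f"prosecuted-only accuracy: {acc_filtered:.3f}")
print(f"all-cases accuracy:       {acc_full:.3f}")
```

Under these assumed counts, the filtered evaluation reports a higher accuracy than the full evaluation, because the cases on which the model is weakest never enter the filtered sample.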
The research underscores that legal reasoning requires capabilities beyond pattern matching on successful cases. Evidence evaluation, statutory interpretation, and discretionary judgment involve complex reasoning about incomplete information and competing legal values. Future legal AI development must address these gaps before deployment in actual prosecutorial decision-making contexts.
- PDP benchmark reveals LLMs perform significantly worse on prosecutorial decisions than on post-indictment judgment prediction tasks
- Standard legal AI evaluation excludes cases dismissed or rejected during prosecution, creating an incomplete assessment of AI capabilities
- Current enhancement methods including outcome-based reinforcement learning fail to improve PDP model discrimination
- Legal AI systems require capabilities in evidence evaluation and discretionary judgment that extend beyond existing benchmark domains
- Jurisdictions considering AI-assisted prosecution review should not rely on traditional LJP benchmarks to predict real-world performance