PRAIB: Peer Review AI Benchmark of Behaviour of LLM-Assisted Reviewing
Researchers introduce PRAIB, a benchmark framework that evaluates how Large Language Models perform peer review compared to human reviewers. Analysis of 11,000 LLM-generated reviews across major AI conferences reveals significant behavioral divergences: LLM ratings show less variability, a positive bias, and overconfidence, and LLM reviews frequently miss atomic weaknesses that human reviewers catch.
The peer review process, fundamental to scientific validation, faces mounting pressure as submission volumes surge. This research directly addresses whether LLMs can genuinely augment human reviewers or merely simulate review-like outputs. The PRAIB framework provides quantifiable metrics for specificity, style, and engagement patterns, moving beyond subjective assessment.
The empirical study examined 11,000 LLM-generated reviews of ICLR and NeurIPS papers spanning 2021-2025, comparing outputs from five proprietary and open-source models against the original human reviews. This temporal breadth captures model evolution while maintaining consistency in evaluation methodology. The findings reveal systemic limitations: LLMs generate longer, more syntactically complex reviews that paradoxically demonstrate shallower analysis than their human counterparts.
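As a rough illustration of what "longer, more syntactically complex" could mean in measurable terms, the sketch below computes word count and mean sentence length for a review. These are crude surface proxies chosen for this example, not the metrics PRAIB itself defines, and the function name `surface_stats` is hypothetical.

```python
import re


def surface_stats(review: str) -> dict:
    """Crude surface statistics for one review (illustrative proxies only)."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", review.strip()) if s]
    # Count alphanumeric tokens as words.
    words = re.findall(r"[A-Za-z0-9']+", review)
    mean_len = len(words) / len(sentences) if sentences else 0.0
    return {
        "n_words": len(words),
        "n_sentences": len(sentences),
        "mean_sentence_len": mean_len,
    }
```

Comparing these numbers for a human review and an LLM review of the same paper shows only the length and sentence-level gap; it says nothing about depth of analysis, which is the point the study makes.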
The systematic biases identified (positive rating bias, overconfidence, and inconsistent cross-referencing) suggest that LLMs pattern-match against training data rather than engage substantively with manuscript content. This has immediate implications for institutions considering LLM-assisted review workflows. Deploying models with these behavioral signatures could accelerate decision-making while degrading review quality, creating a false sense of efficiency.
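To make the rating findings concrete, here is a minimal sketch of how positive bias and reduced variability could be quantified for paired ratings of the same papers. The function name `rating_divergence` and the sample ratings are hypothetical and invented for illustration; they are not taken from the study, and PRAIB may compute these quantities differently.

```python
from statistics import mean, pstdev


def rating_divergence(human: list[float], llm: list[float]) -> dict:
    """Compare paired ratings (same papers, same order) from humans and an LLM."""
    if not human or len(human) != len(llm):
        raise ValueError("ratings must be non-empty and paired one-to-one")
    # Positive values mean the LLM rated the paper higher than the human did.
    diffs = [l - h for h, l in zip(human, llm)]
    return {
        "mean_bias": mean(diffs),      # > 0 indicates positive rating bias
        "human_std": pstdev(human),    # spread of human ratings
        "llm_std": pstdev(llm),        # spread of LLM ratings
    }


# Hypothetical ratings on a 1-10 scale, for illustration only.
human_ratings = [3, 5, 6, 8, 4]
llm_ratings = [6, 6, 7, 7, 6]
result = rating_divergence(human_ratings, llm_ratings)
```

In this made-up sample the LLM ratings cluster around 6 to 7 while the human ratings range from 3 to 8, so `mean_bias` is positive and `llm_std` is smaller than `human_std`, which is the pattern the article describes.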
The research establishes that LLMs currently excel at scale and speed metrics but fail at detecting specific technical weaknesses. Organizations considering LLM integration must recognize that current models function poorly as primary reviewers but may support administrative tasks or initial filtering. PRAIB provides diagnostic capability to measure when models improve sufficiently for deployment in higher-stakes contexts, establishing empirical guardrails rather than aspirational claims about AI in academia.
- LLMs generate longer, more complex reviews but systematically miss the specific technical weaknesses human reviewers identify.
- Machine-generated ratings show lower variability and positive bias compared to human reviewers, suggesting unreliable assessment signals.
- The PRAIB benchmark framework enables quantifiable measurement of review quality across specificity, style, and engagement dimensions.
- Cross-reference patterns in LLM reviews diverge from human norms and vary unpredictably across different models.
- Current LLMs are unsuitable as primary peer reviewers but may support administrative functions pending significant behavioral improvements.