TADDLE: A Tool-Augmented Agent for Detecting Deficient LLM-Generated Peer Reviews
Researchers introduce TADDLE, an AI system that detects quality deficiencies in LLM-generated peer reviews by splitting the analysis across specialized tools and framing detection as multi-label classification. The work addresses a growing problem in academic publishing, where AI-written reviews are fluent but potentially flawed, and it is backed by the first expert-annotated benchmark of 1,800 reviews across six defect categories.
The proliferation of large language models in academic peer review creates a hidden quality problem: LLM-generated reviews read smoothly and professionally while potentially containing substantive defects that human reviewers would catch. TADDLE tackles this by treating review quality assessment as a specialized detection problem rather than relying on generic text classification. The system's architecture uses specialized analysis tools for verification, correction, completion, and transformation. This mirrors how a human expert would audit a review by hand and reflects the view that review evaluation requires domain-specific reasoning.
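To make the tool-decomposition idea concrete, here is a minimal sketch of an auditor that runs several small tools over a review and collects what each one flags. Everything in it is an assumption for illustration: the `Tool` interface, the two toy heuristics, the label strings, and the `audit` function are hypothetical and are not TADDLE's actual tools or logic. A real agent would likely let a model decide which tool to call and when; this sketch simply runs every tool in a fixed loop.

```python
from typing import Callable, Dict, List

# Hypothetical interface: a tool maps review text to suspected defect labels.
Tool = Callable[[str], List[str]]


def verification_tool(review: str) -> List[str]:
    # Toy stand-in (not the paper's method): flag a review that never
    # points to a section, table, or figure of the paper.
    anchors = ("section", "table", "figure")
    if any(anchor in review.lower() for anchor in anchors):
        return []
    return ["unsupported_claim"]


def completion_tool(review: str) -> List[str]:
    # Toy stand-in (not the paper's method): flag a review that never
    # discusses novelty.
    if "novel" in review.lower():
        return []
    return ["missing_novelty_discussion"]


# Only two of the four tool roles are mocked here, to keep the sketch short.
TOOLS: Dict[str, Tool] = {
    "verification": verification_tool,
    "completion": completion_tool,
}


def audit(review: str, tools: Dict[str, Tool] = TOOLS) -> Dict[str, List[str]]:
    """Run every tool and keep only the ones that reported a finding."""
    report: Dict[str, List[str]] = {}
    for name, tool in tools.items():
        findings = tool(review)
        if findings:
            report[name] = findings
    return report


if __name__ == "__main__":
    print(audit("The method is sound and the paper is well written."))
    print(audit("Table 2 supports the claim that the approach is novel."))
```

The point of the structure is that each tool answers one narrow question, so a finding can be traced back to the check that produced it instead of to a single opaque score.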
This work emerges as major academic conferences grapple with a growing volume of LLM-generated submissions and reviews. The benchmark draws on real-world review data from ICLR 2025, making the findings immediately relevant to venues struggling with quality control. The multi-label classification approach acknowledges that a review can fail in several ways at once. For example, a review might offer a correct technical assessment but an incomplete discussion of novelty, which reflects how complex real reviews are.
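The difference between a binary quality judgment and a multi-label one can be shown in a few lines. The sketch below is illustrative only: the article says there are six defect categories but does not name them, so the `CATEGORIES` entries are placeholders, and the helper functions are hypothetical rather than part of TADDLE.

```python
from typing import List, Set

# Placeholder names: the source states there are six categories
# but does not list them, so these are NOT the benchmark's labels.
CATEGORIES = ["defect_a", "defect_b", "defect_c", "defect_d", "defect_e", "defect_f"]


def to_multi_hot(labels: Set[str]) -> List[int]:
    """Encode a set of defect labels as one 0/1 flag per category."""
    return [1 if category in labels else 0 for category in CATEGORIES]


def to_binary(labels: Set[str]) -> int:
    """Collapse the label set to a single deficient / not-deficient flag."""
    return 1 if labels else 0


def hamming_loss(true: Set[str], pred: Set[str]) -> float:
    """Fraction of categories on which the prediction disagrees with the truth."""
    t, p = to_multi_hot(true), to_multi_hot(pred)
    return sum(a != b for a, b in zip(t, p)) / len(CATEGORIES)


if __name__ == "__main__":
    annotated = {"defect_b", "defect_e"}  # a review with two defects
    predicted = {"defect_b"}              # a detector that found only one

    print(to_multi_hot(annotated))
    print(to_binary(annotated), to_binary(predicted))
    print(hamming_loss(annotated, predicted))
```

Under the binary view both label sets collapse to the same value, so the missed defect is invisible. The multi-hot encoding keeps one flag per category, so a per-category metric such as Hamming loss registers the miss.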
For the broader research community, TADDLE signals that automated review quality assurance is becoming essential infrastructure rather than optional tooling. Publishers and conference organizers face pressure to implement detection systems before LLM-generated reviews systematically degrade peer review quality. The released benchmark enables other teams to build competing systems, potentially sparking a detection-evasion arms race where review-generation models improve to fool detectors.
The significance extends beyond academic publishing. As AI systems increasingly handle critical evaluation tasks—whether code review, medical imaging analysis, or legal assessment—mechanisms for detecting systematic failures in AI-generated professional work become foundational. TADDLE demonstrates that tool-augmented agents can effectively decompose complex quality assessment problems, a pattern applicable across professional domains.
- TADDLE uses specialized analysis tools orchestrated by an agent to detect six categories of defects in AI-generated peer reviews with greater accuracy than generic classification methods.
- The first expert-annotated benchmark of 1,800 real peer reviews from ICLR 2025 provides concrete data showing LLM reviews can appear fluent while containing substantive deficiencies.
- Multi-label classification reveals reviews often fail in multiple ways simultaneously, requiring detection systems to identify specific defect types rather than binary quality judgments.
- Academic venues increasingly need automated quality assurance systems as LLM-generated reviews become common, creating demand for detection and audit tools.
- The tool-augmented agent architecture demonstrates that decomposing complex evaluation tasks into specialized components outperforms monolithic approaches to professional quality assessment.