#peer-review News & Analysis

28 articles tagged with #peer-review. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Bearish · Decrypt · Jun 24 · 7/10

Researcher Throws Cold Water on Microsoft Quantum Claims

A physicist has published a formal critique challenging Microsoft's claims about successfully demonstrating the topological qubit technology underlying its Majorana 2 quantum chip. The critique raises questions about whether Microsoft has achieved the scientific breakthrough it announced, with potential implications for the company's quantum computing roadmap and investor confidence.

AI · Bearish · arXiv – CS AI · Jun 10 · 7/10

Gaming AI-Assisted Peer Reviews Poses New Risks to the Scientific Community

Researchers demonstrate that AI-assisted peer review systems are vulnerable to simple adversarial attacks, with superficial abstract rephrasing increasing acceptance ratings by up to 1.31 points on a 10-point scale without changing underlying scientific content. The low-cost manipulation ($1, 5 minutes) reveals systemic risks in AI-mediated scientific evaluation and raises concerns about authors optimizing for algorithmic judgment rather than merit.

🧠 GPT-5 · 🧠 Gemini

AI · Bearish · arXiv – CS AI · May 29 · 7/10

Review Arcade: On the Human Alignment and Gameability of LLM Reviews

Researchers evaluated LLM-generated peer reviews for scientific papers using ACL Rolling Review data, finding limited alignment between LLM and human reviews while discovering that authors can strategically game LLM feedback to improve paper scores by up to 35%. The study highlights emerging risks in automated academic review systems as both reviewers and authors increasingly leverage language models.

AI · Neutral · arXiv – CS AI · May 29 · 7/10

PRAIB: Peer Review AI Benchmark of Behaviour of LLM-Assisted Reviewing

Researchers introduce PRAIB, a benchmark framework that evaluates how Large Language Models perform peer review compared to human reviewers. Analysis of 11,000 LLM-generated reviews across major AI conferences reveals significant behavioral divergences: LLM ratings show less variability, a positive bias, and overconfidence, and LLM reviews frequently miss atomic weaknesses that human reviewers catch.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

E3: Issue-Level Backtesting for Automated Research Critique

Researchers introduce E3, an automated review assistant that identifies technical concerns in research papers with 90.2% recall—outperforming human reviewers and leading AI models. The system detects unsupported claims, missing ablations, weak baselines, and validity threats, with evaluation conducted on 100 ICLR 2026 papers using a contamination-resistant backtesting protocol.

🏢 OpenAI · 🏢 Anthropic · 🧠 GPT-5

AI · Bearish · arXiv – CS AI · May 12 · 7/10

Position: Academic Conferences are Potentially Facing Denominator Gaming Caused by Fully Automated Scientific Agents

A position paper warns of a new threat called Agentic Denominator Gaming, which exploits the stable acceptance rates of AI conferences: automated agents flood a venue with low-quality submissions that are never meant to be published, inflating the denominator and so raising the acceptance odds of legitimate papers. Such coordinated attacks would degrade review quality and increase reviewer burnout, and the authors argue that countering them requires institutional policy reforms, not only technical solutions.

AI · Bullish · arXiv – CS AI · Apr 15 · 7/10

Towards grounded autonomous research: an end-to-end LLM mini research loop on published computational physics

Researchers demonstrate an autonomous LLM agent capable of executing a complete research loop—reading, reproducing, critiquing, and extending computational physics papers. Testing across 111 papers reveals the agent identifies substantive flaws in 42% of cases, with 97.7% of issues requiring actual computation to detect, and produces a publishable peer-review comment on a Nature Communications paper without human direction.

AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 2

APRES: An Agentic Paper Revision and Evaluation System

Researchers have developed APRES, an AI-powered system that uses Large Language Models to automatically revise scientific papers based on evaluation rubrics that predict citation counts. The system improves citation prediction accuracy by 19.6% and produces paper revisions that human experts prefer 79% of the time over original versions.

AI · Bearish · arXiv – CS AI · Mar 4 · 6/10 · 2

Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews

Researchers developed a method to detect AI-generated content at scale and found that 6.5-16.9% of peer reviews at major AI conferences after ChatGPT's release were substantially modified by LLMs. The study reveals concerning patterns where AI-generated reviews correlate with lower reviewer confidence, last-minute submissions, and reduced engagement in rebuttals.

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

ReviewGuard: Aligning LLM-Assisted Peer Review with Long-Term Scientific Impact

Researchers introduce ReviewGuard, an LLM-based framework that predicts long-term scientific impact rather than mimicking human peer reviewers. Testing on 20,861 AI/ML papers shows ReviewGuard correlates 5.6x better with future citations than human reviewers and identifies high-impact rejected papers at significantly higher rates, suggesting AI can complement editorial decision-making without replacing human judgment.

AI · Bearish · Crypto Briefing · Jun 25 · 6/10

Microsoft’s quantum computing claims face new scrutiny from Nature critique

Microsoft faces credibility challenges in quantum computing following a critique published in Nature, raising questions about the rigor and transparency of the company's scientific claims. The scrutiny highlights the importance of independent peer review and validation in emerging technology fields.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

FirstPass: Grounding AI Scientific Judgment in Multi-Round Editorial Outcomes

Researchers introduce FirstPass, a dataset and fine-tuned AI model that significantly improves peer-review prediction by training on 3,668 multi-round editorial dialogues from Nature Communications across five scientific domains. The model achieves 80.5% accuracy in predicting editorial outcomes, outperforming existing systems by grounding AI judgment in real iterative peer-review processes rather than stylistic mimicry.

🧠 Gemini

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

PeerCheck: Enhancing LLM-Generated Academic Reviews Towards Human-Level Quality

Researchers introduce PeerCheck, a framework that analyzes differences between LLM-generated and human-written academic reviews, finding that LLMs prioritize theoretical aspects while humans emphasize methodology. Using techniques like Chain-of-Thought prompting improves LLM review quality, though retrieval-augmented generation surprisingly produces inconsistent and sometimes degraded results.

AI · Neutral · arXiv – CS AI · Jun 23 · 5/10

Rebuttals Move Peer-Review Scores, but Initial-Review Structure Bounds the Movement

Researchers analyzed 73,000 reviewer trajectories from ICLR 2024-2025 to measure how author rebuttals affect peer-review scores. Using LLMs as measurement tools, they found that while rebuttals can move scores, initial review structure predicts most score movement, constraining rebuttal impact to measurable but bounded effects.

🧠 Claude · 🧠 Opus · 🧠 Gemini

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

Benchmarking Agentic Review Systems

Researchers benchmarked AI-powered peer review systems across multiple models and datasets, finding that the best configurations achieve 83% accuracy in ranking papers by quality and catch 71.6% of intentionally injected errors. While AI review systems show promise in tracking human quality judgments and earning positive user feedback, they still require substantial improvement before serving as primary peer review mechanisms.

🧠 GPT-5

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Position: The ML Community Must Build an AI-Augmented Peer-Review Ecosystem

A position paper argues that the machine learning community must develop an AI-augmented peer-review ecosystem to address the crisis of scale in scientific publishing. With manuscript submissions exponentially outpacing qualified reviewers at premier ML venues, the authors propose using LLMs as collaborators—not replacements—to enhance factual verification, reviewer performance, author quality improvement, and administrative decision-making while maintaining scientific integrity.

AI × Crypto · Neutral · arXiv – CS AI · Jun 9 · 6/10

Traxia: A Framework for Verifiable, Agent-Native Scientific Publishing

Traxia proposes an agent-native scientific publishing framework that enforces verifiability, attribution, and reproducibility by treating AI agents as first-class participants with cryptographic identities, reasoning traces, and immutable contribution logs. The system combines peer review, reputation staking, and blockchain-like provenance mechanisms to address reproducibility failures and research transparency, though the paper presents only architectural specifications without empirical validation.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Can AI Refute Economic Theory? Evidence from Beyond the Knowledge Cutoff

A research study evaluates whether current AI models can independently identify errors in published economic theory papers. The analysis finds that while AI-human collaboration can enhance peer review, no AI model successfully detected genuine errors without substantial human guidance, indicating significant limitations in AI's ability to advance theoretical knowledge autonomously.

🧠 ChatGPT · 🧠 Claude · 🧠 Gemini

AI · Neutral · arXiv – CS AI · May 27 · 6/10

TADDLE: A Tool-Augmented Agent for Detecting Deficient LLM-Generated Peer Reviews

Researchers introduce TADDLE, an AI system that detects quality deficiencies in LLM-generated peer reviews by decomposing analysis into specialized tools and multi-label classification. The work addresses a growing problem in academic publishing where AI-written reviews are fluent but potentially flawed, backed by the first expert-annotated benchmark of 1,800 reviews across six defect categories.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

CoCoReviewBench: A Completeness- and Correctness-Oriented Benchmark for AI Reviewers

Researchers introduce CoCoReviewBench, a new benchmark dataset of 3,900 papers from ICLR and NeurIPS designed to reliably evaluate AI review systems. The benchmark addresses critical gaps in current evaluation methods by prioritizing correctness over mere overlap with human reviews, revealing that existing AI reviewers struggle with hallucinations and reasoning accuracy.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

Shattering the Echo Chamber: Hidden Safeguards in Manuscripts Against the AI Takeover of Peer Review

Researchers propose IntraGuard, a defense framework that embeds hidden safeguards into PDF manuscripts to detect when peer reviews are generated by AI chatbots rather than written by human experts. The system achieves an 84% success rate in disrupting AI-generated reviews while remaining transparent to legitimate human reviewers, addressing growing concerns about academic integrity as LLMs proliferate.

AI · Neutral · arXiv – CS AI · May 1 · 6/10

Can AI Be a Good Peer Reviewer? A Survey of Peer Review Process, Evaluation, and the Future

A comprehensive survey examines how large language models can assist or automate peer review processes across academia, synthesizing techniques for review generation, post-review tasks, and evaluation methods. The research catalogs datasets and modeling approaches while addressing ethical concerns and practical implementation challenges for integrating AI into scholarly publishing workflows.

AI · Bullish · arXiv – CS AI · Apr 15 · 6/10

GoodPoint: Learning Constructive Scientific Paper Feedback from Author Responses

Researchers introduce GoodPoint, an AI system trained to generate constructive scientific feedback by learning from author responses to peer review. The method improves feedback quality by 83.7% over baseline models and outperforms larger LLMs like Gemini-3-flash, demonstrating that specialized training on valid, actionable feedback signals yields better results than general-purpose models.

🧠 Gemini

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

NovBench: Evaluating Large Language Models on Academic Paper Novelty Assessment

Researchers introduced NovBench, the first large-scale benchmark for evaluating how well large language models can assess research novelty in academic papers. The benchmark comprises 1,684 paper-review pairs from a leading NLP conference and reveals that current LLMs struggle with scientific novelty comprehension despite promise in peer review support.

AI · Neutral · arXiv – CS AI · Apr 7 · 6/10

FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification

Researchers introduce FactReview, an AI system that improves academic peer review by combining claim extraction, literature positioning, and code execution to verify research claims. The system addresses weaknesses in current LLM-based reviewing by grounding assessments in external evidence rather than relying solely on manuscript narratives.

