AI · Bearish · arXiv – CS AI · 7h ago · 7/10
On the Limits of LLM-as-Judge for Scientific Novelty Assessment
Researchers show that large language models systematically overestimate the novelty of AI-generated research questions relative to human expert assessment, revealing a critical gap in LLM-based scientific evaluation. The study introduces RQ-Bench, a benchmark showing that while LLMs rate model-generated questions as highly novel, domain experts prefer author-anchored reference questions and find that many AI-generated questions lack depth or originality.