y0news

#llm-judge News & Analysis

2 articles tagged with #llm-judge. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Rethinking Literature Search Evaluation: Deep Research Helps, and Human Citation Lists Are Not a Ground Truth

Researchers demonstrate that deep literature search pipelines raise recall from roughly 20% to 80% compared with basic API searches. The same study finds that human citation lists contain significant bias and are unsuitable as ground truth for evaluation, and it advocates multi-dimensional metrics beyond simple recall for assessing citation quality.

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 3

WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality

Researchers introduce WebDevJudge, a benchmark for evaluating how well AI models judge web development quality compared with human experts. The study reveals significant gaps between AI judges and human evaluation, highlighting fundamental limitations in AI's ability to assess complex, interactive web development tasks.