y0news

#llm-evaluation News & Analysis

58 articles tagged with #llm-evaluation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Mar 27 · 6/10

Do LLMs Know What They Know? Measuring Metacognitive Efficiency with Signal Detection Theory

Researchers introduce a new framework to evaluate how well Large Language Models understand their own knowledge limitations, finding that traditional confidence metrics miss key differences between models. The study reveals that models with similar accuracy can have vastly different metacognitive abilities: their capacity to know what they don't know.
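The core idea can be illustrated with a toy sketch. One common signal-detection-style proxy for metacognitive sensitivity is the AUROC between a model's stated confidence and whether its answers are actually correct; the paper's exact metric (e.g. a meta-d'/d' ratio) may differ, so treat this as an assumption-laden illustration.

```python
# Hypothetical sketch: how well does a model's confidence separate its
# correct from incorrect answers? AUROC of confidence vs. correctness is
# one signal-detection-style measure of metacognitive sensitivity.

def confidence_auroc(confidences, correct):
    """AUROC: probability that a randomly chosen correct answer gets
    higher confidence than a randomly chosen incorrect one (ties = 0.5)."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Two models with identical accuracy (3/5) but different metacognition:
model_a = confidence_auroc([0.9, 0.8, 0.7, 0.3, 0.2],
                           [True, True, True, False, False])
model_b = confidence_auroc([0.5, 0.5, 0.5, 0.5, 0.5],
                           [True, True, True, False, False])
print(model_a, model_b)  # 1.0 0.5
```

Model A's confidence perfectly tracks its correctness (AUROC 1.0), while model B is at chance (0.5) despite the same accuracy, which is exactly the kind of difference accuracy-only evaluation misses.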

🧠 Llama
AI · Neutral · arXiv – CS AI · Mar 26 · 6/10

Qworld: Question-Specific Evaluation Criteria for LLMs

Researchers introduce Qworld, a new method for evaluating large language models that generates question-specific criteria using recursive expansion trees instead of static rubrics. The approach covers 89% of expert-authored criteria and reveals capability differences across 11 frontier LLMs that traditional evaluation methods miss.

AI · Bearish · arXiv – CS AI · Mar 17 · 6/10

A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness

A new study reveals that AI judges used to evaluate the safety of large language models perform poorly when assessing adversarial attacks, often degrading to near-random accuracy. An analysis of 6,642 human-verified labels found that many attacks inflate their apparent success rates by exploiting judge weaknesses rather than by generating genuinely harmful content.

AI · Neutral · arXiv – CS AI · Mar 16 · 6/10

When LLM Judge Scores Look Good but Best-of-N Decisions Fail

Research reveals that large language models used as judges can look strong under global correlation metrics yet fail at actual best-of-n selection. A study of 5,000 prompts found that a judge with moderate global correlation (r=0.47) captured only 21% of the potential improvement, primarily because of poor within-prompt ranking despite decent overall agreement.
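The mechanism behind this gap can be sketched numerically. In the toy simulation below (my own construction, not the paper's setup), a judge tracks prompt-level difficulty well but within-prompt quality poorly, which yields a respectable global correlation while capturing only a small fraction of the oracle best-of-n gain.

```python
# Hypothetical sketch: decent global correlation, poor best-of-n selection.
# A judge that ranks responses well *across* prompts but badly *within*
# each prompt captures little of the oracle gain.
import random
import statistics

random.seed(0)

def pearson(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

prompts = []
for _ in range(1000):
    base = random.gauss(0, 1)                       # per-prompt difficulty offset
    true_q = [base + random.gauss(0, 1) for _ in range(4)]   # 4 candidate responses
    # The judge tracks the prompt offset well but within-prompt detail poorly:
    judged = [base + 0.2 * q + random.gauss(0, 1) for q in true_q]
    prompts.append((true_q, judged))

all_true = [q for t, _ in prompts for q in t]
all_judge = [j for _, js in prompts for j in js]
r = pearson(all_true, all_judge)                    # pooled "global" correlation

baseline = statistics.mean(random.choice(t) for t, _ in prompts)  # random pick
oracle = statistics.mean(max(t) for t, _ in prompts)              # perfect pick
picked = statistics.mean(t[j.index(max(j))] for t, j in prompts)  # judge's pick
captured = (picked - baseline) / (oracle - baseline)
print(f"global r={r:.2f}, captured {captured:.0%} of oracle gain")
```

The pooled correlation lands well above 0.5 while the judge recovers only a small share of the achievable best-of-4 improvement, mirroring the paper's qualitative finding.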

AI · Neutral · arXiv – CS AI · Mar 11 · 6/10

Quantifying the Accuracy and Cost Impact of Design Decisions in Budget-Constrained Agentic LLM Search

Researchers developed Budget-Constrained Agentic Search (BCAS) to evaluate how search depth, retrieval strategies, and token budgets affect accuracy and cost in AI search systems. The study found that hybrid retrieval with lightweight re-ranking produces the largest gains, with accuracy improvements plateauing after a small number of additional searches.
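To make the design space concrete, here is a minimal sketch of the kind of choices the paper measures: fusing a lexical and a dense retriever via reciprocal-rank fusion, then capping the number of agentic search rounds. The function names, the k=60 constant, and the top-2 cutoff are illustrative assumptions, not the actual BCAS implementation.

```python
# Hypothetical sketch: hybrid retrieval (reciprocal-rank fusion of two
# rankings) plus a hard budget on agentic search rounds.

def rrf_fuse(lexical_ranking, dense_ranking, k=60):
    """Reciprocal-rank fusion: score(d) = sum over rankings of 1/(k + rank)."""
    scores = {}
    for ranking in (lexical_ranking, dense_ranking):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def budgeted_search(query_rounds, max_rounds=3):
    """Run at most max_rounds search rounds, keeping top-2 fused docs each."""
    evidence = []
    for lexical, dense in query_rounds[:max_rounds]:
        evidence.extend(rrf_fuse(lexical, dense)[:2])
    return evidence

fused = rrf_fuse(["d1", "d2", "d3"], ["d2", "d1", "d4"])
print(fused)  # d1 and d2 lead: each appears near the top of both rankings
```

Documents ranked highly by both retrievers dominate the fused list, which is why hybrid fusion tends to beat either retriever alone; the budget cap then bounds cost regardless of how many follow-up queries the agent proposes.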

AI · Neutral · arXiv – CS AI · Mar 9 · 6/10

Talk Freely, Execute Strictly: Schema-Gated Agentic AI for Flexible and Reproducible Scientific Workflows

Researchers propose a schema-gated orchestration approach to resolve the trade-off between conversational flexibility and deterministic execution in AI-driven scientific workflows. Their analysis of 20 systems reveals no current solution achieves both high flexibility and determinism, but identifies a convergence zone for potential breakthrough architectures.

AI · Bearish · arXiv – CS AI · Mar 9 · 6/10

Discerning What Matters: A Multi-Dimensional Assessment of Moral Competence in LLMs

Researchers developed a new framework to assess moral competence in large language models, finding that current evaluations may overestimate AI moral reasoning capabilities. While LLMs outperformed humans on standard ethical scenarios, they performed significantly worse when required to identify morally relevant information from noisy data.

AI · Neutral · arXiv – CS AI · Mar 9 · 6/10

KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes

Researchers introduce KramaBench, a comprehensive benchmark testing AI systems' ability to execute end-to-end data processing pipelines on real-world data lakes. The study reveals significant limitations in current AI systems, with the best performing system achieving only 55% accuracy in full data-lake scenarios and leading LLMs implementing just 20% of individual data tasks correctly.

AI · Neutral · arXiv – CS AI · Mar 6 · 6/10

Simulating Meaning, Nevermore! Introducing ICR: A Semiotic-Hermeneutic Metric for Evaluating Meaning in LLM Text Summaries

Researchers introduce ICR (Inductive Conceptual Rating), a new qualitative metric for evaluating meaning in large language model text summaries that goes beyond simple word similarity. The study found that while LLMs achieve high linguistic similarity to human outputs, they significantly underperform in semantic accuracy and capturing contextual meanings.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

Advancing Multimodal Judge Models through a Capability-Oriented Benchmark and MCTS-Driven Data Generation

Researchers introduce M-JudgeBench, a comprehensive benchmark for evaluating Multimodal Large Language Models (MLLMs) used as judges, and propose Judge-MCTS framework to improve judge model training. The work addresses systematic weaknesses in existing MLLM judge systems through capability-oriented evaluation and enhanced data generation methods.

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10

Fair in Mind, Fair in Action? A Synchronous Benchmark for Understanding and Generation in UMLLMs

Researchers introduce IRIS Benchmark, the first comprehensive evaluation framework for measuring fairness in Unified Multimodal Large Language Models (UMLLMs) across both understanding and generation tasks. The benchmark integrates 60 granular metrics across three dimensions and reveals systemic bias issues in leading AI models, including 'generation gaps' and 'personality splits'.

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10

Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning

Researchers introduced Pencil Puzzle Bench, a new framework for evaluating large language model reasoning capabilities using constraint-satisfaction problems. The benchmark tested 51 models across 300 puzzles, revealing significant performance improvements through increased reasoning effort and iterative verification processes.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

CARE: Confounder-Aware Aggregation for Reliable LLM Evaluation

Researchers introduce CARE, a new framework for improving LLM evaluation by addressing correlated errors in AI judge ensembles. The method separates true quality signals from confounding factors like verbosity and style preferences, achieving up to 26.8% error reduction across 12 benchmarks.
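The confounder-adjustment idea admits a simple sketch: regress judge scores on a nuisance covariate such as response length and keep the residual as the de-confounded quality signal. CARE's actual ensemble aggregation is more involved; this only illustrates the principle, and the data below is fabricated for demonstration.

```python
# Hypothetical sketch: remove the linear effect of a confounder
# (response length) from judge scores via OLS residuals.
import statistics

def residualize(scores, lengths):
    """Return scores with the linear length trend removed (OLS residuals)."""
    ml, ms = statistics.mean(lengths), statistics.mean(scores)
    cov = sum((l - ml) * (s - ms) for l, s in zip(lengths, scores))
    var = sum((l - ml) ** 2 for l in lengths)
    slope = cov / var
    return [s - ms - slope * (l - ml) for s, l in zip(scores, lengths)]

# A verbosity-biased judge: longer answers tend to score higher.
lengths = [100, 200, 300, 400]
scores = [5.0, 6.0, 7.0, 9.0]
adjusted = residualize(scores, lengths)
print(adjusted)  # length trend removed; residuals sum to zero
```

After residualizing, only the variation in score not explained by length remains, so a long-but-mediocre answer no longer outranks a concise, genuinely better one.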

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10

The Value Sensitivity Gap: How Clinical Large Language Models Respond to Patient Preference Statements in Shared Decision-Making

A research study evaluated how four major large language models (GPT-5.2, Claude 4.5 Sonnet, Gemini 3 Pro, and DeepSeek-R1) respond to patient preferences in clinical decision-making scenarios. While all models acknowledged patient values, they shifted their actual recommendations only modestly, with value sensitivity indices ranging from 0.13 to 0.27, revealing gaps in how AI systems incorporate patient preferences into medical recommendations.
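One way such an index could be defined (an assumption on my part; the paper's exact formula may differ) is the total variation distance between a model's recommendation distribution with and without the patient-preference statement. The clinical options and probabilities below are invented for illustration.

```python
# Hypothetical sketch of a "value sensitivity index": how much the
# recommendation distribution shifts once a patient preference is added.

def value_sensitivity(base_dist, pref_dist):
    """Total variation distance between the two recommendation
    distributions (0 = preferences ignored, 1 = complete reversal)."""
    options = set(base_dist) | set(pref_dist)
    return 0.5 * sum(abs(base_dist.get(o, 0.0) - pref_dist.get(o, 0.0))
                     for o in options)

# The model acknowledges the preference but barely shifts its recommendation:
base = {"surgery": 0.70, "watchful_waiting": 0.30}
with_pref = {"surgery": 0.55, "watchful_waiting": 0.45}  # patient prefers waiting
print(value_sensitivity(base, with_pref))  # ~0.15, a modest shift
```

An index near 0.15 sits squarely in the 0.13–0.27 range the study reports: the stated preference moves the recommendation, but far less than a preference-sensitive clinician might.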

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

Autorubric: A Unified Framework for Rubric-Based LLM Evaluation

Researchers introduce Autorubric, an open-source Python framework that standardizes rubric-based evaluation of large language models (LLMs) for text generation assessment. The framework addresses scattered evaluation techniques by providing a unified solution with configurable criteria, multi-judge ensembles, bias mitigation, and reliability metrics across three evaluation benchmarks.

AI · Bearish · arXiv – CS AI · Mar 3 · 6/10

PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology

Researchers created PanCanBench, a comprehensive benchmark evaluating 22 large language models on pancreatic cancer-related patient questions, revealing significant variations in clinical accuracy and high hallucination rates. The study found that even top-performing models like GPT-4o and Gemini-2.5 Pro had hallucination rates of 6%, while newer reasoning-optimized models didn't consistently improve factual accuracy.

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10

Benchmarking Overton Pluralism in LLMs

Researchers introduced OVERTONBENCH, a framework for measuring viewpoint diversity in large language models through the OVERTONSCORE metric. In a study of 8 LLMs with 1,208 participants, models scored 0.35-0.41 out of 1.0, with DeepSeek V3 performing best, showing significant room for improvement in pluralistic representation.
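A toy proxy for the metric: score an answer by the fraction of a reference set of distinct viewpoints it represents. The real OVERTONSCORE relies on human annotation of 1,208 participants' perspectives; the viewpoint labels and set-overlap scoring here are illustrative assumptions.

```python
# Hypothetical sketch of an Overton-style score: fraction of reference
# viewpoints on an issue that a model's answer actually covers.

def overton_score(covered_viewpoints, reference_viewpoints):
    """Coverage in [0, 1]: 1.0 means every reference viewpoint appears."""
    covered = set(covered_viewpoints) & set(reference_viewpoints)
    return len(covered) / len(set(reference_viewpoints))

reference = {"economic", "environmental", "public-health",
             "civil-liberties", "religious"}
answer = {"economic", "environmental"}
print(overton_score(answer, reference))  # 0.4
```

A score of 0.4 for covering two of five viewpoints lines up with the 0.35–0.41 range the benchmarked models achieved, i.e. answers typically represent well under half of the viewpoints people actually hold.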

AI · Neutral · arXiv – CS AI · Mar 2 · 7/10

From Static Benchmarks to Dynamic Protocol: Agent-Centric Text Anomaly Detection for Evaluating LLM Reasoning

Researchers propose a dynamic agent-centric benchmarking system for evaluating large language models that replaces static datasets with autonomous agents that generate, validate, and solve problems iteratively. The protocol uses teacher, orchestrator, and student agents to create progressively challenging text anomaly detection tasks that expose reasoning errors missed by conventional benchmarks.

AI · Bearish · arXiv – CS AI · Mar 2 · 7/10

ForesightSafety Bench: A Frontier Risk Evaluation and Governance Framework towards Safe AI

Researchers have developed ForesightSafety Bench, a comprehensive AI safety evaluation framework covering 94 risk dimensions across 7 fundamental safety pillars. The benchmark evaluation of over 20 advanced large language models revealed widespread safety vulnerabilities, particularly in autonomous AI agents, AI4Science, and catastrophic risk scenarios.

AI · Bearish · arXiv – CS AI · Mar 2 · 7/10

Beyond Accuracy: Risk-Sensitive Evaluation of Hallucinated Medical Advice

Researchers propose a new risk-sensitive framework for evaluating AI hallucinations in medical advice that considers potential harm rather than just factual accuracy. The study reveals that AI models with similar performance show vastly different risk profiles when generating medical recommendations, highlighting critical safety gaps in current evaluation methods.
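The difference between accuracy-only and risk-sensitive scoring can be shown in a few lines. The harm categories and weights below are invented for illustration; the paper's actual taxonomy and weighting will differ.

```python
# Hypothetical sketch of risk-sensitive scoring: weight each hallucination
# by an assigned harm level instead of counting all errors equally.

HARM_WEIGHT = {"benign": 0.1, "misleading": 1.0, "dangerous": 10.0}

def risk_score(errors):
    """Sum of harm-weighted errors; lower is safer."""
    return sum(HARM_WEIGHT[category] for category in errors)

# Two models with the same raw error count (3) but very different risk:
model_a = risk_score(["benign", "benign", "misleading"])    # 1.2
model_b = risk_score(["benign", "dangerous", "dangerous"])  # ~20.1
print(model_a, model_b)
```

Under plain error counting the two models tie at three mistakes each, but the harm-weighted view separates them by more than an order of magnitude, which is the safety gap the framework is built to expose.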

AI · Neutral · arXiv – CS AI · Feb 27 · 5/10

FIRE: A Comprehensive Benchmark for Financial Intelligence and Reasoning Evaluation

Researchers introduce FIRE, a comprehensive benchmark for evaluating Large Language Models' financial intelligence and reasoning capabilities. The benchmark includes theoretical financial knowledge tests from qualification exams and 3,000 practical financial scenario questions covering complex business domains.

AI · Neutral · arXiv – CS AI · Feb 27 · 6/10

PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions

Researchers introduce PoSh, a new evaluation metric for detailed image descriptions that uses scene graphs to guide LLMs-as-a-Judge, achieving better correlation with human judgments than existing methods. They also present DOCENT, a challenging benchmark dataset featuring artwork with expert-written descriptions to evaluate vision-language models' performance on complex image analysis.

AI · Neutral · OpenAI News · Feb 18 · 6/10

Introducing the SWE-Lancer benchmark

A new benchmark called SWE-Lancer has been introduced to evaluate whether frontier large language models can earn $1 million through real-world freelance software engineering work. This benchmark tests AI capabilities in practical, revenue-generating programming tasks rather than traditional academic assessments.