#ai-evaluation News & Analysis

Coverage of #ai-evaluation has remained relatively stable over the past month, with 32 articles added in the last 30 days out of 160 total indexed. The discussion leans heavily neutral at 71.9%, while bullish sentiment accounts for 9.4% and bearish views represent 18.8%, marking only a slight 3.5 percentage point shift in bullish sentiment compared to the previous 90-day period. Academic research dominates the conversation, with arXiv's computer science and AI sections contributing the vast majority of indexed articles. Recent discussions frequently center on major language models including GPT-5, Gemini, and Claude. Related coverage typically intersects with #benchmark, #machine-learning, #research, and #llm topics. Scan the articles below for the latest developments in this area.

sentiment · last 30d (32 articles)

Top sources:arXiv – CS AI · 120Decrypt · 1Fortune Crypto · 1MIT News – AI · 1Hugging Face Blog · 1

Often co-tagged with:#benchmark #machine-learning #research #llm #ai-research #language-models

Most-discussed entities:GPT-5 · 8Gemini · 8Claude · 7Llama · 5GPT-4 · 5

240 articles

AINeutralarXiv – CS AI · 2d ago7/10

🧠

Benchmarking at the Edge of Comprehension

Researchers propose Critique-Resilient Benchmarking, a new framework for evaluating large language models when human comprehension of tasks becomes infeasible. The method uses adversarial evaluation where answers are deemed correct if no convincing counterargument exists, allowing meaningful comparison of frontier LLMs even as they saturate traditional benchmarks.

AINeutralarXiv – CS AI · 2d ago7/10

🧠

PRAIB: Peer Review AI Benchmark of Behaviour of LLM-Assisted Reviewing

Researchers introduce PRAIB, a benchmark framework that evaluates how Large Language Models perform peer review compared to human reviewers. Analysis of 11,000 LLM-generated reviews across major AI conferences reveals significant behavioral divergences: LLM ratings show less variability, positive bias, overconfidence, and frequently miss atomic weaknesses that human reviewers catch.

AINeutralarXiv – CS AI · 2d ago7/10

🧠

MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models

MiraBench introduces a new evaluation framework for robotic world models that prioritizes action-conditioned reliability over visual fidelity. The benchmark reveals that current AI models struggle to faithfully follow commanded actions and exhibit persistent optimism bias when predicting outcomes of failure-inducing actions.

$OP

AINeutralarXiv – CS AI · 2d ago7/10

🧠

OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories

Researchers introduce OpenClawBench, a large-scale dataset of 31,264 annotated agent execution trajectories that reveals a significant gap between task success and process reliability. The study finds that 9.3% of oracle-passing executions contain process-side anomalies like unresolved ambiguities and unsafe operations, demonstrating that success metrics alone mask critical failure modes in AI agent systems.

AINeutralarXiv – CS AI · 3d ago7/10

🧠

Training Stratigraphy: Persistent Behavioral Artifacts in Large Language Models Observed Through Longitudinal AI-Human Interaction

Researchers document five persistent behavioral patterns in large language models that survive system prompt changes, discovered through 8 months of sustained interaction with Claude models. The study proposes that intimate longitudinal AI-human interaction reveals training artifacts invisible to standard evaluation, with the AI system itself co-authoring findings from first-person perspective.

🧠 Sonnet🧠 Opus

AIBearisharXiv – CS AI · 4d ago7/10

🧠

Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

Researchers reveal that AI models can possess stable factual knowledge while failing dramatically at compositional reasoning—assembling facts into logical chains—a problem invisible to standard benchmark metrics. The study introduces a diagnostic protocol showing post-training improvements mask directional shifts in composition capability, with failures often rooted in generation-time constraints rather than fundamental model limitations.

AINeutralarXiv – CS AI · 4d ago7/10

🧠

Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows

Researchers introduce Trajel, a dataset and evaluation framework for detecting hallucinations in multi-step LLM agent workflows, revealing that existing benchmarks miss intermediate failures. The framework defines five hallucination types and shows that trajectory-level detection outperforms traditional post-hoc verification, highlighting critical gaps in current AI safety evaluation methodologies.

AIBearisharXiv – CS AI · 4d ago7/10

🧠

Turning Bias into Bugs: Bandit-Guided Style Manipulation Attacks on LLM Judges

Researchers demonstrate BITE, a black-box adversarial attack framework that exploits stylistic biases in LLM judges to artificially inflate evaluation scores while preserving semantic meaning. The attack achieves over 65% success rates across diverse LLM judges and tasks, exposing fundamental vulnerabilities in using language models for objective evaluation.

AIBearisharXiv – CS AI · 4d ago7/10

🧠

RepoMirage: Probing Repository Context Reasoning in Code Agents with Perturbations

Researchers introduce RepoMirage, an evaluation suite that tests whether code agents truly understand repository context by applying perturbations to challenge their reasoning abilities. The study reveals a significant gap in how agents handle complex, multi-file code tasks, with performance dropping from 66.8% to 25.3% when explicit structural understanding is required.

AINeutralarXiv – CS AI · 4d ago7/10

🧠

QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

Researchers introduce QUACK, an evaluation framework for auditing whether AI agents in social deduction games actually ground their language in perceived reality or hallucinate claims. Testing three frontier vision-language models reveals that even top performers hallucinate 15% of spatial claims and make accusations without evidence, exposing critical gaps in agent reasoning reliability.

AIBearisharXiv – CS AI · May 127/10

🧠

The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space

Researchers reveal that multimodal large language models achieve high visual reasoning benchmark scores by exploiting a 'Cartesian Shortcut'—leveraging grid-based layouts that convert to explicit text coordinates rather than performing genuine visual understanding. The Polaris-Bench study shows frontier models collapse from 70-83% accuracy to 31-39% when benchmarks are reformulated in polar coordinate space, exposing critical deficiencies in topology-invariant reasoning.

AIBearisharXiv – CS AI · May 127/10

🧠

Causal Stories from Sensor Traces: Auditing Epistemic Overreach in LLM-Generated Personal Sensing Explanations

Researchers identified epistemic overreach in LLM-generated explanations of personal sensing data, where AI systems produce coherent-sounding narratives about anomalous days without sufficient evidentiary support. Testing 14,922 explanations across three LLM families revealed that models routinely attribute causes without data justification, and this problem persists even when provided richer context or explicit instructions to constrain claims.

🧠 Llama

AIBearisharXiv – CS AI · May 127/10

🧠

Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions

Researchers have identified a critical failure mode in large language models called 'pseudo-deliberation,' where LLMs appear to reason about their stated values but fail to align their actions accordingly. The study introduces VALDI, a framework measuring value-action gaps across 4,941 scenarios, and proposes VIVALDI, a multi-agent auditor to address misalignment in both proprietary and open-source models.

AINeutralarXiv – CS AI · May 127/10

🧠

ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

Researchers introduced ComplexMCP, a benchmark for evaluating large language model agents in realistic, complex environments with interdependent tools and environmental noise. Testing revealed that current LLMs achieve only 60% success rates compared to 90% human performance, identifying three critical failure modes: tool retrieval saturation, over-confidence, and strategic defeatism.

AINeutralarXiv – CS AI · May 127/10

🧠

Delulu: A Verified Multi-Lingual Benchmark for Code Hallucination Detection in Fill-in-the-Middle Tasks

Microsoft researchers released Delulu, a benchmark dataset containing 1,951 code generation samples across 7 programming languages designed to test how well large language models detect hallucinations in Fill-in-the-Middle tasks. Testing 11 open-weight models revealed fundamental limitations, with even the strongest achieving only 84.5% accuracy, indicating that code hallucination remains a persistent challenge across all model families.

AINeutralarXiv – CS AI · May 127/10

🧠

Towards Conversational Medical AI with Eyes, Ears and a Voice

Researchers have developed AI co-clinician, a multimodal conversational AI system that processes real-time audio and video data to assist with clinical decision-making in telemedicine settings. In simulated consultations with medical residents, the system approached physician-level performance on diagnostic tasks while significantly outperforming text-only AI models, though physicians still maintained superior overall clinical reasoning.

🧠 Gemini

AINeutralarXiv – CS AI · May 127/10

🧠

EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents

Researchers introduce EnactToM, a benchmark testing whether AI agents can understand and act on others' beliefs in multi-agent embodied environments. Current frontier models achieve 0% on functional theory of mind tasks, revealing a critical gap in AI reasoning capabilities despite performing better on direct belief questions.

AIBearisharXiv – CS AI · May 127/10

🧠

Log analysis is necessary for credible evaluation of AI agents

Researchers argue that AI agent benchmarks relying solely on pass/fail outcomes mask critical evaluation gaps, including inflated scores from shortcuts, poor real-world predictability, and hidden dangerous behaviors. Log analysis—systematic tracking of agent inputs, execution, and outputs—is proposed as essential for credible evaluation, with case studies showing performance metrics can underestimate capability by 50% and hide deployment failure modes.

AIBearisharXiv – CS AI · May 117/10

🧠

How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem

A new empirical study evaluates how Large Language Models perform on the Equivalence Class Problem, a simple yet computationally demanding long-chain reasoning task. The research reveals that non-reasoning LLMs fail entirely at the task, while reasoning-capable models perform significantly better but still struggle with complete accuracy, with performance patterns differing based on problem complexity metrics.

AIBullisharXiv – CS AI · May 117/10

🧠

Text-to-CAD Evaluation with CADTests

Researchers introduce CADTestBench, the first test-based evaluation framework for Text-to-CAD systems that uses executable software tests to verify whether AI-generated CAD models meet geometric and topological requirements. The framework enables both comprehensive benchmarking of existing methods and improved model generation through test-guided approaches, addressing a significant gap in CAD model evaluation methodology.

🏢 Hugging Face

AINeutralarXiv – CS AI · May 117/10

🧠

Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents

Researchers introduce PhoneSafety, a benchmark of 700 safety-critical moments across mobile apps, revealing that stronger AI phone-use agents don't necessarily make safer decisions at risky moments. The study distinguishes between genuine safety judgment and mere inability to act, challenging how AI safety in mobile agents is currently evaluated.

AIBearisharXiv – CS AI · May 117/10

🧠

More Thinking, More Bias: Length-Driven Position Bias in Reasoning Models

Researchers discovered that reasoning-capable AI models like DeepSeek-R1 exhibit increasing position bias as their reasoning chains grow longer, contradicting assumptions that extended thinking reduces heuristic biases. The effect persists across multiple model sizes and datasets, suggesting that longer reasoning trajectories actually accumulate bias rather than eliminate it, with critical implications for multiple-choice question evaluation.

🧠 Llama

AINeutralarXiv – CS AI · May 97/10

🧠

XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity

Researchers introduce XL-SafetyBench, a comprehensive safety evaluation framework for large language models across 10 country-language pairs with 5,500 test cases. The study reveals that frontier LLMs show decoupled jailbreak robustness and cultural awareness, while local models often exhibit apparent safety driven by generation failure rather than genuine alignment.

AIBearisharXiv – CS AI · May 97/10

🧠

Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges

Researchers demonstrate that LLM-based safety judges for AI agents fail a critical reliability test: they produce inconsistent verdicts based on how evaluation policies are worded rather than what agents actually do. The study reveals that up to 9.1% of safety judgments flip when policies are rewritten with identical meaning, undermining the trustworthiness of current AI safety benchmarks.

AIBearisharXiv – CS AI · May 97/10

🧠

How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study

Researchers present ImmersedPrivacy, an evaluation framework that tests Vision-Language Models' ability to recognize and respect privacy in physical environments. Testing 12 state-of-the-art VLMs reveals significant deficiencies: all models struggle with cluttered scenes, none exceed 65% accuracy when social context changes, and even the best model only balances task completion with privacy preservation 51% of the time.

Page 1 of 10Next →