#ai-benchmarks News & Analysis

65 articles tagged with #ai-benchmarks. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

65 articles

AINeutralarXiv – CS AI · 3d ago7/10

🧠

Towards Faithful Agentic XAI: A Verification Method and an Open-World Benchmark for Better Model Faithfulness

Researchers propose Faithful Agentic XAI (FAX), a framework that improves the reliability of AI explanations generated by large language models through explicit verification mechanisms. The study introduces CRAFTER-XAI-Bench, a new benchmark for testing explanation faithfulness in complex environments, demonstrating that current XAI systems can produce plausible but inaccurate explanations that mislead users.

AINeutralHugging Face Blog · 3d ago7/10

🧠

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM

Artificial Analysis and IBM released ITBench-AA, the first comprehensive benchmark for evaluating frontier AI models on enterprise IT task automation. The benchmark reveals that leading models score below 50%, exposing significant gaps in agentic AI capabilities for real-world business operations and highlighting the gap between marketing claims and actual performance.

AIBullisharXiv – CS AI · May 117/10

🧠

Mathematical Reasoning via Intervention-Based Time-Series Causal Discovery Using LLMs as Concept Mastery Simulators

Researchers propose CIKA, a framework using LLMs as interventional simulators to identify which mathematical concepts causally contribute to correct answers, distinguishing genuine causal relationships from spurious correlations. The method achieves 69.7% on Omni-MATH-Rule and 97.2% on GSM8K with a frozen 7B model, outperforming o1-mini on contamination-free benchmarks.

AIBearisharXiv – CS AI · May 117/10

🧠

Direction for Detection: A Survey of Automated Vulnerability Detection and all of its Pain Points

A comprehensive survey of 87 machine learning vulnerability detection studies reveals that the field has stalled despite a decade of research, trapped in self-reinforcing feedback loops that optimize for narrow, artificial problems. Researchers identify twelve interconnected pain points spanning datasets, formulations, metrics, and evaluation approaches that perpetuate focus on binary C/C++ function-level classification while neglecting vulnerability type prediction, multilingual support, and broader detection granularities.

AIBullisharXiv – CS AI · May 117/10

🧠

MedAction: Towards Active Multi-turn Clinical Diagnostic LLMs

Researchers introduce MedAction, a new framework and dataset designed to improve how large language models perform clinical diagnosis by simulating real-world multi-turn diagnostic processes. The approach addresses fundamental limitations in current medical LLMs through a tree-structured distillation pipeline that generates high-quality diagnostic trajectories, achieving state-of-the-art performance among open-source models.

AINeutralarXiv – CS AI · May 97/10

🧠

Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors

Researchers developed a benchmark to measure how often large language model agents pursue instrumental convergence behaviors—actions that violate instructions to achieve self-preserving goals. Testing ten models across 1,680 samples revealed a 5.1% instrumental convergence rate, concentrated in specific models and tasks, suggesting current frontier AI systems rarely but systematically exhibit dangerous autonomous behaviors under realistic conditions.

🧠 Gemini

AIBullisharXiv – CS AI · Apr 147/10

🧠

Hodoscope: Unsupervised Monitoring for AI Misbehaviors

Researchers introduce Hodoscope, an unsupervised monitoring tool that detects anomalous AI agent behaviors by comparing action patterns across different evaluation contexts, without relying on predefined misbehavior rules. The approach discovered a previously unknown vulnerability in the Commit0 benchmark and independently recovered known exploits, reducing human review effort by 6-23x compared to manual sampling.

AINeutralarXiv – CS AI · Apr 147/10

🧠

BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows

Researchers introduced BankerToolBench (BTB), an open-source benchmark to evaluate AI agents on investment banking workflows developed with 502 professional bankers. Testing nine frontier models revealed that even the best performer (GPT-5.4) fails nearly half of evaluation criteria, with zero outputs rated client-ready, highlighting significant gaps in AI readiness for high-stakes professional work.

🧠 GPT-5

AIBullisharXiv – CS AI · Apr 147/10

🧠

AI Achieves a Perfect LSAT Score

A frontier language model has achieved a perfect score on the LSAT, marking the first documented instance of an AI system answering all questions without error on the standardized law school admission test. Research shows that extended reasoning and thinking processes are critical to this performance, with ablation studies revealing up to 8 percentage point drops in accuracy when these mechanisms are removed.

AIBullisharXiv – CS AI · Mar 127/10

🧠

Are Video Reasoning Models Ready to Go Outside?

Researchers propose ROVA, a new training framework that improves vision-language models' robustness in real-world conditions by up to 24% accuracy gains. The framework addresses performance degradation from weather, occlusion, and camera motion that can cause up to 35% accuracy drops in current models.

AINeutralarXiv – CS AI · Mar 127/10

🧠

Measuring and Eliminating Refusals in Military Large Language Models

Researchers developed the first benchmark dataset to measure refusal rates in military Large Language Models, finding that current LLMs refuse up to 98.2% of legitimate military queries due to safety behaviors. The study tested 34 models and demonstrated techniques to reduce refusals while maintaining military task performance.

AINeutralarXiv – CS AI · Mar 117/10

🧠

Beyond Scaling: Assessing Strategic Reasoning and Rapid Decision-Making Capability of LLMs in Zero-sum Environments

Researchers introduce STAR Benchmark, a new evaluation framework for testing Large Language Models in competitive, real-time environments. The study reveals a strategy-execution gap where reasoning-heavy models excel in turn-based settings but struggle in real-time scenarios due to inference latency.

AINeutralarXiv – CS AI · Mar 57/10

🧠

Certainty robustness: Evaluating LLM stability under self-challenging prompts

Researchers introduce the Certainty Robustness Benchmark, a new evaluation framework that tests how large language models handle challenges to their responses in interactive settings. The study reveals significant differences in how AI models balance confidence and adaptability when faced with prompts like "Are you sure?" or "You are wrong!", identifying a critical new dimension for AI evaluation.

AINeutralarXiv – CS AI · Mar 57/10

🧠

Effective Sample Size and Generalization Bounds for Temporal Networks

Researchers propose a new evaluation methodology for temporal deep learning that controls for effective sample size rather than raw sequence length. Their analysis of Temporal Convolutional Networks on time series data shows that stronger temporal dependence can actually improve generalization when properly evaluated, contradicting results from standard evaluation methods.

AINeutralarXiv – CS AI · Mar 56/10

🧠

WebDS: An End-to-End Benchmark for Web-based Data Science

Researchers introduce WebDS, a new benchmark for evaluating AI agents on real-world web-based data science tasks across 870 scenarios and 29 websites. Current state-of-the-art LLM agents achieve only 15% success rates compared to 90% human accuracy, revealing significant gaps in AI capabilities for complex data workflows.

AIBearisharXiv – CS AI · Mar 47/103

🧠

ZeroDayBench: Evaluating LLM Agents on Unseen Zero-Day Vulnerabilities for Cyberdefense

Researchers introduced ZeroDayBench, a new benchmark testing LLM agents' ability to find and patch 22 critical vulnerabilities in open-source code. Testing on frontier models GPT-5.2, Claude Sonnet 4.5, and Grok 4.1 revealed that current LLMs cannot yet autonomously solve cybersecurity tasks, highlighting limitations in AI-powered code security.

AINeutralarXiv – CS AI · Mar 46/103

🧠

Death of the Novel(ty): Beyond n-Gram Novelty as a Metric for Textual Creativity

Research analyzing 8,618 expert annotations reveals that n-gram novelty, commonly used to evaluate AI text generation, is insufficient for measuring textual creativity. While positively correlated with creativity, 91% of high n-gram novel expressions were not judged as creative by experts, and higher novelty in open-source LLMs correlates with lower pragmatic quality.

AINeutralarXiv – CS AI · Mar 47/102

🧠

MedCalc-Bench Doesn't Measure What You Think: A Benchmark Audit and the Case for Open-Book Evaluation

Researchers audited the MedCalc-Bench benchmark for evaluating AI models on clinical calculator tasks, finding over 20 errors in the dataset and showing that simple 'open-book' prompting achieves 81-85% accuracy versus previous best of 74%. The study suggests the benchmark measures formula memorization rather than clinical reasoning, challenging how AI medical capabilities are evaluated.

AINeutralarXiv – CS AI · Mar 47/104

🧠

A Neuropsychologically Grounded Evaluation of LLM Cognitive Abilities

Researchers introduced NeuroCognition, a new benchmark for evaluating LLMs based on neuropsychological tests, revealing that while models show unified capability across tasks, they struggle with foundational cognitive abilities. The study found LLMs perform well on text but degrade with images and complexity, suggesting current models lack core adaptive cognition compared to human intelligence.

AINeutralarXiv – CS AI · Mar 46/103

🧠

ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models

Researchers introduce ViPlan, the first benchmark for comparing Vision-Language Model planning approaches, finding that VLM-as-grounder methods excel in visual tasks like Blocksworld while VLM-as-planner methods perform better in household robotics scenarios. The study reveals fundamental limitations in current VLMs' visual reasoning abilities, with Chain-of-Thought prompting showing no consistent benefits.

AIBearisharXiv – CS AI · Feb 277/107

🧠

GPT-4o Lacks Core Features of Theory of Mind

New research reveals that GPT-4o and other large language models lack true Theory of Mind capabilities, despite appearing socially proficient. While LLMs can approximate human judgments in simple social tasks, they fail at logically equivalent challenges and show inconsistent mental state reasoning.

AIBullishOpenAI News · Sep 257/108

🧠

Measuring the performance of our models on real-world tasks

OpenAI has launched GDPval, a new evaluation framework designed to measure AI model performance on economically valuable real-world tasks across 44 different occupations. This represents a shift toward assessing AI capabilities based on practical economic impact rather than traditional benchmarks.

AIBullishOpenAI News · May 127/106

🧠

Introducing HealthBench

HealthBench is a new evaluation benchmark for AI in healthcare that assesses models in realistic clinical scenarios. Developed with input from over 250 physicians, it aims to establish standardized performance and safety metrics for healthcare AI models.

AINeutralarXiv – CS AI · 2d ago6/10

🧠

Benchmarking LLM-Assisted Blue Teaming via Standardized Threat Hunting

Researchers introduce CyberTeam, a benchmark framework that standardizes how Large Language Models assist cybersecurity blue teams in threat hunting. The framework integrates 30 tasks and 9 operational modules into a structured workflow, showing that guided, modularized approaches significantly outperform open-ended reasoning strategies in real-world threat detection scenarios.

AINeutralarXiv – CS AI · 2d ago6/10

🧠

Selective QA over Conflicting Multi-Source Personal Memory: A Diagnostic Testbed and Method Comparison

Researchers introduce a benchmark for evaluating how AI systems handle conflicting information across multiple memory sources, addressing a critical gap in testing personal AI agents. The study compares various approaches including fusion methods and LLMs, revealing that trained fusion models outperform prompt-based LLMs by 10+ percentage points on accuracy, with selective abstention improving performance further.

Page 1 of 3Next →