47 articles tagged with #ai-benchmarks. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Neutral · arXiv – CS AI · 2d ago · 7/10
🧠 Researchers introduced BankerToolBench (BTB), an open-source benchmark, developed with 502 professional bankers, that evaluates AI agents on investment banking workflows. Testing nine frontier models revealed that even the best performer (GPT-5.4) fails nearly half of the evaluation criteria, with zero outputs rated client-ready, highlighting significant gaps in AI readiness for high-stakes professional work.
🧠 GPT-5
AI · Bullish · arXiv – CS AI · 2d ago · 7/10
🧠 Researchers introduce Hodoscope, an unsupervised monitoring tool that detects anomalous AI agent behaviors by comparing action patterns across different evaluation contexts, without relying on predefined misbehavior rules. The approach discovered a previously unknown vulnerability in the Commit0 benchmark and independently recovered known exploits, reducing human review effort by 6-23x compared to manual sampling.
AI · Bullish · arXiv – CS AI · 2d ago · 7/10
🧠 A frontier language model has achieved a perfect score on the LSAT, marking the first documented instance of an AI system answering all questions without error on the standardized law school admission test. Research shows that extended reasoning and thinking processes are critical to this performance, with ablation studies revealing accuracy drops of up to 8 percentage points when these mechanisms are removed.
AI · Neutral · arXiv – CS AI · Mar 12 · 7/10
🧠 Researchers developed the first benchmark dataset to measure refusal rates of Large Language Models on military queries, finding that current LLMs refuse up to 98.2% of legitimate military queries due to safety behaviors. The study tested 34 models and demonstrated techniques to reduce refusals while maintaining military task performance.
AI · Bullish · arXiv – CS AI · Mar 12 · 7/10
🧠 Researchers propose ROVA, a new training framework that improves vision-language models' robustness in real-world conditions, delivering accuracy gains of up to 24%. The framework addresses performance degradation from weather, occlusion, and camera motion, which can cause accuracy drops of up to 35% in current models.
AI · Neutral · arXiv – CS AI · Mar 11 · 7/10
🧠 Researchers introduce STAR Benchmark, a new evaluation framework for testing Large Language Models in competitive, real-time environments. The study reveals a strategy-execution gap where reasoning-heavy models excel in turn-based settings but struggle in real-time scenarios due to inference latency.
AI · Neutral · arXiv – CS AI · Mar 5 · 6/10
🧠 Researchers introduce WebDS, a new benchmark for evaluating AI agents on real-world web-based data science tasks across 870 scenarios and 29 websites. Current state-of-the-art LLM agents achieve only 15% success rates compared to 90% human accuracy, revealing significant gaps in AI capabilities for complex data workflows.
AI · Neutral · arXiv – CS AI · Mar 5 · 7/10
🧠 Researchers introduce the Certainty Robustness Benchmark, a new evaluation framework that tests how large language models handle challenges to their responses in interactive settings. The study reveals significant differences in how AI models balance confidence and adaptability when faced with prompts like "Are you sure?" or "You are wrong!", identifying a critical new dimension for AI evaluation.
AI · Neutral · arXiv – CS AI · Mar 5 · 7/10
🧠 Researchers propose a new evaluation methodology for temporal deep learning that controls for effective sample size rather than raw sequence length. Their analysis of Temporal Convolutional Networks on time series data shows that stronger temporal dependence can actually improve generalization when properly evaluated, contradicting results from standard evaluation methods.
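The card above turns on the idea of effective sample size: under temporal dependence, N correlated observations carry less information than N independent ones, so comparing models at equal raw sequence length is not an apples-to-apples comparison. A minimal sketch of the idea, assuming a simple AR(1) dependence structure with lag-1 autocorrelation rho (the paper's exact correction is not specified here):

```python
def effective_sample_size(n: int, rho: float) -> float:
    """Approximate effective sample size of n observations drawn from
    an AR(1) process with lag-1 autocorrelation rho (|rho| < 1).

    Standard result: ESS = n * (1 - rho) / (1 + rho)."""
    return n * (1 - rho) / (1 + rho)
```

Matching raw lengths across series with different rho compares unequal amounts of information; matching ESS instead (e.g. giving a strongly autocorrelated series proportionally more raw samples) is the kind of control the methodology argues for.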
AI · Bearish · arXiv – CS AI · Mar 4 · 7/10
🧠 Researchers introduced ZeroDayBench, a new benchmark testing LLM agents' ability to find and patch 22 critical vulnerabilities in open-source code. Testing on frontier models GPT-5.2, Claude Sonnet 4.5, and Grok 4.1 revealed that current LLMs cannot yet autonomously solve cybersecurity tasks, highlighting limitations in AI-powered code security.
AI · Neutral · arXiv – CS AI · Mar 4 · 7/10
🧠 Researchers introduced NeuroCognition, a new benchmark for evaluating LLMs based on neuropsychological tests, revealing that while models show unified capability across tasks, they struggle with foundational cognitive abilities. The study found LLMs perform well on text but degrade with images and complexity, suggesting current models lack core adaptive cognition compared to human intelligence.
AI · Neutral · arXiv – CS AI · Mar 4 · 7/10
🧠 Researchers audited the MedCalc-Bench benchmark for evaluating AI models on clinical calculator tasks, finding over 20 errors in the dataset and showing that simple 'open-book' prompting achieves 81-85% accuracy versus the previous best of 74%. The study suggests the benchmark measures formula memorization rather than clinical reasoning, challenging how AI medical capabilities are evaluated.
AI · Neutral · arXiv – CS AI · Mar 4 · 6/10
🧠 Researchers introduce ViPlan, the first benchmark for comparing Vision-Language Model planning approaches, finding that VLM-as-grounder methods excel in visual tasks like Blocksworld while VLM-as-planner methods perform better in household robotics scenarios. The study reveals fundamental limitations in current VLMs' visual reasoning abilities, with Chain-of-Thought prompting showing no consistent benefits.
AI · Neutral · arXiv – CS AI · Mar 4 · 6/10
🧠 Research analyzing 8,618 expert annotations reveals that n-gram novelty, commonly used to evaluate AI text generation, is insufficient for measuring textual creativity. While n-gram novelty correlates positively with creativity, 91% of expressions with high n-gram novelty were not judged creative by experts, and higher novelty in open-source LLMs correlates with lower pragmatic quality.
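For context on the metric being critiqued above: n-gram novelty is typically computed as the fraction of a generation's n-grams that never appear in a reference corpus. A minimal sketch of that standard metric (function and argument names are illustrative, not from the study):

```python
def ngram_set(tokens, n):
    """All contiguous n-grams in a token sequence, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_novelty(generated: str, corpus: str, n: int = 2) -> float:
    """Fraction of the generation's n-grams that are absent from the corpus."""
    gen = ngram_set(generated.split(), n)
    ref = ngram_set(corpus.split(), n)
    return len(gen - ref) / len(gen) if gen else 0.0
```

A high score only means the word sequences are unseen in the corpus; as the annotation study found, unseen is not the same as creative.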
AI · Bearish · arXiv – CS AI · Feb 27 · 7/10
🧠 New research reveals that GPT-4o and other large language models lack true Theory of Mind capabilities, despite appearing socially proficient. While LLMs can approximate human judgments in simple social tasks, they fail at logically equivalent challenges and show inconsistent mental state reasoning.
AI · Bullish · OpenAI News · Sep 25 · 7/10
🧠 OpenAI has launched GDPval, a new evaluation framework designed to measure AI model performance on economically valuable real-world tasks across 44 different occupations. This represents a shift toward assessing AI capabilities based on practical economic impact rather than traditional benchmarks.
AI · Bullish · OpenAI News · May 12 · 7/10
🧠 HealthBench is a new evaluation benchmark for AI in healthcare that assesses models in realistic clinical scenarios. Developed with input from over 250 physicians, it aims to establish standardized performance and safety metrics for healthcare AI models.
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠 Researchers introduce TimeSeriesExamAgent, a scalable framework for automatically generating time series reasoning benchmarks using LLM agents and templates. The study reveals that while large language models show promise in time series tasks, they significantly underperform in abstract reasoning and domain-specific applications across healthcare, finance, and weather domains.
AI · Neutral · MIT Technology Review · 2d ago · 6/10
🧠 Stanford University's 2026 AI Index report provides data-driven insights into the current state of artificial intelligence, offering a counterbalance to conflicting narratives about AI's impact on jobs, capabilities, and market dynamics. The annual report serves as a comprehensive assessment of AI development and adoption trends across the industry.
AI · Bearish · arXiv – CS AI · 3d ago · 6/10
🧠 Researchers introduce OmniBehavior, a benchmark for evaluating large language models' ability to simulate real-world human behavior across complex, long-horizon scenarios. The study reveals that current LLMs struggle with authentic behavioral simulation and exhibit systematic biases toward homogenized, overly positive personas rather than capturing individual differences and realistic long-tail behaviors.
AI · Bullish · arXiv – CS AI · 3d ago · 6/10
🧠 Researchers introduce VisionFoundry, a synthetic data generation pipeline that uses LLMs and text-to-image models to create targeted training data for vision-language models. The approach addresses VLMs' weakness in visual perception tasks and demonstrates 7-10% improvements on benchmark tests without requiring human annotation or reference images.
AI · Neutral · arXiv – CS AI · Apr 7 · 6/10
🧠 Researchers have developed a new automated pipeline that generates challenging math problems by first identifying specific mathematical concepts where LLMs struggle, then creating targeted problems to test these weaknesses. The method successfully reduced a leading LLM's accuracy from 77% to 45%, demonstrating its effectiveness at creating more rigorous benchmarks.
🧠 Llama
AI · Neutral · arXiv – CS AI · Apr 7 · 6/10
🧠 Researchers introduce GraphicDesignBench (GDB), the first comprehensive benchmark suite for evaluating AI models on professional graphic design tasks including layout, typography, and animation. Testing reveals current AI models struggle with spatial reasoning, vector code generation, and typographic precision despite showing promise in high-level semantic understanding.
AI · Neutral · arXiv – CS AI · Apr 6 · 6/10
🧠 Researchers introduce XpertBench, a new benchmark for evaluating Large Language Models on expert-level professional tasks across domains like finance, healthcare, and legal services. Even top-performing LLMs achieve only ~66% success rates, revealing a significant 'expert-gap' in current AI systems' ability to handle complex professional work.
AI · Neutral · arXiv – CS AI · Mar 27 · 6/10
🧠 Researchers introduce a new nonparametric method called signed isotonic R² for efficiently detecting problematic items in AI benchmarks and assessments. The method outperforms traditional diagnostic techniques across major AI datasets including GSM8K and MMLU, offering a lightweight solution for improving evaluation quality.
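The paper's exact formulation isn't given above, but one plausible reading of a "signed isotonic R²" item diagnostic is: fit a monotone (isotonic) curve of item correctness against a proxy ability score, compute R² for the best nondecreasing and best nonincreasing fits, and sign the result by the winning direction, so strongly negative values flag items that get easier as ability falls. A sketch under those assumptions (all names hypothetical, not the paper's code):

```python
import numpy as np

def pava(y):
    """Pool Adjacent Violators: least-squares nondecreasing fit to y."""
    blocks = []  # each block is [mean, count]
    for v in np.asarray(y, dtype=float):
        blocks.append([v, 1])
        # merge adjacent blocks whenever monotonicity is violated
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, c2 = blocks.pop()
            m1, c1 = blocks.pop()
            blocks.append([(m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2])
    return np.concatenate([[m] * c for m, c in blocks])

def signed_isotonic_r2(ability, correct):
    """R² of the better monotone fit, negated if that fit is decreasing."""
    y = np.asarray(correct, dtype=float)[np.argsort(ability)]
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2_up = 1 - np.sum((y - pava(y)) ** 2) / ss_tot        # nondecreasing
    r2_down = 1 - np.sum((y + pava(-y)) ** 2) / ss_tot     # nonincreasing
    return r2_up if r2_up >= r2_down else -r2_down
```

Because isotonic regression assumes only monotonicity rather than a parametric item-response curve, a screen like this stays lightweight: items with near-zero or negative values are candidates for manual review.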