AI · Bullish · arXiv – CS AI · Jun 5 · 7/10
🧠Researchers introduce Benchmark Agent, an autonomous AI system that automates the creation of machine learning benchmarks to address labor-intensive construction and performance saturation issues. The framework successfully generated 15 diverse benchmarks across text and multimodal understanding tasks, demonstrating that continually evolving benchmarks can accelerate LLM and MLLM development with minimal human oversight.
AI · Bullish · arXiv – CS AI · Jun 5 · 7/10
🧠Researchers at the Max Planck Institute compiled 100 research-level mathematics questions to benchmark large language models' reasoning capabilities. After three evaluation stages, only 2 questions remained unsolved by advanced LLMs, indicating significant progress in AI mathematical reasoning.
AI · Neutral · arXiv – CS AI · Jun 2 · 7/10
🧠Researchers introduce ReasonBENCH, a comprehensive benchmark revealing that LLM reasoning systems exhibit significant performance variance across repeated executions, with the best-performing strategy winning only 77% of head-to-head comparisons. The study demonstrates that this instability is structured rather than random, challenging the validity of single-run benchmark scores as reliable indicators of model quality.
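To make the head-to-head figure concrete, here is a minimal sketch of how a pairwise win rate over repeated runs could be computed. The scores and the function name `win_rate` are illustrative assumptions, not ReasonBENCH's actual protocol or data.

```python
from itertools import product


def win_rate(scores_a, scores_b):
    """Fraction of run pairs in which A beats B; a tie counts as half a win."""
    pairs = list(product(scores_a, scores_b))
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a, b in pairs)
    return wins / len(pairs)


# Hypothetical accuracy scores from five repeated runs of two strategies.
strategy_a = [0.71, 0.68, 0.74, 0.66, 0.72]
strategy_b = [0.69, 0.70, 0.65, 0.67, 0.73]

print(win_rate(strategy_a, strategy_b))
```

A single run of each strategy yields only one of these pairs, which is why one score per model can misstate which strategy is better.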
AI · Bullish · arXiv – CS AI · Jun 1 · 7/10
🧠Researchers demonstrate that efficient LLM benchmarking can be substantially improved by treating it as a multiple regression problem with kernel ridge regression and applying minimum redundancy maximum relevance (mRMR) feature selection. The approach achieves lower prediction errors and faster computation than existing methods while maintaining consistency across different data splits.
AI · Neutral · arXiv – CS AI · May 29 · 7/10
🧠Researchers propose Critique-Resilient Benchmarking, a new framework for evaluating large language models when human comprehension of tasks becomes infeasible. The method uses adversarial evaluation where answers are deemed correct if no convincing counterargument exists, allowing meaningful comparison of frontier LLMs even as they saturate traditional benchmarks.
AI · Bearish · arXiv – CS AI · May 28 · 7/10
🧠Researchers introduce PortBench, a comprehensive benchmark for evaluating large language models in portfolio management tasks. The study reveals that 90% of tested LLMs fail to outperform basic equal-weight allocation strategies, highlighting significant gaps between LLM performance on financial QA tasks and real-world portfolio decision-making.
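For readers unfamiliar with the baseline, the sketch below shows what an equal-weight allocation amounts to for a single period. The return figures and helper names are hypothetical and are not taken from PortBench.

```python
def equal_weights(n):
    """Assign the same weight 1/n to each of n assets."""
    return [1.0 / n] * n


def portfolio_return(weights, asset_returns):
    """Weighted sum of per-asset returns for one period."""
    if len(weights) != len(asset_returns):
        raise ValueError("weights and returns must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * r for w, r in zip(weights, asset_returns))


# Hypothetical single-period returns for four assets.
returns = [0.02, -0.01, 0.03, 0.00]
baseline = portfolio_return(equal_weights(4), returns)
print(baseline)
```

A model's allocation beats the baseline only if `portfolio_return` on its weights exceeds `baseline` over the evaluation window.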
AI · Bullish · arXiv – CS AI · May 28 · 7/10
🧠Researchers introduce RAGe, a benchmarking framework designed to optimize Retrieval-Augmented Generation (RAG) applications by evaluating trade-offs between accuracy, efficiency, and scalability. The framework enables developers to identify optimal pipeline components for domain-specific datasets while accounting for hardware constraints, making RAG development more accessible on consumer-grade hardware.
AI · Neutral · arXiv – CS AI · May 27 · 7/10
🧠Researchers have identified significant measurement bias in production LLM benchmarking tools, where single-process architectures and Python's Global Interpreter Lock artificially inflate latency metrics at scale. The study proposes a multi-process evaluation framework and a new normalized metric (NTPOT) to accurately measure LLM serving performance under production-level concurrency.
AI · Neutral · arXiv – CS AI · May 9 · 7/10
🧠Researchers introduce a framework for evaluating whether AI creative systems cause population-level diversity collapse, where individual output quality improves while collective idea similarity increases. Testing three frontier LLMs across creative tasks, the study finds they fall below diversity parity with humans and proposes design interventions to mitigate crowding effects at development time.
AI · Neutral · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers introduce General365, a benchmark revealing that leading LLMs achieve only 62.8% accuracy on general reasoning tasks despite excelling at domain-specific tasks. The findings highlight a critical gap: current AI models rely heavily on specialized knowledge rather than developing robust, transferable reasoning capabilities applicable to real-world scenarios.
AI · Neutral · arXiv – CS AI · Jun 10 · 6/10
🧠Researchers introduce T1-Bench, a comprehensive benchmark for evaluating large language model-based agents across 25 domains with multi-step, multi-domain tasks that better reflect real-world complexity than existing benchmarks. The framework tests 12 models on structured reasoning, tool utilization, and conversational quality, with both automated and human evaluation methods.
AI · Neutral · arXiv – CS AI · Jun 10 · 6/10
🧠Researchers introduce RankLLM, a novel evaluation framework that quantifies both question difficulty and model competency to create more nuanced LLM benchmarks. The system uses bidirectional score propagation between models and questions, achieving 90% agreement with human judgment while outperforming existing methods like Item Response Theory.
AI · Neutral · arXiv – CS AI · Jun 9 · 6/10
🧠Researchers introduce RTL-BenchLS, a large-scale benchmark containing over 10,000 formally verified Verilog designs for evaluating large language models on hardware design tasks. The benchmark addresses limitations of existing datasets through three novel self-supervised tasks beyond specification-to-RTL generation, with top models achieving only 12-28% accuracy, demonstrating substantial room for improvement in LLM-based hardware automation.
AI · Neutral · arXiv – CS AI · Jun 8 · 6/10
🧠Researchers introduced AARRI-Bench, a benchmark suite that evaluates whether frontier large language models and AI agents can conduct research with human-like professionalism and nuance. Even top-performing systems such as Claude Opus 4.7 with Mini-SWE-Agent reached a success rate of only 68.3%, frequently missing subtle but critical details that human researchers would easily catch, which highlights the gap between autonomous research agents and human researchers.
🧠 Claude · 🧠 Opus
AI · Neutral · arXiv – CS AI · Jun 5 · 6/10
🧠Researchers propose a Bayesian hierarchical model with embedding-space clustering to correct fundamental flaws in LLM benchmarking methodology. The approach addresses two critical issues, insufficient evaluation samples and non-independent test prompts, and reduces the mean absolute error of performance metrics by 4-73%, which is particularly relevant for adversarial robustness evaluation.
AI · Neutral · arXiv – CS AI · Jun 4 · 6/10
🧠Researchers introduce KINA, a new 899-item benchmark for evaluating large language models across 261 disciplines, addressing methodological issues in existing knowledge benchmarks. The study evaluates 42 models with formal guarantees on representativeness and ranking stability, revealing a tiered performance structure with Gemini-3.1-Pro-Preview leading at 53.17% accuracy.
🧠 GPT-5 · 🧠 Claude · 🧠 Gemini
AI · Neutral · arXiv – CS AI · Jun 2 · 6/10
🧠Researchers propose a graph-based framework using Maximum Independent Set algorithms to efficiently benchmark large language models by selecting diverse, non-redundant prompt subsets. Testing across 66 LLMs and four major benchmarks demonstrates consistent rankings with 25-48% prompt reduction while maintaining reliability, offering significant computational savings for LLM evaluation.
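The idea of choosing non-redundant prompts via an independent set can be sketched as follows. This is a simple greedy heuristic under assumed inputs (a hypothetical graph whose edges link prompts judged redundant); it returns a maximal independent set, not necessarily a maximum one, and it is not the paper's algorithm.

```python
def greedy_independent_set(n, edges):
    """Greedy maximal independent set over nodes 0..n-1.

    Repeatedly keeps the remaining node with the fewest remaining
    neighbors, then discards that node's neighbors.
    """
    neighbors = {i: set() for i in range(n)}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    remaining = set(range(n))
    chosen = []
    while remaining:
        # Lowest remaining degree first; ties broken by index for determinism.
        node = min(remaining, key=lambda i: (len(neighbors[i] & remaining), i))
        chosen.append(node)
        remaining -= neighbors[node] | {node}
    return sorted(chosen)


# Hypothetical redundancy graph over six prompts.
edges = [(0, 1), (1, 2), (3, 4)]
selected = greedy_independent_set(6, edges)
print(selected)
```

No two selected prompts share an edge, so each kept prompt stands in for the redundant neighbors that were dropped.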
AI · Neutral · arXiv – CS AI · May 29 · 6/10
🧠Researchers introduce LogDx-CI, a benchmark comparing 11 log-reduction tools for debugging CI failures using LLMs. Hybrid grep+tail routers achieve the best cost-quality tradeoff, while agent-loop systems can recover from weak contexts through iterative tool calls, though at higher computational cost.
🏢 OpenAI · 🧠 GPT-5 · 🧠 Claude
AI · Neutral · arXiv – CS AI · May 29 · 6/10
🧠Researchers introduce Code-QA-Bench, an automated framework that generates repository-level code understanding benchmarks while distinguishing genuine code comprehension from documentation recall. Testing four frontier AI models reveals that code access is the primary driver of performance, while documentation provides marginal benefits, suggesting current models excel at code reasoning when source material is available.
AI · Neutral · arXiv – CS AI · May 28 · 6/10
🧠Researchers introduce BenGER, a comprehensive benchmark dataset for evaluating large language models on German legal reasoning tasks, comprising 596 exam-style cases and 531 doctrinal reasoning problems. The study demonstrates that LLM-as-a-Judge frameworks can achieve near-human consistency in legal assessment, with human-AI collaboration substantially outperforming unaided human performance.
AI · Neutral · arXiv – CS AI · May 27 · 6/10
🧠Researchers introduce MeDial-Speech, a new 111+ hour speech dataset for training medical AI systems to conduct patient consultations across four health conditions. The study benchmarks state-of-the-art LLMs including Claude Sonnet 4, GPT-5 mini, and DeepSeek-V3, revealing that while Claude Sonnet 4 achieves 71-75% accuracy in medical dialogue tasks, all models exhibit significant overconfidence in their probabilistic predictions.
🏢 Hugging Face · 🧠 GPT-5 · 🧠 Claude
AI · Neutral · arXiv – CS AI · May 27 · 6/10
🧠Researchers introduced FrontierOR, a benchmark that tests whether leading LLMs can design efficient optimization algorithms for real-world large-scale problems. The evaluation of seven models reveals significant limitations: even frontier models outperform Gurobi (a standard solver) in only 31% of cases, highlighting a substantial gap between LLM capabilities in formulation and practical algorithmic optimization.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduce VeriContest, a benchmark of 946 competitive-programming problems designed to evaluate AI models' ability to generate not just functional code but also formal specifications and machine-checkable proofs. Testing ten state-of-the-art models reveals a dramatic capability gap: while the strongest model achieves 92% accuracy on code generation alone, performance plummets to 48% on specifications, 14% on proofs, and just 5% end-to-end, identifying proof generation as the critical bottleneck for verifiable code generation systems.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce PostEDA-Bench, a hierarchical benchmark for evaluating LLM-based agents in Electronic Design Automation tasks, specifically targeting Design Rule Check (DRC) fixing and Power-Performance-Area (PPA) optimization. Testing eight LLMs across 145 tasks reveals significant performance gaps, with best success rates of 36.66% for complex DRC reasoning and only 20% for multi-objective PPA optimization, indicating substantial room for improvement in AI-assisted chip design automation.
AI · Neutral · arXiv – CS AI · May 11 · 5/10
🧠ENGINEERING Ingegneria Informatica has released EngGPT2MoE-16B-A3B, a 16-billion parameter Mixture of Experts language model that demonstrates competitive or superior performance compared to Italian and international open-source LLMs across multiple benchmarks. The model represents a notable advancement for Italian-language AI capabilities while positioning itself competitively within the global open-source LLM landscape.
🧠 GPT-5 · 🧠 Llama