AI · Neutral · arXiv – CS AI · 3d ago · 7/10
🧠Researchers introduce AutoLab, a benchmark testing whether frontier AI models can solve complex, multi-step engineering tasks over extended time horizons. Testing 17 state-of-the-art models reveals that persistence and iterative refinement—not initial quality—predict success, with most models failing to sustain long-horizon optimization despite their capabilities.
AI · Neutral · arXiv – CS AI · May 9 · 7/10
🧠Researchers propose Dynamic Boundary Evaluation (DBE), a new methodology for assessing large language models that adapts to each model's capability level rather than applying fixed benchmarks. The approach identifies performance boundaries where models achieve ~50% accuracy and calibrates them on a unified difficulty scale, addressing limitations in traditional evaluation that produce ceiling and floor effects masking true capability gaps.
AI · Neutral · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers introduce AgencyBench, a comprehensive benchmark for evaluating autonomous AI agents across 32 real-world scenarios requiring up to 1 million tokens and 90 tool calls. The evaluation reveals closed-source models like Claude significantly outperform open-source alternatives (48.4% vs 32.1%), with notable performance variations based on execution frameworks and model optimization.
🧠 Claude
AI · Neutral · arXiv – CS AI · Mar 26 · 7/10
🧠Researchers propose a new method called coupled autoregressive generation to evaluate large language models more efficiently by controlling for randomness in their responses. The study shows this approach can reduce evaluation samples by up to 75% while revealing that current model rankings may be confounded by inherent randomness in generation processes.
🧠 Llama
AI · Neutral · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers introduced WebCoderBench, the first comprehensive benchmark for evaluating web application generation by large language models, featuring 1,572 real-world user requirements and 24 evaluation metrics. The benchmark tests 12 representative LLMs and shows no single model dominates across all metrics, providing opportunities for targeted improvements.
AI · Neutral · arXiv – CS AI · Mar 9 · 7/10
🧠Researchers introduce AdAEM, a new evaluation algorithm that automatically generates test questions to better assess value differences and biases across Large Language Models. Unlike static benchmarks, AdAEM adaptively creates controversial topics that reveal more distinguishable insights about LLMs' underlying values and cultural alignment.
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠Researchers introduced TensorBench, a 199-task benchmark for evaluating coding agents on a PyTorch-based tensor framework, addressing the trade-off between task difficulty and evaluation reliability in repository-level coding benchmarks. Testing seven frontier AI models revealed significant performance variation, with pass rates ranging from 22.1% to 64.8%, suggesting distinct strengths across different coding agent architectures.
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠Researchers introduce FALSIFYBENCH, an evaluation framework that tests whether large language models can perform inductive reasoning through hypothesis-driven discovery tasks. Testing 12 LLMs reveals that reasoning models outperform instruction-tuned models, with success primarily driven by the ability to actively falsify hypotheses rather than confirm them.
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠Researchers introduce KINA, a new 899-item benchmark for evaluating large language models across 261 disciplines, addressing methodological issues in existing knowledge benchmarks. The study evaluates 42 models with formal guarantees on representativeness and ranking stability, revealing a tiered performance structure with Gemini-3.1-Pro-Preview leading at 53.17% accuracy.
🧠 GPT-5 🧠 Claude 🧠 Gemini
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠Researchers introduced GTBench, a curriculum-based benchmark with 63 graph theory problems designed to evaluate LLMs as mathematical research assistants. Testing five frontier models revealed significant performance gaps, with GPT-5 substantially outperforming competitors on advanced proofs while all models struggled with graduate-level reasoning, raising concerns about AI reliability in technical education and research.
🧠 GPT-5 🧠 Claude 🧠 Sonnet
AI · Neutral · arXiv – CS AI · 5d ago · 5/10
🧠A new study comparing machine learning approaches for churn prediction finds that traditional methods like Random Forests and XGBoost outperform advanced deep learning models in predictive accuracy and efficiency while requiring fewer computational resources. The research challenges the assumption that complex temporal models are always superior for time-series classification tasks in customer retention.
AI · Neutral · arXiv – CS AI · 6d ago · 6/10
🧠Researchers introduce CodeGolf Bench, a new benchmark for evaluating Large Language Models' ability to generate concise code across 60 programming languages. The study reveals that reasoning-capable models significantly outperform standard LLMs, reaching an average percentile score of 70.97% on code golf tasks and excelling particularly in languages with strict syntax requirements.
AI · Neutral · arXiv – CS AI · 6d ago · 6/10
🧠Researchers introduce SURE, a unified experimentation framework that standardizes evaluation metrics and training pipelines for speech understanding models, addressing reproducibility challenges that have hindered fair comparison of speech foundation models and Speech LLMs across different deployment scenarios.
AI · Neutral · arXiv – CS AI · 6d ago · 6/10
🧠Researchers conducted a comprehensive meta-study evaluating the robustness of multilingual text embedding models across 230+ languages using the MTEB benchmark platform. The analysis reveals that LLM-based models show task-specific strengths but few models consistently perform well across all tasks and evaluation methods, highlighting how benchmarking conclusions depend heavily on dataset composition and aggregation methodology choices.
AI · Neutral · arXiv – CS AI · 6d ago · 6/10
🧠Researchers introduce FEM-Bench, a scientific reasoning benchmark designed to evaluate large language models' ability to generate correct finite element method (FEM) code for computational mechanics problems. Despite the simplicity of introductory-level tasks, current state-of-the-art LLMs show inconsistent performance, with Gemini 3 Pro completing 30/33 tasks at least once and GPT-5 achieving 73.8% success on unit test writing.
🧠 GPT-5 🧠 Gemini
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers evaluated metacognitive monitoring across 33 frontier LLMs using 47,151 MMLU benchmark items, finding significant domain-level variation that aggregate performance scores mask. Applied/Professional knowledge domains showed consistently strong self-monitoring (AUROC 0.742), while Formal Reasoning and Natural Science proved most challenging, with implications for targeted model deployment.
🏢 OpenAI 🏢 Anthropic 🧠 Gemini
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠A comprehensive eight-week study evaluated 68 HTML generations from four major LLM families (GPT, Gemini, Grok, Claude) in standardized web generation tasks, finding Claude delivered the most consistent performance while questioning assumptions about reasoning time and social media predictability. The research also reveals significant evaluation bias in LLM-as-judge systems and finds that code verbosity correlates more with model architecture than with prompt specificity.
🧠 Claude 🧠 Gemini 🧠 Grok
AI · Neutral · arXiv – CS AI · May 11 · 5/10
🧠Researchers compared ensemble machine learning techniques for predicting obesity risk, finding that ensemble stacking with a neural network meta-classifier outperformed hybrid voting methods, particularly on complex datasets. The study evaluated nine ML algorithms across 50 hyperparameter configurations, demonstrating that stacking achieves superior accuracy (up to 98.98%) for healthcare predictive modeling.
AI · Bearish · arXiv – CS AI · May 1 · 6/10
🧠A comprehensive study comparing 12 large language models against 4 classical classifiers for automating evidence screening in software engineering systematic literature reviews reveals that LLMs exhibit significant performance variability and lack consistent superiority over traditional methods. The research emphasizes that abstract availability is critical for LLM performance, while title and keywords provide minimal additional value, suggesting LLM adoption should be driven by operational constraints rather than performance guarantees.
🏢 OpenAI 🏢 Anthropic 🧠 Gemini
AI · Neutral · arXiv – CS AI · May 1 · 6/10
🧠Researchers adapted clinical psychology's Reliable Change Index to evaluate LLM performance across model versions, revealing that aggregate accuracy gains mask substantial item-level volatility. Testing Llama 3→3.1 and Qwen 2.5→3 showed bidirectional changes with large effect sizes, where improvements in low-accuracy domains offset deteriorations in high-accuracy ones, suggesting current evaluation methods underestimate model instability.
🧠 Llama
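For readers unfamiliar with the Reliable Change Index mentioned above, the following is a minimal sketch of the classical clinical-psychology formula (Jacobson and Truax), not the paper's adaptation, which the summary does not describe. The function names, the 1.96 threshold default, and all numbers in the usage note are illustrative assumptions.

```python
import math


def reliable_change_index(score_before, score_after, sd_baseline, reliability):
    """Classical RCI: the observed change divided by the standard error of the difference.

    sd_baseline is the standard deviation of baseline scores, and reliability is
    the test-retest reliability of the measure (0 <= r < 1). Both are inputs the
    caller must estimate; nothing here is taken from the summarized paper.
    """
    if sd_baseline <= 0:
        raise ValueError("sd_baseline must be positive")
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in [0, 1)")
    # Standard error of measurement for a single score.
    sem = sd_baseline * math.sqrt(1.0 - reliability)
    # Standard error of the difference between two measurements.
    se_diff = math.sqrt(2.0) * sem
    return (score_after - score_before) / se_diff


def classify_change(rci, threshold=1.96):
    """Label a change as reliable when |RCI| exceeds the threshold (1.96 is about 95%)."""
    if rci > threshold:
        return "reliable improvement"
    if rci < -threshold:
        return "reliable deterioration"
    return "no reliable change"
```

With hypothetical inputs, `classify_change(reliable_change_index(0.60, 0.75, 0.10, 0.80))` labels a 15-point accuracy gain as a reliable improvement, while the same call with `score_after=0.55` reports no reliable change. This shows how a shift can be too small to distinguish from measurement noise.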
AI · Neutral · arXiv – CS AI · Apr 15 · 6/10
🧠Researchers propose League of LLMs (LOL), a benchmark-free evaluation framework that uses mutual peer assessment among multiple LLMs to overcome data contamination and evaluation bias issues. Testing on eight mainstream models reveals 70.7% ranking consistency while uncovering model-specific behaviors like memorization patterns and family-based scoring bias in OpenAI models.
🏢 OpenAI
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠SRBench introduces a comprehensive evaluation framework for Sequential Recommendation models that combines Large Language Models with traditional neural network approaches. The benchmark addresses critical gaps in existing evaluation methodologies by incorporating fairness, stability, and efficiency metrics alongside accuracy, while establishing fair comparison mechanisms between LLM-based and neural network-based recommendation systems.
🏢 Meta
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers have developed a comprehensive evaluation framework for Large Language Models applied to outpatient referral systems in healthcare, revealing that LLMs offer limited advantages over simpler BERT-like models in static referral tasks but demonstrate potential in interactive dialogue scenarios. The study addresses the absence of standardized evaluation criteria for assessing LLM effectiveness in dynamic healthcare settings.
AI · Neutral · arXiv – CS AI · Mar 16 · 6/10
🧠Researchers have launched LLM BiasScope, an open-source web application that enables real-time bias analysis and side-by-side comparison of outputs from major language models including Google Gemini, DeepSeek, and Meta Llama. The platform uses a two-stage bias detection pipeline and provides interactive visualizations to help researchers and practitioners evaluate bias patterns across different AI models.
🏢 Hugging Face 🧠 Gemini 🧠 Llama
AI · Bearish · MIT News – AI · Feb 9 · 6/10 · 7
🧠A new study reveals that online platforms ranking large language models (LLMs) can produce unreliable results, with rankings significantly changing when just a small portion of crowdsourced data is removed. This highlights potential vulnerabilities in how AI model performance is evaluated and compared publicly.