AIBullisharXiv – CS AI · 4d ago7/10
🧠Researchers present a unified evaluation framework for assessing LLM agentic capabilities, integrating 7 benchmarks across 24 domains with standardized testing methodology. The framework disentangles intrinsic model performance from implementation artifacts, revealing that scaffold choices and environmental volatility significantly impact benchmark results across 15 models tested.
🏢 Meta🏢 Hugging Face
AIBearisharXiv – CS AI · 5d ago7/10
🧠Researchers introduced LiveK12Bench, a dynamic benchmark for evaluating Large Multimodal Models on realistic high school examinations across multiple disciplines. The study reveals that advanced LMMs like GPT-4 experience significant performance degradation when subjected to exam-realistic constraints, dropping from 79 to 53 points when process rigor and efficiency are jointly evaluated, exposing critical gaps between theoretical capabilities and practical educational readiness.
🧠 GPT-5
AIBearisharXiv – CS AI · 5d ago7/10
🧠Researchers reveal that AI models can possess stable factual knowledge while failing dramatically at compositional reasoning—assembling facts into logical chains—a problem invisible to standard benchmark metrics. The study introduces a diagnostic protocol showing post-training improvements mask directional shifts in composition capability, with failures often rooted in generation-time constraints rather than fundamental model limitations.
AINeutralarXiv – CS AI · 5d ago7/10
🧠Researchers introduce BeQu, a new benchmark that evaluates LLM knowledge through open-ended prompts rather than predefined questions, addressing availability bias in existing benchmarks. The paradigm shift from narrow question-answering to characterizing naturally expressed knowledge provides deeper insights into parametric knowledge across 10,000 entities and multiple language models.
AINeutralarXiv – CS AI · May 117/10
🧠Researchers introduce a scenario-grounded benchmark for evaluating large language models in scientific discovery, revealing significant performance gaps compared to general science benchmarks. The framework tests LLMs across biology, chemistry, materials, and physics through project-level tasks involving hypothesis generation and experimental design, showing that current models remain distant from achieving general scientific superintelligence despite demonstrating promise in specific applications.
AINeutralarXiv – CS AI · Apr 157/10
🧠Researchers propose a cognitive diagnostic framework that evaluates large language models across fine-grained ability dimensions rather than aggregate scores, enabling targeted model improvement and task-specific selection. The approach uses multidimensional Item Response Theory to estimate abilities across 35 dimensions for mathematics and generalizes to physics, chemistry, and computer science with strong predictive accuracy.
AINeutralarXiv – CS AI · Apr 147/10
🧠Researchers introduce General365, a benchmark revealing that leading LLMs achieve only 62.8% accuracy on general reasoning tasks despite excelling in domain-specific domains. The findings highlight a critical gap: current AI models rely heavily on specialized knowledge rather than developing robust, transferable reasoning capabilities applicable to real-world scenarios.
AIBearisharXiv – CS AI · Apr 107/10
🧠Researchers reveal that Large Language Models exhibit self-preference bias when evaluating other LLMs, systematically favoring outputs from themselves or related models even when using objective rubric-based criteria. The bias can reach 50% on objective benchmarks and 10-point score differences on subjective medical benchmarks, potentially distorting model rankings and hindering AI development.
AINeutralarXiv – CS AI · Mar 177/10
🧠Researchers introduce AVA-Bench, a new benchmark that evaluates vision foundation models (VFMs) by testing 14 distinct atomic visual abilities like localization and depth estimation. This approach provides more precise assessment than traditional VQA benchmarks and reveals that smaller 0.5B language models can evaluate VFMs as effectively as 7B models while using 8x fewer GPU resources.
AINeutralarXiv – CS AI · 3d ago6/10
🧠Researchers introduce JMed48k, a comprehensive Japanese medical licensing benchmark containing 48,862 exam questions and 20,142 images to evaluate vision-language models across 11 healthcare professions. Testing 21 models reveals significant disparities in how effectively different AI systems leverage visual information, with proprietary models gaining substantially from images while medical-specific systems show limited visual utilization.
AINeutralarXiv – CS AI · 4d ago6/10
🧠Researchers introduce ClinPivot, a benchmark testing whether clinical AI models adjust treatment decisions when patient contexts change. The study reveals that strong medical QA performance does not correlate with sound clinical decision-making, with leading models often failing to modify treatment choices appropriately when clinical constraints shift.
AINeutralarXiv – CS AI · 4d ago6/10
🧠Researchers propose a replication-first paradigm for evaluating subjective LLM behaviors like empathy and restraint, using four orthogonal validation properties instead of single human-rater consensus. Testing across 49 models reveals that aggregate performance scores mask significant regressions in specific behavioral dimensions, such as gpt-5's 1.87-point decline in advice-restraint compared to gpt-4.1.
🧠 GPT-4🧠 GPT-5
AINeutralarXiv – CS AI · 5d ago6/10
🧠Researchers introduced OCR-Reasoning, a new benchmark with 1,069 annotated examples to evaluate how well multimodal AI models handle text-rich image reasoning tasks. The evaluation revealed that even the most advanced models fail to exceed 50% accuracy, indicating significant gaps in this critical capability area.
AINeutralarXiv – CS AI · 5d ago6/10
🧠Researchers introduce GICDM, an improved method for evaluating generative models that corrects the hubness phenomenon—a distortion in high-dimensional spaces that skews distance-based metrics and nearest-neighbor relationships. The technique builds on classical ICDM and includes multi-scale extensions, demonstrating improved alignment with human assessment across synthetic and real benchmarks.
AIBullisharXiv – CS AI · May 126/10
🧠A new study challenges whether standard LLM benchmarks accurately measure hallucination detection performance. By having human adjudicators re-evaluate conflicting cases between original annotations and model predictions, researchers found that LLMs frequently made correct judgments that human annotators initially missed, suggesting single-pass human annotation may be insufficient for complex, ambiguous tasks.
🧠 GPT-5🧠 Gemini
AINeutralarXiv – CS AI · May 16/10
🧠Researchers propose a novel rule-generation approach to evaluate compositionality in large language models, addressing critical limitations in existing assessment methods that lack explainability and suffer from dataset partition leakage. This new framework requires LLMs to generate executable programs as rules for data mapping, providing more robust insights into how well these models generalize compositional concepts.
AINeutralarXiv – CS AI · May 16/10
🧠Researchers introduce MEDS (Math Education Digital Shadows), a dataset of 28,000 personas from 14 LLMs designed to evaluate how language models reason about mathematics and report their confidence levels. The dataset integrates math proficiency with psychological measures like anxiety and self-efficacy, revealing that LLMs exhibit human-like biases including negative attitudes and overconfidence in mathematical reasoning.
🧠 Grok
AINeutralarXiv – CS AI · May 16/10
🧠Researchers introduce RPC-Bench, a large-scale benchmark containing 15,000 human-verified question-answer pairs designed to evaluate how well AI models understand research papers. Testing reveals that even the strongest models like GPT-5 achieve only 68.2% accuracy on comprehension tasks, dropping significantly when conciseness is factored in, exposing critical gaps in academic document understanding.
🧠 GPT-5
AINeutralarXiv – CS AI · Apr 206/10
🧠Researchers introduce MTR-DuplexBench, a new evaluation framework for Full-Duplex Speech Language Models that enables real-time overlapping conversations. The benchmark addresses critical gaps by assessing multi-round interactions across conversational quality, instruction-following, and safety dimensions, revealing that current FD-SLMs struggle with consistency across multiple communication rounds.
AINeutralarXiv – CS AI · Apr 156/10
🧠Researchers propose Filtered Reasoning Score (FRS), a new evaluation metric that assesses the quality of reasoning in large language models beyond simple accuracy metrics. FRS focuses on the model's most confident reasoning traces, evaluating dimensions like faithfulness and coherence, revealing significant performance differences between models that appear identical under traditional accuracy benchmarks.
AIBullisharXiv – CS AI · Apr 136/10
🧠Researchers introduce BERT-as-a-Judge, a lightweight alternative to LLM-based evaluation methods that assesses generative model outputs with greater accuracy than lexical approaches while requiring significantly less computational overhead. The method demonstrates that existing lexical evaluation techniques poorly correlate with human judgment across 36 models and 15 tasks, establishing a practical middle ground between rigid rule-based and expensive LLM-judge evaluation paradigms.
AIBullisharXiv – CS AI · Apr 76/10
🧠Researchers developed DualJudge, a new framework for evaluating large language models that combines structured Fuzzy Analytic Hierarchy Process (FAHP) with traditional direct scoring methods. The approach addresses inconsistent LLM evaluation by incorporating uncertainty-aware reasoning and achieved state-of-the-art performance on JudgeBench testing.
AINeutralarXiv – CS AI · Mar 276/10
🧠Researchers introduce RubricEval, the first rubric-level meta-evaluation benchmark for assessing how well AI judges evaluate instruction-following in large language models. Even advanced models like GPT-4o achieve only 55.97% accuracy on the challenging subset, highlighting significant gaps in AI evaluation reliability.
🧠 GPT-4
AINeutralarXiv – CS AI · Mar 126/10
🧠Researchers have developed the System Hallucination Scale (SHS), a human-centered tool for evaluating hallucination behavior in large language models. The instrument showed strong statistical validity in testing with 210 participants and provides a practical method for assessing AI model reliability from a user perspective.
AINeutralarXiv – CS AI · Mar 36/109
🧠Researchers propose a tensor factorization method that combines cheap automated evaluation data with limited human labels to enable fine-grained evaluation of AI generative models. The approach addresses the data bottleneck in model evaluation by using autorater scores to pretrain representations that are then aligned to human preferences with minimal calibration data.