y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#model-evaluation News & Analysis

67 articles tagged with #model-evaluation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

67 articles
AINeutralarXiv – CS AI · 6d ago6/10
🧠

DISSECT: Diagnosing Where Vision Ends and Language Priors Begin in Scientific VLMs

Researchers introduce DISSECT, a 12,000-question diagnostic benchmark that reveals a critical "perception-integration gap" in Vision-Language Models—where VLMs successfully extract visual information but fail to reason about it during downstream tasks. Testing 18 VLMs across Chemistry and Biology shows open-source models systematically struggle with integrating visual input into reasoning, while closed-source models demonstrate superior integration capabilities.

AIBullisharXiv – CS AI · Apr 76/10
🧠

VERT: Reliable LLM Judges for Radiology Report Evaluation

Researchers introduced VERT, a new LLM-based metric for evaluating radiology reports that shows up to 11.7% better correlation with radiologist judgments compared to existing methods. The study demonstrates that fine-tuned smaller models can achieve significant performance gains while reducing inference time by up to 37.2 times.

AINeutralarXiv – CS AI · Apr 76/10
🧠

Discovering Failure Modes in Vision-Language Models using RL

Researchers developed an AI framework using reinforcement learning to automatically discover failure modes in vision-language models without human intervention. The system trains a questioner agent that generates adaptive queries to expose weaknesses, successfully identifying 36 novel failure modes across various VLM combinations.

AIBearisharXiv – CS AI · Apr 66/10
🧠

DeltaLogic: Minimal Premise Edits Reveal Belief-Revision Failures in Logical Reasoning Models

Researchers introduce DeltaLogic, a new benchmark that tests AI models' ability to revise their logical conclusions when presented with minimal changes to premises. The study reveals that language models like Qwen and Phi-4 struggle with belief revision even when they perform well on initial reasoning tasks, showing concerning inertia patterns where models fail to update conclusions when evidence changes.

AIBullisharXiv – CS AI · Mar 276/10
🧠

TRAJEVAL: Decomposing Code Agent Trajectories for Fine-Grained Diagnosis

Researchers introduce TRAJEVAL, a diagnostic framework that breaks down AI code agent performance into three stages (search, read, edit) to identify specific failure points rather than just binary pass/fail outcomes. The framework analyzed 16,758 trajectories and found that real-time feedback based on trajectory signals improved state-of-the-art models by 2.2-4.6 percentage points while reducing costs by 20-31%.

🧠 GPT-5
AINeutralarXiv – CS AI · Mar 126/10
🧠

Empathy Is Not What Changed: Clinical Assessment of Psychological Safety Across GPT Model Generations

A clinical study analyzing OpenAI's GPT models found that empathy levels remained statistically unchanged across GPT-4o, o4-mini, and GPT-5-mini generations, despite user claims of 'lost empathy.' The real change was in safety posture: newer models improved crisis detection but became more cautious with advice, creating a trade-off that affects vulnerable users.

🏢 OpenAI🧠 GPT-4🧠 GPT-5
AINeutralarXiv – CS AI · Mar 116/10
🧠

Rescaling Confidence: What Scale Design Reveals About LLM Metacognition

Research reveals that LLMs heavily concentrate their confidence scores on just three round numbers when using standard 0-100 scales, with over 78% of responses showing this pattern. The study demonstrates that using a 0-20 confidence scale significantly improves metacognitive efficiency compared to the conventional 0-100 format.

AINeutralarXiv – CS AI · Mar 116/10
🧠

OPENXRD: A Comprehensive Benchmark Framework for LLM/MLLM XRD Question Answering

Researchers introduced OPENXRD, a comprehensive benchmarking framework for evaluating large language models and multimodal LLMs in crystallography question answering. The study tested 74 state-of-the-art models and found that mid-sized models (7B-70B parameters) benefit most from contextual materials, while very large models often show saturation or interference.

🧠 GPT-4🧠 GPT-4.5🧠 GPT-5
AIBearishDecrypt · Mar 106/10
🧠

There's a Benchmark Test That Measures AI 'Bullshit'—Most Models Fail

BullshitBench, a new benchmark test, evaluates AI models' ability to detect nonsensical questions versus confidently providing incorrect answers. The results show most AI models fail this test, highlighting a significant reliability issue in current AI systems.

There's a Benchmark Test That Measures AI 'Bullshit'—Most Models Fail
AIBullisharXiv – CS AI · Mar 66/10
🧠

What Is Missing: Interpretable Ratings for Large Language Model Outputs

Researchers introduce the What Is Missing (WIM) rating system for Large Language Models that uses natural-language feedback instead of numerical ratings to improve preference learning. WIM computes ratings by analyzing cosine similarity between model outputs and judge feedback embeddings, producing more interpretable and effective training signals with fewer ties than traditional rating methods.

AIBullisharXiv – CS AI · Mar 37/107
🧠

CARE: Confounder-Aware Aggregation for Reliable LLM Evaluation

Researchers introduce CARE, a new framework for improving LLM evaluation by addressing correlated errors in AI judge ensembles. The method separates true quality signals from confounding factors like verbosity and style preferences, achieving up to 26.8% error reduction across 12 benchmarks.

AIBullisharXiv – CS AI · Mar 37/108
🧠

LitBench: A Graph-Centric Large Language Model Benchmarking Tool For Literature Tasks

Researchers have introduced LitBench, a new benchmarking tool designed to develop and evaluate domain-specific large language models for literature-related tasks. The tool uses graph-centric data curation to generate domain-specific literature sub-graphs and creates training datasets, with results showing small domain-specific LLMs achieving competitive performance against state-of-the-art models like GPT-4o.

AIBearisharXiv – CS AI · Mar 37/107
🧠

Reverse CAPTCHA: Evaluating LLM Susceptibility to Invisible Unicode Instruction Injection

Researchers developed 'Reverse CAPTCHA,' a framework that tests how large language models respond to invisible Unicode-encoded instructions embedded in normal text. The study found that AI models can follow hidden instructions that humans cannot see, with tool use dramatically increasing compliance rates and different AI providers showing distinct preferences for encoding schemes.

AINeutralarXiv – CS AI · Mar 36/107
🧠

When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation

Researchers fine-tuned the Llama 2 7B model using real patient-doctor interaction transcripts to improve medical query responses, but found significant discrepancies between automatic similarity metrics and GPT-4 evaluations. The study highlights the challenges in evaluating AI medical models and recommends human medical expert review for proper validation.

AINeutralarXiv – CS AI · Mar 37/107
🧠

A Comprehensive Evaluation of LLM Unlearning Robustness under Multi-Turn Interaction

Researchers found that machine unlearning in large language models, which aims to remove specific training data influence, is less effective in interactive settings than previously thought. Knowledge that appears forgotten in static tests can often be recovered through multi-turn conversations and self-correction interactions.

AIBearisharXiv – CS AI · Mar 36/106
🧠

Knowledge without Wisdom: Measuring Misalignment between LLMs and Intended Impact

Research reveals that leading foundation models (LLMs) perform poorly on real-world educational tasks despite excelling on AI benchmarks. The study found that 50% of misalignment errors are shared across models due to common pretraining approaches, with model ensembles actually worsening performance on learning outcomes.

AIBearisharXiv – CS AI · Mar 36/108
🧠

LLM Self-Explanations Fail Semantic Invariance

Research reveals that Large Language Model (LLM) self-explanations fail semantic invariance testing, showing that AI models' self-reports change based on how tasks are framed rather than actual task performance. Four frontier AI models demonstrated unreliable self-reporting when faced with semantically different but functionally identical tool descriptions, raising questions about using model self-reports as evidence of capability.

AINeutralarXiv – CS AI · Mar 37/108
🧠

SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond

Researchers introduce SafeSci, a comprehensive framework for evaluating safety in large language models used for scientific applications. The framework includes a 0.25M sample benchmark and 1.5M sample training dataset, revealing critical vulnerabilities in 24 advanced LLMs while demonstrating that fine-tuning can significantly improve safety alignment.

AINeutralarXiv – CS AI · Mar 36/103
🧠

FaithCoT-Bench: Benchmarking Instance-Level Faithfulness of Chain-of-Thought Reasoning

Researchers introduce FaithCoT-Bench, the first comprehensive benchmark for detecting unfaithful Chain-of-Thought reasoning in large language models. The benchmark includes over 1,000 expert-annotated trajectories across four domains and evaluates eleven detection methods, revealing significant challenges in identifying unreliable AI reasoning processes.

AINeutralarXiv – CS AI · Mar 36/104
🧠

To Think or Not To Think, That is The Question for Large Reasoning Models in Theory of Mind Tasks

A research study of nine advanced Large Language Models reveals that Large Reasoning Models (LRMs) do not consistently outperform non-reasoning models on Theory of Mind tasks, which assess social cognition abilities. The study found that longer reasoning often hurts performance and models rely on shortcuts rather than genuine deduction, suggesting formal reasoning advances don't transfer to social reasoning tasks.

AIBullisharXiv – CS AI · Mar 36/102
🧠

Spilled Energy in Large Language Models

Researchers developed a training-free method to detect AI hallucinations by reinterpreting LLM output as Energy-Based Models and tracking 'energy spills' during text generation. The approach successfully identifies factual errors and biases across multiple state-of-the-art models including LLaMA, Mistral, and Gemma without requiring additional training or probe classifiers.

AINeutralarXiv – CS AI · Mar 36/104
🧠

GraphUniverse: Synthetic Graph Generation for Evaluating Inductive Generalization

Researchers introduce GraphUniverse, a new framework for generating synthetic graph families to evaluate how AI models generalize to unseen graph structures. The study reveals that strong performance on single graphs doesn't predict generalization ability, highlighting a critical gap in current graph learning evaluation methods.

← PrevPage 2 of 3Next →