#llm-evaluation News & Analysis

Over the past month, #llm-evaluation has been the subject of 59 articles, predominantly from arXiv computer science channels, maintaining stable neutral sentiment at 74.6%. Discussion centers on assessment methods for major models including GPT-4, Llama, and Claude, with evaluation frameworks intersecting closely with broader #ai-research and #ai-safety conversations. The topic frequently overlaps with #benchmark and #ai-benchmarking discussions, reflecting ongoing work to standardize how language models are tested and compared. Scan the articles below for coverage of current evaluation approaches and their implications.

sentiment · last 30d (59 articles)

Top sources: arXiv – CS AI · 104

Often co-tagged with: #ai-research #ai-safety #benchmark #ai-benchmarking #machine-learning #benchmarking

Most-discussed entities: GPT-4 · 4, Llama · 4, Claude · 4, GPT-5 · 4, Gemini · 4

170 articles

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

PetroBench: A Benchmark for Large Language Models in Petroleum Engineering

Researchers have developed PetroBench, a comprehensive benchmark for evaluating large language models in petroleum engineering, testing eight mainstream LLMs across 1,200 domain-specific questions. The evaluation reveals significant performance gaps, with leading models achieving 72-74% accuracy overall but struggling particularly with factual discrimination in objective questions, suggesting LLMs need substantial improvement before widespread deployment in critical petroleum industry applications.

🧠 Claude · 🧠 Gemini

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Let the Results Speak: A Replication-First Paradigm for LLM Behavioral Benchmarking

Researchers propose a replication-first paradigm for evaluating subjective LLM behaviors like empathy and restraint, using four orthogonal validation properties instead of single human-rater consensus. Testing across 49 models reveals that aggregate performance scores mask significant regressions in specific behavioral dimensions, such as gpt-5's 1.87-point decline in advice-restraint compared to gpt-4.1.

🧠 GPT-4 · 🧠 GPT-5

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

The Cases LJP Never Sees: Prosecution Decision Prediction for More Complete Criminal Liability Assessment

Researchers introduce Prosecution Decision Prediction (PDP), a new legal AI benchmark that evaluates criminal liability assessment at the prosecutorial review stage rather than post-indictment. The study reveals that state-of-the-art large language models perform substantially worse on PDP tasks than traditional Legal Judgment Prediction, exposing significant gaps in AI's ability to evaluate evidence and apply legal discretion.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

The Point, the Vision and the Text: Does Point Cloud Boost Spatial Reasoning of Large Language Models? A Bias-Controlled Study

Researchers introduced ScanReQA, a new 3D spatial reasoning benchmark that evaluates how well large language models understand spatial concepts across text, 2D vision, and 3D point cloud modalities. The study reveals that current 3D LLMs struggle with binary spatial reasoning and suffer from attention sink phenomena that impair their spatial understanding.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

SuiChat-CN: Benchmarking Contextual Suicide Risk Assessment in Chinese Group Chats

Researchers introduce SuiChat-CN, a Chinese-language benchmark dataset for assessing suicide risk in group chat conversations using AI models. The dataset contains 13,312 contextual segments from Telegram, and the results show that contextual information significantly improves risk detection accuracy compared to isolated message analysis.

AI · Neutral · arXiv – CS AI · 3d ago · 5/10

ChildEval: When large language models meet children's personalities

Researchers introduce ChildEval, a benchmark dataset containing 29K synthesized persona profiles to evaluate how large language models understand and respond to the preferences of children aged 3-6. The work addresses a gap in LLM evaluation by testing whether AI systems can infer and follow child-specific preferences in extended conversations, with results showing that fine-tuning on the benchmark improves child-centered performance.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions

A new study comparing three LLM approaches to mathematical reasoning found that pure chain-of-thought prompting outperforms code execution methods in robustness across problem variations. When math problems were modified with simple changes like different names or numbers, code-based approaches showed greater accuracy drops, challenging the assumption that code execution improves reasoning reliability.

🧠 Claude · 🧠 Haiku

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

VISTA: An End-to-End Benchmark for Visual Spec-to-Web-App Coding Agents

VISTA is a new benchmark for evaluating how well AI agents can generate functional web applications from visual specifications and text descriptions. The benchmark introduces five different testing conditions with varying levels of design detail and technology stack constraints, using manual annotations and multi-modal evaluation metrics to assess both visual fidelity and functional correctness.

AI · Neutral · arXiv – CS AI · 4d ago · 5/10

Plans for Evaluating Structured Generative Search Summaries

Researchers propose a framework for evaluating structured generative search summaries—AI-generated overviews with sections and source citations that appear above traditional web search results. The work outlines plans for implementing and testing this evaluation methodology to assess the quality and reliability of LLM-generated search summaries.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization

Researchers introduce Verus-SpecGym, an evaluation environment for testing whether AI agents can automatically translate informal programming specifications into formal, machine-verifiable code. The benchmark reveals that frontier LLMs like Gemini 3.1 Pro achieve 77.8% accuracy on specification tasks, but generated specs remain brittle and frequently miss edge cases, input constraints, and validation rules that human experts catch.

🧠 Gemini

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

JuICE: A Benchmark for Evaluating LLM-Judge in Identifying Cultural Errors

Researchers introduce JuICE, a multilingual benchmark dataset revealing that current LLM-judges struggle to identify cultural errors in AI-generated responses, achieving an F1 score of only 52%. The study demonstrates that LLMs fail to capture nuanced cultural contexts across diverse regions, suggesting existing evaluation methods inadequately assess cultural appropriateness in global AI deployment.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

When Does Synthetic Patent Data Help? Volume-Fidelity Trade-offs in Low-Resource Multi-Label Classification

Researchers demonstrate that synthetic data generated by LLMs for patent classification shows mixed results, with improvements primarily driven by increased sample volume rather than data quality. The optimal strategy combines 20-30% real data with 70-80% synthetic data, though synthetic corpora can paradoxically harm retrieval performance despite improving classification metrics.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

EconCausal: A Context-Aware Economic Reasoning Benchmark for Large Language Models

Researchers introduced EconCausal, a benchmark dataset of 10,490 annotated economic causal relationships from peer-reviewed studies, revealing that large language models struggle to properly condition predictions on changing contexts—achieving 88% accuracy in fixed scenarios but dropping to 41.3% when context shifts require reversing causal directions.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering and Reasoning

Researchers introduced EpiQAL, the first benchmark for evaluating large language models on epidemiological reasoning tasks. Testing 15 models reveals significant performance gaps in multi-step inference and evidence synthesis, indicating current LLMs struggle with population-level disease analysis despite their general capabilities.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

ORLoopBench: Solver-in-the-Loop Benchmarks for Self-Correction and Behavioral Rationality in Operations Research

Researchers introduce ORLoopBench, a benchmark suite that evaluates large language models on Operations Research tasks through an iterative solver-in-the-loop process rather than one-shot code generation. The framework enables models to debug infeasible mathematical models by inspecting constraint conflicts and repairing formulations, with an 8B model achieving 95.3% success on LP repair tasks—outperforming frontier APIs at 92.4%.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Constructing Industrial-Scale Optimization Modeling Benchmark

Researchers introduce MIPLIB-NL, a benchmark dataset of 223 industrial-scale optimization problems derived from real mixed-integer linear programs. The benchmark bridges natural-language problem descriptions with executable solver code, addressing a critical gap in evaluating large language models on realistic optimization tasks with thousands to millions of variables and constraints.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History

Researchers introduced Persona2Web, the first benchmark for evaluating personalized web agents that can infer user preferences from historical behavior rather than explicit instructions. The framework tests how large language models handle ambiguous queries by leveraging user context, addressing a critical gap in current web agent capabilities.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

Researchers introduce OmniToM, a new benchmark for evaluating Theory of Mind capabilities in large language models by requiring explicit modeling of belief structures rather than just final answers. The benchmark reveals that current LLMs struggle with tracking actor-specific beliefs and understanding knowledge access, exposing fundamental limitations in social reasoning despite high performance on traditional end-point question answering tasks.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules

Researchers introduce DiagnosticIQ, a benchmark dataset of 6,690 expert-validated questions testing whether large language models can recommend maintenance actions based on industrial sensor rules. Evaluation of 29 LLMs reveals that while frontier models perform well on standard tasks, they exhibit significant brittleness—losing 13-60% accuracy under minor perturbations and pattern-matching rather than reasoning when conditions are inverted.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning

Researchers introduce a strategy-level evaluation framework for large language models on mathematical reasoning tasks, revealing a significant gap between high answer accuracy and actual reasoning flexibility. While frontier models achieve 95-100% accuracy on single-solution prompts, they recover substantially fewer problem-solving strategies than human references when asked to generate multiple approaches, with only 39-71% coverage depending on the model and iteration count.

🧠 Claude · 🧠 Gemini

AI · Neutral · arXiv – CS AI · May 12 · 6/10

TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning

Researchers introduce TIDE-Bench, a comprehensive evaluation benchmark for tool-integrated reasoning (TIR) that assesses how well large language models leverage external tools. The benchmark addresses critical gaps in existing evaluations by combining traditional tasks with novel experimental design and interactive scenarios, measuring not just accuracy but also tool efficiency and inference costs.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

PDEAgent-Bench: A Multi-Metric, Multi-Library Benchmark for PDE Solver Generation

Researchers introduced PDEAgent-Bench, the first comprehensive benchmark for evaluating AI systems that generate numerical solvers from partial differential equations (PDEs). The benchmark contains 645 test cases across multiple PDE families and finite-element libraries, revealing that while current LLMs can produce runnable code, they substantially fail when accuracy and efficiency requirements are enforced.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

The Metacognitive Probe: Five Behavioural Calibration Diagnostics for LLMs

Researchers introduce the Metacognitive Probe, a diagnostic tool measuring five dimensions of LLM confidence behavior including calibration, epistemic vigilance, and reasoning validation. Testing on eight frontier models and 69 humans reveals significant within-model disparities—exemplified by Gemini 2.5 Flash scoring 88 on confidence calibration but only 41 on difficulty prediction—suggesting composite benchmarks mask pockets of overconfidence.

🧠 Gemini

AI · Neutral · arXiv – CS AI · May 12 · 6/10

FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models

Researchers introduce FormalRewardBench, the first benchmark for evaluating reward models in formal theorem proving using Lean 4. The benchmark reveals that frontier LLMs like Claude Opus outperform specialized theorem provers at evaluating proof quality, suggesting that theorem proving ability does not transfer to proof evaluation tasks.

🧠 Claude · 🧠 Opus

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Magis-Bench: Evaluating LLMs on Magistrate-Level Legal Tasks

Researchers introduced Magis-Bench, a new benchmark for evaluating large language models on magistrate-level judicial tasks based on Brazilian competitive exams. Testing 23 state-of-the-art LLMs revealed that even top performers like Google's Gemini-3-Pro-Preview score below 70% on complex legal reasoning and judicial writing tasks, indicating significant gaps in AI legal capabilities.

🧠 Claude · 🧠 Gemini
