
#ai-evaluation News & Analysis

Coverage of #ai-evaluation has remained relatively stable over the past month, with 32 articles added in the last 30 days out of 160 total indexed. Sentiment is predominantly neutral at 71.9%, with 9.4% bullish and 18.8% bearish; the bullish share has shifted by only 3.5 percentage points compared with the previous 90-day period. Academic research dominates the conversation, with arXiv's computer science and AI sections contributing the vast majority of indexed articles. Recent discussions frequently center on major language models including GPT-5, Gemini, and Claude. Related coverage typically intersects with the #benchmark, #machine-learning, #research, and #llm topics. Scan the articles below for the latest developments in this area.

sentiment · last 30d (32 articles)

Top sources: arXiv – CS AI (120) · Decrypt (1) · Fortune Crypto (1) · MIT News – AI (1) · Hugging Face Blog (1)

Often co-tagged with: #benchmark #machine-learning #research #llm #ai-research #language-models

Most-discussed entities: GPT-5 (8) · Gemini (8) · Claude (7) · Llama (5) · GPT-4 (5)

308 articles

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Persona Generators: Generating Diverse Synthetic Personas for Arbitrary Contexts

Researchers introduce Persona Generators, AI functions that create diverse synthetic populations for evaluating AI systems across varied user demographics without needing extensive real-world data collection. Using iterative optimization with large language models, the approach generates lightweight code that produces synthetic personas spanning rare trait combinations and long-tail behaviors, outperforming existing baselines on diversity metrics.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

CalBench: Evaluating Coordination-Privacy Trade-offs in Multi-Agent LLMs

Researchers introduce CalBench, a controlled evaluation framework for testing multi-agent LLM coordination in calendar scheduling scenarios where agents must negotiate shared commitments while protecting private information. The benchmark measures coordination quality, communication efficiency, fairness, and privacy leakage in decentralized systems where no single agent has complete information.

🏢 Meta

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Re²Math: Benchmarking Theorem Retrieval in Research-Level Mathematics

Researchers introduce Re²Math, a new benchmark for evaluating large language models' ability to retrieve relevant mathematical theorems and lemmas from academic literature during proof construction. The benchmark reveals significant gaps in current AI systems: the best model achieves only 7.0% accuracy despite retrieving valid statements, which indicates that models struggle to verify whether a retrieved result applies to the specific proof context.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning

Researchers introduce SeePhys Pro, a benchmark revealing that advanced AI models significantly degrade in physics reasoning when visual information replaces text, with visual grounding as the primary failure point. The study further demonstrates that multimodal reinforcement learning improvements can stem from non-visual textual cues rather than genuine visual understanding, challenging current evaluation methodologies.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Absurd World: A Simple Yet Powerful Method to Absurdify the Real-world for Probing LLM Reasoning Capabilities

Researchers introduce Absurd World, a benchmarking framework that tests large language models' logical reasoning by creating logically coherent but unrealistic scenarios derived from real-world problems. The framework reveals whether LLMs can reason independently of learned patterns by breaking down real-world models into symbols, actions, sequences, and events, then systematically altering them while preserving underlying logic.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

MaD Physics: Evaluating information seeking under constraints in physical environments

Researchers introduce MaD Physics, a benchmark for evaluating AI agents' ability to conduct scientific discovery under realistic resource constraints. The benchmark tests agents' capacity to make informative measurements within budget limits and infer underlying physical laws, using altered physics environments to prevent reliance on training data.

🧠 Gemini

AI · Neutral · arXiv – CS AI · May 12 · 6/10

The Generalized Turing Test: A Foundation for Comparing Intelligence

Researchers introduce the Generalized Turing Test (GTT), a formal framework for comparing AI agent capabilities through indistinguishability rather than fixed benchmarks. The framework defines a comparator in which one agent is deemed superior if a second agent cannot reliably distinguish interactions with the first agent from interactions with itself. The result is a dataset-agnostic evaluation method, which the authors validate across modern AI models.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Understanding Asynchronous Inference Methods for Vision-Language-Action Models

Researchers present a systematic comparison of four asynchronous inference methods designed to reduce latency issues in Vision-Language-Action robot control models. The study benchmarks A2C2, IT-RTC, TT-RTC, and VLASH across standardized conditions, finding that A2C2's residual correction approach performs most consistently across varying delay scenarios.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

ReplaySCM: A Benchmark for Executable Causal Mechanism Induction from Interventions

ReplaySCM introduces a 1,300-item benchmark for evaluating how well language models can infer causal mechanisms from limited intervention data. The benchmark tests whether AI systems can output executable Boolean causal models that generalize to unseen intervention scenarios, revealing that frontier LLMs struggle significantly when structural information is hidden.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

VeriContest: A Competitive-Programming Benchmark for Verifiable Code Generation

Researchers introduce VeriContest, a benchmark of 946 competitive-programming problems designed to evaluate AI models' ability to generate not just functional code but also formal specifications and machine-checkable proofs. Testing ten state-of-the-art models reveals a dramatic capability gap: while the strongest model achieves 92% accuracy on code generation alone, performance plummets to 48% on specifications, 14% on proofs, and just 5% end-to-end, identifying proof generation as the critical bottleneck for verifiable code generation systems.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

PrepBench: How Far Are We from Natural-Language-Driven Data Preparation?

Researchers introduce PrepBench, a new benchmark for evaluating how well large language models can handle natural language-driven data preparation tasks. The benchmark reveals that despite recent LLM advances, current models still struggle significantly with translating user intent into executable data preparation workflows, particularly when handling ambiguous requirements and complex real-world datasets.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Generating Leakage-Free Benchmarks for Robust RAG Evaluation

Researchers introduce SeedRG, a benchmark generation pipeline that addresses knowledge leakage in retrieval-augmented generation (RAG) evaluation by creating novel, structurally similar test instances that cannot be answered from language models' existing parametric memory. The approach tackles the critical problem of benchmark aging, where reused datasets become less effective for evaluation as their content gets absorbed into model training.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

TeamBench: Evaluating Agent Coordination under Enforced Role Separation

TeamBench is a new benchmark evaluating multi-agent AI coordination under enforced role separation, revealing that prompt-only instructions fail to prevent role violations and that agent teams often underperform single agents on well-solved tasks. The study demonstrates that passing rates can mask coordination failures and misaligned team dynamics.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning

Researchers introduce ChemCost, a benchmark for evaluating LLM agents on chemical cost estimation from reaction descriptions. The study reveals that even frontier LLMs achieve only 50.6% accuracy on clean inputs and degrade significantly with realistic noise, exposing brittleness in parsing, evidence integration, and tool use despite access to domain-specific tools.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory

Researchers present a scale-conditioned evaluation protocol for AI agent memory systems that tests whether stored evidence remains usable as irrelevant data accumulates. Testing across multiple memory architectures and language models reveals that reliability degrades unpredictably with scale, with some models exceeding computational budgets while others maintain performance, suggesting memory scalability claims must be conditioned on specific agent-interface-scale combinations.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents

Researchers introduced AgentEscapeBench, a benchmark that evaluates how well LLM-based agents can reason through complex, multi-step tasks requiring external tool use and long-range dependency tracking. Testing 16 LLM agents against 270 escape-room-style problems revealed significant performance degradation as task complexity increased, with the best models dropping from 90% success to 60% as dependency depth tripled, highlighting a critical limitation in current AI agent capabilities.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

The Translation Tax Is Not a Scalar: A Counterfactual Audit of English-Source Cue Inheritance in Chinese Multilingual Benchmarks

Researchers challenge the assumption that the 'Translation Tax'—a uniform penalty in translated multilingual benchmarks—operates as a simple scalar. Through counterfactual analysis of English-to-Chinese translations, they find translation quality effects are heterogeneous, model-dependent, and item-specific rather than uniform across benchmarks.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain

Researchers introduced DRIP-R, a benchmark designed to evaluate how large language model-based agents handle ambiguous retail policies where multiple valid interpretations exist. The study reveals that frontier AI models fundamentally disagree on identical policy-ambiguous scenarios, exposing a critical gap in agent decision-making capabilities for real-world applications.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

CoCoReviewBench: A Completeness- and Correctness-Oriented Benchmark for AI Reviewers

Researchers introduce CoCoReviewBench, a new benchmark dataset of 3,900 papers from ICLR and NeurIPS designed to reliably evaluate AI review systems. The benchmark addresses critical gaps in current evaluation methods by prioritizing correctness over mere overlap with human reviews, revealing that existing AI reviewers struggle with hallucinations and reasoning accuracy.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios

Researchers propose a standardized methodology for evaluating AI systems by transforming real-world use cases into detailed evaluation scenarios, addressing inconsistencies in AI measurement across industries. The work demonstrates this framework in financial services, generating 107 scenarios from six key use cases through structured worksheets and iterative human review.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Benchmarking World-Model Learning with Environment-Level Queries

Researchers introduce WorldTest, a new evaluation protocol for assessing whether AI agents learn general-purpose world models capable of answering diverse environment-level queries. AutumnBench, an instantiation of this framework, benchmarks 43 grid-world environments across 129 tasks and reveals that frontier AI models significantly underperform humans, with gaps attributed to differences in exploration and belief-updating strategies.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

Making AI Evaluation Deployment Relevant Through Context Specification

Researchers propose 'context specification' as a methodology to improve AI evaluation practices by translating stakeholder priorities into measurable, observable constructs. The approach aims to bridge the gap between standardized AI testing and real-world deployment outcomes, addressing widespread organizational struggles to extract value from AI investments.

AI · Neutral · Decrypt · May 4 · 6/10

US Government Says China's Best AI Models Lag Behind. Experts Aren't So Sure

The US National Institute of Standards and Technology (NIST) evaluated DeepSeek V4 Pro and concluded that Chinese AI models lag behind US counterparts, but the methodology has drawn significant criticism. Experts question the use of private benchmarks and a cost-comparison filter that conveniently excluded all US models except GPT-5.4 mini, suggesting the evaluation may be politically motivated rather than scientifically rigorous.


🧠 GPT-5

AI · Neutral · arXiv – CS AI · May 4 · 6/10

ARMOR 2025: A Military-Aligned Benchmark for Evaluating Large Language Model Safety Beyond Civilian Contexts

Researchers introduced ARMOR 2025, a military-focused safety benchmark for evaluating large language models against military doctrines including the Law of War and Rules of Engagement. The benchmark tests 21 commercial LLMs across 519 doctrinally grounded prompts organized in a 12-category taxonomy, revealing significant safety alignment gaps for defense applications.

AI · Neutral · arXiv – CS AI · May 4 · 6/10

How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks

Researchers benchmarked leading multimodal AI models (GPT-4o, Gemini, Claude, etc.) against standard computer vision tasks and found they perform as respectable generalists but lag significantly behind specialized models. The study reveals these foundation models excel at semantic tasks but struggle with geometric understanding, with GPT-4o leading non-reasoning models while reasoning variants show promise on 3D tasks.

🧠 GPT-4 · 🧠 Claude · 🧠 Gemini

