y0news

#llm-testing News & Analysis

15 articles tagged with #llm-testing. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · 5d ago · 7/10

τ-Rec: A Verifiable Benchmark for Agentic Recommender Systems

Researchers introduce τ-Rec, a new benchmark for evaluating conversational AI recommender systems that replaces subjective LLM-based judging with verifiable, measurable rewards. Testing across nine model configurations reveals a critical reliability gap, with even top-performing models achieving only ~57% accuracy on single-attempt tasks, exposing significant limitations in current agentic AI deployment.

🧠 GPT-5 · 🧠 Claude · 🧠 Sonnet
AI · Bearish · arXiv – CS AI · Jun 5 · 7/10

Search-Time Contamination in Deep Research Agents: Measuring Performance Inflation in Public Benchmark Evaluation

Researchers identify Search-Time Contamination (STC) in deep research agents, where web search during inference allows models to access benchmark answers and metadata, artificially inflating performance by up to 4%. The study reveals widespread contamination across six public benchmarks and calls for contamination-aware evaluation practices including sandboxed environments and transparent search tracking.

🏢 Meta
AI · Neutral · arXiv – CS AI · Mar 26 · 7/10

From Guidelines to Guarantees: A Graph-Based Evaluation Harness for Domain-Specific Evaluation of LLMs

Researchers developed a graph-based evaluation framework that transforms clinical guidelines into dynamic benchmarks for testing domain-specific language models. The system addresses key evaluation challenges by providing contamination resistance, comprehensive coverage, and maintainable assessment tools that reveal systematic capability gaps in current AI models.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?

Researchers introduce 100-LongBench, a new evaluation framework that addresses critical flaws in existing long-context LLM benchmarks by implementing length-controllable testing and a novel metric to isolate true long-context performance from baseline model knowledge. This development enables more accurate assessment of which models genuinely handle extended contexts versus those relying on existing training data.

AI · Bullish · arXiv – CS AI · Jun 1 · 6/10

Industrializing Prediction-Powered Inference: The GLIDE Library for Reliable GenAI and Agentic Systems Evaluation

Researchers introduce GLIDE, an open-source Python library that standardizes prediction-powered inference (PPI) methods for evaluating AI systems and language models. The library combines human annotation with LLM evaluations to produce unbiased estimates with valid confidence intervals, potentially reducing annotation costs while maintaining accuracy.
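
To make the idea concrete, here is a minimal sketch of the standard prediction-powered estimate of a mean: the average LLM-judge score on a large unlabeled set, corrected by the average judge error measured on a small human-labeled set. It does not use GLIDE's API, which this summary does not describe; the function name `ppi_mean`, its parameters, and the sample scores are all hypothetical, and the interval is a plain normal approximation.

```python
import math
import statistics


def ppi_mean(human_labels, judge_on_labeled, judge_on_unlabeled, z=1.96):
    """Prediction-powered estimate of a mean with a normal-approximation CI.

    human_labels:       human scores for the small labeled set
    judge_on_labeled:   LLM-judge scores for the same labeled items
    judge_on_unlabeled: LLM-judge scores for the large unlabeled set
    """
    n = len(human_labels)
    big_n = len(judge_on_unlabeled)
    # Rectifier: how far the judge is off, measured where human labels exist.
    residuals = [y - f for y, f in zip(human_labels, judge_on_labeled)]
    estimate = statistics.fmean(judge_on_unlabeled) + statistics.fmean(residuals)
    # The two samples are independent, so their variances add.
    variance = (
        statistics.variance(judge_on_unlabeled) / big_n
        + statistics.variance(residuals) / n
    )
    half_width = z * math.sqrt(variance)
    return estimate, (estimate - half_width, estimate + half_width)


# Hypothetical binary "answer is acceptable" scores, for illustration only.
estimate, (lower, upper) = ppi_mean(
    human_labels=[1, 0, 1, 1],
    judge_on_labeled=[1, 1, 1, 0],
    judge_on_unlabeled=[1, 1, 0, 1, 1, 0, 1, 1],
)
```

The more closely the judge tracks the human labels, the smaller the residual variance and the tighter the interval, which is where the potential savings in annotation cost come from.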

AI · Neutral · arXiv – CS AI · May 28 · 6/10

STAB: Specification-driven Testing for Algorithmic Bottlenecks

STAB is a specification-driven testing pipeline that generates test cases exposing algorithmic bottlenecks by extracting constraints and injecting adversarial structures from natural language problem specifications. The method improves bottleneck detection rates from 50-57% to 71-73% across major programming languages and LLM implementations.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

When prompt perturbations break your A/B test: A valid statistical test for generative surveying

Researchers demonstrate that standard statistical hypothesis tests fail when applied to generative surveying, where LLM-based personas provide market research feedback. The study proposes a valid permutation test that accounts for prompt sensitivity and provides guidance on optimal resource allocation for this emerging research methodology.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Generating Leakage-Free Benchmarks for Robust RAG Evaluation

Researchers introduce SeedRG, a benchmark generation pipeline that addresses knowledge leakage in retrieval-augmented generation (RAG) evaluation by creating novel, structurally similar test instances that cannot be answered from language models' existing parametric memory. The approach tackles the critical problem of benchmark aging, where reused datasets become less effective for evaluation as their content gets absorbed into model training.

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

DeepTest Tool Competition 2026: Benchmarking an LLM-Based Automotive Assistant

The first LLM testing competition, held at the DeepTest workshop at ICSE 2026, evaluated four tools for testing an LLM-based automotive assistant, focusing on how well they find cases where the assistant fails to surface critical safety warnings from car manuals. The competition assessed both the effectiveness of test discovery and the diversity of the failures found, establishing a benchmark for AI testing methods in safety-critical applications.

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models

Researchers propose League of LLMs (LOL), a benchmark-free evaluation framework that uses mutual peer assessment among multiple LLMs to overcome data contamination and evaluation bias issues. Testing on eight mainstream models reveals 70.7% ranking consistency while uncovering model-specific behaviors like memorization patterns and family-based scoring bias in OpenAI models.

🏢 OpenAI
AI · Neutral · arXiv – CS AI · Apr 7 · 6/10

Automatically Generating Hard Math Problems from Hypothesis-Driven Error Analysis

Researchers have developed a new automated pipeline that generates challenging math problems by first identifying specific mathematical concepts where LLMs struggle, then creating targeted problems to test these weaknesses. The method successfully reduced a leading LLM's accuracy from 77% to 45%, demonstrating its effectiveness at creating more rigorous benchmarks.

🧠 Llama
AI · Neutral · arXiv – CS AI · Mar 26 · 6/10

DepthCharge: A Domain-Agnostic Framework for Measuring Depth-Dependent Knowledge in Large Language Models

Researchers developed DepthCharge, a new framework for measuring how deeply large language models can maintain accurate responses when questioned about domain-specific knowledge. Testing across four domains revealed significant variation in model performance depth, with no single AI model dominating all areas and expensive models not always achieving superior results.

AI · Neutral · arXiv – CS AI · Mar 26 · 6/10

LLMORPH: Automated Metamorphic Testing of Large Language Models

Researchers have developed LLMORPH, an automated testing tool for Large Language Models that uses Metamorphic Testing to identify faulty behaviors without requiring human-labeled data. The tool was tested on GPT-4, LLAMA3, and HERMES 2 across four NLP benchmarks, generating over 561,000 test executions and successfully exposing model inconsistencies.
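
As a rough illustration of metamorphic testing in general (not LLMORPH's own implementation or interface), the sketch below checks one metamorphic relation: changing the case of the input should not change the predicted label. No ground-truth labels are needed, only agreement between the source output and the follow-up output. The names `check_metamorphic_relation` and `toy_sentiment` are hypothetical, and the toy model stands in for a real LLM call.

```python
def check_metamorphic_relation(model, inputs, transform, outputs_agree):
    """Return the inputs whose transformed version violates the relation."""
    failures = []
    for text in inputs:
        source_output = model(text)
        follow_up_output = model(transform(text))
        if not outputs_agree(source_output, follow_up_output):
            failures.append(text)
    return failures


def toy_sentiment(text):
    # Hypothetical stand-in for an LLM call. It is deliberately
    # case-sensitive so that the relation below exposes a fault.
    return "positive" if "good" in text else "negative"


failures = check_metamorphic_relation(
    toy_sentiment,
    ["good service", "Good food", "bad day"],
    transform=str.upper,                  # relation: case change keeps the label
    outputs_agree=lambda a, b: a == b,
)
print(failures)
```

An input is reported only when the model disagrees with itself, so a relation like this can flag inconsistencies at scale without human-labeled data.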

🧠 GPT-4
AI · Bullish · arXiv – CS AI · Mar 16 · 6/10

Developing the PsyCogMetrics AI Lab to Evaluate Large Language Models and Advance Cognitive Science -- A Three-Cycle Action Design Science Study

Researchers have developed PsyCogMetrics AI Lab, a cloud-based platform that applies psychometric and cognitive science methodologies to evaluate Large Language Models. The platform was created through a three-cycle Action Design Science study and aims to advance AI evaluation methods at the intersection of psychology, cognitive science, and artificial intelligence.

AI · Neutral · arXiv – CS AI · Mar 12 · 6/10

The System Hallucination Scale (SHS): A Minimal yet Effective Human-Centered Instrument for Evaluating Hallucination-Related Behavior in Large Language Models

Researchers have developed the System Hallucination Scale (SHS), a human-centered tool for evaluating hallucination behavior in large language models. The instrument showed strong statistical validity in testing with 210 participants and provides a practical method for assessing AI model reliability from a user perspective.