#llm-testing News & Analysis

8 articles tagged with #llm-testing. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Mar 26 · 7/10

From Guidelines to Guarantees: A Graph-Based Evaluation Harness for Domain-Specific Evaluation of LLMs

Researchers developed a graph-based evaluation framework that transforms clinical guidelines into dynamic benchmarks for testing domain-specific language models. The system addresses key evaluation challenges by providing contamination resistance, comprehensive coverage, and maintainable assessment tools that reveal systematic capability gaps in current AI models.

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

DeepTest Tool Competition 2026: Benchmarking an LLM-Based Automotive Assistant

The first LLM Testing competition, held at the DeepTest workshop at ICSE 2026, evaluated four tools designed to benchmark an LLM-based automotive assistant. The tools were judged on how effectively they found cases where the assistant fails to surface critical safety warnings from car manuals, and on the diversity of the failures they uncovered. The result is a benchmark for evaluating AI testing methodologies in safety-critical applications.

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models

Researchers propose League of LLMs (LOL), a benchmark-free evaluation framework that uses mutual peer assessment among multiple LLMs to overcome data contamination and evaluation bias issues. Testing on eight mainstream models reveals 70.7% ranking consistency while uncovering model-specific behaviors like memorization patterns and family-based scoring bias in OpenAI models.

🏢 OpenAI

AI · Neutral · arXiv – CS AI · Apr 7 · 6/10

Automatically Generating Hard Math Problems from Hypothesis-Driven Error Analysis

Researchers have developed a new automated pipeline that generates challenging math problems by first identifying specific mathematical concepts where LLMs struggle, then creating targeted problems to test these weaknesses. The method successfully reduced a leading LLM's accuracy from 77% to 45%, demonstrating its effectiveness at creating more rigorous benchmarks.

🧠 Llama

AI · Neutral · arXiv – CS AI · Mar 26 · 6/10

DepthCharge: A Domain-Agnostic Framework for Measuring Depth-Dependent Knowledge in Large Language Models

Researchers developed DepthCharge, a new framework for measuring how deeply large language models can maintain accurate responses when questioned about domain-specific knowledge. Testing across four domains revealed significant variation in model performance depth, with no single AI model dominating all areas and expensive models not always achieving superior results.

AI · Neutral · arXiv – CS AI · Mar 26 · 6/10

LLMORPH: Automated Metamorphic Testing of Large Language Models

Researchers have developed LLMORPH, an automated testing tool for Large Language Models that uses Metamorphic Testing to identify faulty behaviors without requiring human-labeled data. The tool was tested on GPT-4, LLAMA3, and HERMES 2 across four NLP benchmarks, generating over 561,000 test executions and successfully exposing model inconsistencies.

🧠 GPT-4
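
To illustrate the general idea of metamorphic testing mentioned in the summary above, here is a minimal sketch. It is not LLMORPH's implementation: `toy_model`, `uppercase_transform`, and `find_violations` are hypothetical names, and the "model" is a keyword classifier standing in for a real LLM call. The point is only that a test compares the outputs for an input and a meaning-preserving variant of it, so no human-written label is needed.

```python
from typing import Callable, List, Tuple


def toy_model(text: str) -> str:
    """Hypothetical stand-in for an LLM: a case-sensitive keyword classifier."""
    return "positive" if "good" in text else "negative"


def uppercase_transform(text: str) -> str:
    """Assumed metamorphic relation: changing letter case keeps the meaning."""
    return text.upper()


def find_violations(
    model: Callable[[str], str],
    transform: Callable[[str], str],
    inputs: List[str],
) -> List[Tuple[str, str]]:
    """Return (source, follow_up) pairs whose outputs disagree."""
    violations = []
    for source in inputs:
        follow_up = transform(source)
        # The relation expects identical outputs; a mismatch flags a
        # potential fault without any ground-truth label.
        if model(source) != model(follow_up):
            violations.append((source, follow_up))
    return violations


inputs = ["good service", "bad service"]
violations = find_violations(toy_model, uppercase_transform, inputs)
print(violations)
```

In this toy setup the classifier is case-sensitive, so the upper-cased variant of the first input changes the output and is reported as a violation. With a real model, the transform and the equality check would both need to be chosen to fit the task.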

AI · Bullish · arXiv – CS AI · Mar 16 · 6/10

Developing the PsyCogMetrics AI Lab to Evaluate Large Language Models and Advance Cognitive Science -- A Three-Cycle Action Design Science Study

Researchers have developed PsyCogMetrics AI Lab, a cloud-based platform that applies psychometric and cognitive science methodologies to evaluate Large Language Models. The platform was created through a three-cycle Action Design Science study and aims to advance AI evaluation methods at the intersection of psychology, cognitive science, and artificial intelligence.

AI · Neutral · arXiv – CS AI · Mar 12 · 6/10

The System Hallucination Scale (SHS): A Minimal yet Effective Human-Centered Instrument for Evaluating Hallucination-Related Behavior in Large Language Models

Researchers have developed the System Hallucination Scale (SHS), a human-centered tool for evaluating hallucination behavior in large language models. The instrument showed strong statistical validity in testing with 210 participants and provides a practical method for assessing AI model reliability from a user perspective.