AI · Bearish · arXiv – CS AI · 5d ago · 7/10
🧠Researchers introduce τ-Rec, a new benchmark for evaluating conversational AI recommender systems that replaces subjective LLM-based judging with verifiable, measurable rewards. Testing across nine model configurations reveals a critical reliability gap, with even top-performing models achieving only ~57% accuracy on single-attempt tasks, exposing significant limitations in current agentic AI deployment.
🧠 GPT-5🧠 Claude🧠 Sonnet
AI · Bearish · arXiv – CS AI · Jun 5 · 7/10
🧠Researchers identify Search-Time Contamination (STC) in deep research agents, where web search during inference allows models to access benchmark answers and metadata, artificially inflating performance by up to 4%. The study reveals widespread contamination across six public benchmarks and calls for contamination-aware evaluation practices including sandboxed environments and transparent search tracking.
🏢 Meta
AI · Neutral · arXiv – CS AI · Mar 26 · 7/10
🧠Researchers developed a graph-based evaluation framework that transforms clinical guidelines into dynamic benchmarks for testing domain-specific language models. The system addresses key evaluation challenges by providing contamination resistance, comprehensive coverage, and maintainable assessment tools that reveal systematic capability gaps in current AI models.
AI · Neutral · arXiv – CS AI · Jun 4 · 6/10
🧠Researchers introduce 100-LongBench, a new evaluation framework that addresses critical flaws in existing long-context LLM benchmarks by implementing length-controllable testing and a novel metric to isolate true long-context performance from baseline model knowledge. This development enables more accurate assessment of which models genuinely handle extended contexts versus those relying on existing training data.
AI · Bullish · arXiv – CS AI · Jun 1 · 6/10
🧠Researchers introduce GLIDE, an open-source Python library that standardizes prediction-powered inference (PPI) methods for evaluating AI systems and language models. The library combines human annotation with LLM evaluations to produce unbiased estimates with valid confidence intervals, potentially reducing annotation costs while maintaining accuracy.
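The idea behind prediction-powered inference can be sketched in a few lines: score a large unlabeled set with the LLM judge, then correct that average with the judge's measured error on a small human-labeled set. The sketch below is the textbook PPI mean estimator with a normal-approximation interval; it is not GLIDE's API, and the function name `ppi_mean` and the toy data are assumptions made for illustration.

```python
import math
from statistics import mean, variance


def ppi_mean(labeled_human, labeled_llm, unlabeled_llm, z=1.96):
    """Textbook prediction-powered estimate of a mean (hypothetical helper).

    labeled_human / labeled_llm: human and LLM scores for the same items.
    unlabeled_llm: LLM scores for items that have no human label.
    z: normal quantile for the interval (1.96 gives roughly 95%).
    """
    # Rectifier: how far the LLM judge is from the human label on each item.
    rectifier = [h - m for h, m in zip(labeled_human, labeled_llm)]

    # LLM-only average, shifted by the average judge error.
    estimate = mean(unlabeled_llm) + mean(rectifier)

    # Both terms contribute sampling noise to the standard error.
    se = math.sqrt(
        variance(unlabeled_llm) / len(unlabeled_llm)
        + variance(rectifier) / len(rectifier)
    )
    return estimate, (estimate - z * se, estimate + z * se)


# Toy data (made up): 1 = response judged acceptable, 0 = not acceptable.
human = [1, 0, 1, 1]
llm_on_labeled = [1, 1, 1, 0]
llm_on_unlabeled = [1, 1, 0, 1, 0, 1, 1, 0]

estimate, (low, high) = ppi_mean(human, llm_on_labeled, llm_on_unlabeled)
```

The more closely the LLM judge tracks the human labels, the smaller the rectifier's variance and the tighter the interval, which is where the potential annotation savings come from.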
AI · Neutral · arXiv – CS AI · May 28 · 6/10
🧠STAB is a specification-driven testing pipeline that generates test cases exposing algorithmic bottlenecks by extracting constraints and injecting adversarial structures from natural language problem specifications. The method improves bottleneck detection rates from 50-57% to 71-73% across major programming languages and LLM implementations.
AI · Neutral · arXiv – CS AI · May 28 · 6/10
🧠Researchers demonstrate that standard statistical hypothesis tests fail when applied to generative surveying, where LLM-based personas provide market research feedback. The study proposes a valid permutation test that accounts for prompt sensitivity and provides guidance on optimal resource allocation for this emerging research methodology.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduce SeedRG, a benchmark generation pipeline that addresses knowledge leakage in retrieval-augmented generation (RAG) evaluation by creating novel, structurally similar test instances that cannot be answered from language models' existing parametric memory. The approach tackles the critical problem of benchmark aging, where reused datasets become less effective for evaluation as their content gets absorbed into model training.
AI · Neutral · arXiv – CS AI · Apr 15 · 6/10
🧠The first LLM Testing competition at ICSE 2026's DeepTest workshop evaluated four tools designed to benchmark an LLM-based automotive assistant, focusing on their ability to identify failure cases where the system fails to surface critical safety warnings from car manuals. The competition assessed both the effectiveness of test discovery and the diversity of identified failures, establishing a benchmark for evaluating AI testing methodologies in safety-critical applications.
AI · Neutral · arXiv – CS AI · Apr 15 · 6/10
🧠Researchers propose League of LLMs (LOL), a benchmark-free evaluation framework that uses mutual peer assessment among multiple LLMs to overcome data contamination and evaluation bias issues. Testing on eight mainstream models reveals 70.7% ranking consistency while uncovering model-specific behaviors like memorization patterns and family-based scoring bias in OpenAI models.
🏢 OpenAI
AI · Neutral · arXiv – CS AI · Apr 7 · 6/10
🧠Researchers have developed a new automated pipeline that generates challenging math problems by first identifying specific mathematical concepts where LLMs struggle, then creating targeted problems to test these weaknesses. The method successfully reduced a leading LLM's accuracy from 77% to 45%, demonstrating its effectiveness at creating more rigorous benchmarks.
🧠 Llama
AI · Neutral · arXiv – CS AI · Mar 26 · 6/10
🧠Researchers developed DepthCharge, a new framework for measuring how deeply large language models can maintain accurate responses when questioned about domain-specific knowledge. Testing across four domains revealed significant variation in model performance depth, with no single AI model dominating all areas and expensive models not always achieving superior results.
AI · Neutral · arXiv – CS AI · Mar 26 · 6/10
🧠Researchers have developed LLMORPH, an automated testing tool for Large Language Models that uses Metamorphic Testing to identify faulty behaviors without requiring human-labeled data. The tool was tested on GPT-4, LLAMA3, and HERMES 2 across four NLP benchmarks, generating over 561,000 test executions and successfully exposing model inconsistencies.
🧠 GPT-4
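Metamorphic testing needs no labeled answers: it checks that a model's outputs on an input and on a transformed copy of that input satisfy a known relation. The sketch below shows the general pattern only. It is not LLMORPH's implementation, and `metamorphic_failures`, `toy_model`, and the uppercase transformation are assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def metamorphic_failures(
    model: Callable[[str], str],
    inputs: List[str],
    transform: Callable[[str], str],
    relation: Callable[[str, str], bool],
) -> List[Tuple[str, str, str]]:
    """Return (input, source output, follow-up output) for each violated relation."""
    failures = []
    for text in inputs:
        source_out = model(text)                # output on the original input
        follow_up_out = model(transform(text))  # output on the transformed input
        if not relation(source_out, follow_up_out):
            failures.append((text, source_out, follow_up_out))
    return failures


def toy_model(text: str) -> str:
    """Stand-in for an LLM call: a deliberately case-sensitive classifier."""
    return "positive" if "good" in text else "negative"


# Relation under test: changing letter case should not change the label.
failures = metamorphic_failures(
    toy_model,
    ["good movie", "bad movie"],
    transform=str.upper,
    relation=lambda a, b: a == b,
)
```

Here the toy classifier changes its label for "good movie" once the text is uppercased, so that input is reported as an inconsistency even though no ground-truth label was supplied.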
AI · Bullish · arXiv – CS AI · Mar 16 · 6/10
🧠Researchers have developed PsyCogMetrics AI Lab, a cloud-based platform that applies psychometric and cognitive science methodologies to evaluate Large Language Models. The platform was created through a three-cycle Action Design Science study and aims to advance AI evaluation methods at the intersection of psychology, cognitive science, and artificial intelligence.
AI · Neutral · arXiv – CS AI · Mar 12 · 6/10
🧠Researchers have developed the System Hallucination Scale (SHS), a human-centered tool for evaluating hallucination behavior in large language models. The instrument showed strong statistical validity in testing with 210 participants and provides a practical method for assessing AI model reliability from a user perspective.