#benchmark-generation News & Analysis

6 articles tagged with #benchmark-generation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 27/10

BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution

BenchEvolver is an AI framework that automatically generates harder variants of existing coding problems to address benchmark saturation, where frontier LLMs now achieve 99% accuracy on standard tests. By evolving solutions rather than creating problems from scratch, it produces verifiable, diverse tasks that remain challenging even for the models that generated them, enabling both better evaluation and improved training signals.

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

Project Auto-World: Towards Automated Benchmarking of Neural Relational Reasoners

Researchers demonstrate that large language models can automate the generation of increasingly difficult benchmark instances for testing neural reasoning systems. The approach combines LLM-driven evolutionary search with an Edge Transformer evaluator, enabling automated discovery of challenging problem instances and improvements in model generalization without manual benchmark creation.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

PIPE-Cypher: Automatic Enterprise Benchmark Generation for Text-to-Cypher Systems

Researchers introduce PIPE-Cypher, an automated pipeline for generating Text-to-Cypher benchmarks tailored to enterprise property graphs. The system combines schema profiling, LLM generation, and validation to create deployment-relevant datasets that reflect real user queries. It addresses the problem that enterprise graphs have unique structures and evolving schemas, which make standardized benchmarks inadequate.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Generating Leakage-Free Benchmarks for Robust RAG Evaluation

Researchers introduce SeedRG, a benchmark generation pipeline that addresses knowledge leakage in retrieval-augmented generation (RAG) evaluation by creating novel, structurally similar test instances that cannot be answered from language models' existing parametric memory. The approach tackles the critical problem of benchmark aging, where reused datasets become less effective for evaluation as their content gets absorbed into model training.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

TimeSeriesExamAgent: Creating Time Series Reasoning Benchmarks at Scale

Researchers introduce TimeSeriesExamAgent, a scalable framework for automatically generating time series reasoning benchmarks using LLM agents and templates. The study reveals that while large language models show promise in time series tasks, they significantly underperform in abstract reasoning and domain-specific applications across healthcare, finance, and weather domains.

AI · Neutral · arXiv – CS AI · Apr 7 · 6/10

Automatically Generating Hard Math Problems from Hypothesis-Driven Error Analysis

Researchers have developed a new automated pipeline that generates challenging math problems by first identifying specific mathematical concepts where LLMs struggle, then creating targeted problems to test these weaknesses. The method successfully reduced a leading LLM's accuracy from 77% to 45%, demonstrating its effectiveness at creating more rigorous benchmarks.