y0news

#benchmark-generation News & Analysis

3 articles tagged with #benchmark-generation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Generating Leakage-Free Benchmarks for Robust RAG Evaluation

Researchers introduce SeedRG, a benchmark generation pipeline that addresses knowledge leakage in retrieval-augmented generation (RAG) evaluation by creating novel, structurally similar test instances that cannot be answered from language models' existing parametric memory. The approach tackles the critical problem of benchmark aging, where reused datasets become less effective for evaluation as their content gets absorbed into model training.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

TimeSeriesExamAgent: Creating Time Series Reasoning Benchmarks at Scale

Researchers introduce TimeSeriesExamAgent, a scalable framework that uses LLM agents and templates to generate time series reasoning benchmarks automatically. The study finds that while large language models show promise on time series tasks, they significantly underperform in abstract reasoning and in domain-specific applications across healthcare, finance, and weather.

AI · Neutral · arXiv – CS AI · Apr 7 · 6/10

Automatically Generating Hard Math Problems from Hypothesis-Driven Error Analysis

Researchers have developed an automated pipeline that generates challenging math problems by first identifying specific mathematical concepts where LLMs struggle, then creating targeted problems to test those weaknesses. The generated problems reduced a leading LLM's accuracy from 77% to 45%, a drop of 32 percentage points, which suggests the method is effective at producing more rigorous benchmarks.
