y0news

#benchmark-framework News & Analysis

5 articles tagged with #benchmark-framework. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 3d ago · 7/10

A Unified Framework for the Evaluation of LLM Agentic Capabilities

Researchers present a unified evaluation framework for assessing LLM agentic capabilities, integrating 7 benchmarks across 24 domains under a standardized testing methodology. The framework separates intrinsic model performance from implementation artifacts, and tests on 15 models show that scaffold choices and environmental volatility significantly affect benchmark results.

🏢 Meta · 🏢 Hugging Face
AI · Neutral · arXiv – CS AI · Apr 15 · 7/10

Evaluating Relational Reasoning in LLMs with REL

Researchers introduce REL, a benchmark framework that evaluates relational reasoning in large language models by measuring Relational Complexity (RC): the number of entities that must be bound simultaneously to apply a relation. The study finds that frontier LLMs consistently degrade as RC increases, exposing a fundamental limitation in higher-arity reasoning that persists even with increased compute and in-context learning.

AI · Neutral · arXiv – CS AI · 2d ago · 6/10

Code-QA-Bench: Separating Code Reasoning from Documentation Memorization in Repository-Level QA

Researchers introduce Code-QA-Bench, an automated framework that generates repository-level code understanding benchmarks while distinguishing genuine code comprehension from documentation recall. Tests on four frontier AI models show that code access is the primary driver of performance while documentation provides only marginal benefit, suggesting that current models reason well about code when the source is available.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

CalBench: Evaluating Coordination-Privacy Trade-offs in Multi-Agent LLMs

Researchers introduce CalBench, a controlled evaluation framework for testing multi-agent LLM coordination in calendar scheduling scenarios where agents must negotiate shared commitments while protecting private information. The benchmark measures coordination quality, communication efficiency, fairness, and privacy leakage in decentralized systems where no single agent has complete information.

🏢 Meta
AI · Neutral · arXiv – CS AI · May 4 · 6/10

How Frontier LLMs Adapt to Neurodivergence Context: A Measurement Framework for Surface vs. Structural Change in System-Prompted Responses

Researchers propose NDBench, a benchmark framework that tests how frontier LLMs adapt their outputs when given neurodivergence (ND) context in system prompts. The study finds that LLMs increase structural complexity (headings, steps, length) under explicit ND instructions, but that persona assertion alone fails to suppress harmful behaviors, a critical finding for equitable AI system design.