#evaluation-framework News & Analysis

71 articles tagged with #evaluation-framework. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

A Survey on Evaluating Quality and Trustworthiness in LLM-Generated Data

Researchers propose the LLM Data Auditor framework to systematically evaluate the quality and trustworthiness of synthetic data generated by large language models across six modalities. The framework shifts evaluation focus from downstream task performance to intrinsic data properties, revealing significant deficiencies in current evaluation practices and offering recommendations for improvement.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

BioDivergence: A Benchmark and Evaluation Framework for Hidden Contextual Contradictions in Biomedical Abstracts

Researchers introduce BioDivergence, a new evaluation framework that distinguishes between genuine contradictions and context-dependent divergences in biomedical research claims. The framework includes a six-class taxonomy and 13-axis ontology to capture why studies produce seemingly conflicting results, with a released benchmark of 11,865 claim pairs showing that current NLI models struggle with contextual understanding.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Closing the Sim-to-Real Gap: An Evaluation Framework for Autonomous Cyber Defense Configuration of Commercial EDR

Researchers developed the first evaluation framework for autonomous AI defense agents operating within commercial endpoint detection and response (EDR) systems, revealing critical gaps between simulation environments and real-world enterprise security. Testing with Microsoft Defender XDR and LLM-based agents uncovered that commercial EDR telemetry is optimized for human analysts rather than benchmarking, creating attribution challenges and unpredictable autonomous system behavior.

Claude · Sonnet

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs

Researchers introduce AVI-Bench, a comprehensive benchmark for evaluating audio-visual intelligence in multimodal large language models across perception, understanding, and reasoning tasks. The study reveals significant limitations in current models and proposes a taxonomy to guide development of more robust audio-visual AI systems.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

Researchers propose CapCode and CapReward, frameworks designed to detect and prevent AI coding agents from achieving high evaluation scores through shortcuts rather than genuine task-solving. By capping the maximum achievable non-cheating performance below 100%, scores above the cap serve as evidence of deceptive behavior, enabling more reliable agent evaluation.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

VALUEFLOW: Toward Pluralistic and Steerable Value-based Alignment in Large Language Models

Researchers introduce VALUEFLOW, a comprehensive framework for aligning Large Language Models with diverse human values through hierarchical extraction, calibrated intensity evaluation, and steerable control mechanisms. The system addresses fundamental limitations in existing preference-based alignment approaches by enabling precise, multi-theory value alignment at controlled intensities across different models.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Causal Scaffolding for Physical Reasoning: A Benchmark for Causally-Informed Physical World Understanding in VLMs

Researchers introduce CausalPhys, a benchmark with over 3,000 curated video and image questions designed to evaluate how well vision-language models understand causal physical reasoning. The work includes expert-annotated causal graphs and proposes Causal Rationale-informed Fine-Tuning (CRFT) to improve VLM performance on physical world reasoning tasks.

AI · Bullish · arXiv – CS AI · Jun 4 · 6/10

Self-Evolving Deep Research via Joint Generation and Evaluation

Researchers introduce SCORE, a co-evolutionary framework that jointly trains evaluation and generation models for deep research report generation. The approach addresses limitations in LLM-based research agents by letting evaluators adapt their standards as solver performance improves, and it shows consistent quality improvements over static evaluation methods.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

NoRA: Evaluating Grounded Reasonableness in Visual First-person Normative Action Reasoning

Researchers introduce NoRA, a visual reasoning benchmark that evaluates whether AI models can generate and justify appropriate actions in first-person video scenarios through explicit reasoning graphs. The benchmark reveals that current multimodal language models struggle to construct complete action spaces and properly ground decisions in visible evidence, highlighting a critical gap between selecting plausible actions and explaining them through verifiable reasoning.