y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#reproducibility News & Analysis

40 articles tagged with #reproducibility. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

40 articles
AINeutralarXiv – CS AI · 4d ago6/10
🧠

Document Classification Pattern Recognition via Information Fusion: A Systematic Review of Multimodal and Multiview Representation Approaches

A comprehensive systematic review of 139 studies reveals that multimodal information fusion improves document classification accuracy by 5.28 percentage points, while multiview approaches provide modest gains of 4.67%. The research identifies critical gaps in methodological rigor, with less than 24% of studies employing statistical validation, highlighting the need for more robust research standards in the field.

AINeutralarXiv – CS AI · May 126/10
🧠

Unpredictability dissociates from structured control in language agents

Researchers demonstrate that unpredictability in language agents does not equate to effective control, finding that structured decision-making mechanisms significantly outperform stochastic sampling across 74,352 test cases. The study challenges assumptions about randomness and control in AI systems, with implications for agent reliability and interpretability.

AINeutralarXiv – CS AI · May 116/10
🧠

Same Brain, Different Prediction: How Preprocessing Choices Undermine EEG Decoding Reliability

Researchers demonstrate that EEG-based deep learning models produce unstable predictions when preprocessing pipelines change, with up to 42% of predictions flipping across different preprocessing choices. The study introduces three tools—Walsh-Hadamard decomposition, Preprocessing Uncertainty metrics, and a regularization approach—to measure and mitigate this instability, revealing a critical reliability gap in brain-computer interface systems.

AINeutralarXiv – CS AI · May 116/10
🧠

Mage: Multi-Axis Evaluation of LLM-Generated Executable Game Scenes Beyond Compile-Pass Rate

Researchers introduce Mage, a multi-axis evaluation framework that reveals compile-pass rate is a misleading metric for assessing LLM-generated code in complex domains. Testing across four open-weight language models on game scene synthesis, they find direct code generation achieves 43% runtime success but produces structurally invalid outputs, while IR-conditioned approaches recover functional correctness at the cost of lower raw execution rates.

AINeutralarXiv – CS AI · May 116/10
🧠

CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios

Researchers introduce CyBiasBench, a benchmark revealing that LLM agents deployed for cybersecurity attacks exhibit inherent biases toward specific attack families regardless of prompting. The study demonstrates agents resist steering away from their preferred attack patterns, suggesting these biases are fundamental agent characteristics rather than prompt-dependent behaviors.

AIBearisharXiv – CS AI · May 16/10
🧠

Beyond Accuracy: LLM Variability in Evidence Screening for Software Engineering SLRs

A comprehensive study comparing 12 large language models against 4 classical classifiers for automating evidence screening in software engineering systematic literature reviews reveals that LLMs exhibit significant performance variability and lack consistent superiority over traditional methods. The research emphasizes that abstract availability is critical for LLM performance, while title and keywords provide minimal additional value, suggesting LLM adoption should be driven by operational constraints rather than performance guarantees.

🏢 OpenAI🏢 Anthropic🧠 Gemini
AIBearisharXiv – CS AI · Apr 206/10
🧠

The threat of analytic flexibility in using large language models to simulate human data

A new study reveals that using large language models to generate synthetic datasets ("silicon samples") produces highly variable results depending on configuration choices, with correlation outcomes ranging from r=.23 to r=.84 on the same task. This demonstrates that analytic flexibility in LLM-based data generation poses a significant threat to research validity and reproducibility in social science applications.

AINeutralarXiv – CS AI · Apr 146/10
🧠

COMPOSITE-Stem

Researchers introduced COMPOSITE-STEM, a new benchmark containing 70 expert-written scientific tasks across physics, biology, chemistry, and mathematics to evaluate AI agents. The top-performing model achieved only 21% accuracy, indicating the benchmark effectively measures capabilities beyond current AI reach and addresses the saturation of existing evaluation frameworks.

AINeutralarXiv – CS AI · Apr 146/10
🧠

TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training

TorchUMM is an open-source unified codebase designed to standardize evaluation, analysis, and post-training of multimodal AI models across diverse architectures. The framework addresses fragmentation in the field by providing a single interface for benchmarking models on vision-language understanding, generation, and editing tasks, enabling reproducible comparisons and accelerating development of more capable multimodal systems.

🏢 Meta
AINeutralarXiv – CS AI · Apr 146/10
🧠

Inspectable AI for Science: A Research Object Approach to Generative AI Governance

Researchers propose AI as a Research Object (AI-RO), a governance framework that treats generative AI interactions as inspectable, documented components of scientific research rather than debating authorship. The framework combines interaction logs, metadata packaging, and provenance records to ensure accountability, particularly for security and privacy research where confidentiality and auditability are critical.

🏢 Meta
AINeutralarXiv – CS AI · Apr 136/10
🧠

ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences

Researchers introduce ReplicatorBench, a comprehensive benchmark for evaluating AI agents' ability to replicate scientific research claims in social and behavioral sciences. The study reveals that current LLM agents excel at designing and executing experiments but struggle significantly with data retrieval, highlighting critical gaps in autonomous research validation capabilities.

AINeutralarXiv – CS AI · Apr 76/10
🧠

Reproducibility study on how to find Spurious Correlations, Shortcut Learning, Clever Hans or Group-Distributional non-robustness and how to fix them

A reproducibility study unifies research on spurious correlations in deep neural networks across different domains, comparing correction methods including XAI-based approaches. The research finds that Counterfactual Knowledge Distillation (CFKD) most effectively improves model generalization, though practical deployment remains challenging due to group labeling dependencies and data scarcity issues.

AINeutralarXiv – CS AI · Mar 26/1014
🧠

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking

Researchers introduce Jailbreak Foundry (JBF), a system that automatically converts AI jailbreak research papers into executable code modules for standardized testing. The system successfully reproduced 30 attacks with high accuracy and reduces implementation code by nearly half while enabling consistent evaluation across multiple AI models.

AINeutralarXiv – CS AI · Mar 34/105
🧠

Agentic Scientific Simulation: Execution-Grounded Model Construction and Reconstruction

Researchers introduce JutulGPT, an AI agent system for physics-based simulation that addresses the problem of underspecified natural language descriptions in scientific modeling. The system uses an execution-grounded approach where the simulator validates physical accuracy, but reveals limitations in tracking tacit assumptions made through simulator defaults.

← PrevPage 2 of 2