#reproducibility News & Analysis

79 articles tagged with #reproducibility. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Code Isn't Memory: A Structural Codebase Index Inside a Coding Agent

Researchers evaluated whether structural codebase indexing improves coding agent performance by running controlled experiments with Claude Opus 4.7 across standardized benchmarks. Results show the index significantly improves code localization and task resolution rates without increasing costs, and outperforms simpler retrieval baselines, suggesting structural ranking becomes valuable for multi-file code changes.

Claude · Opus

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

AInterviewer: A Platform for Designing and Conducting AI-led Qualitative Interviews

Researchers introduce AInterviewer, an open-source platform that combines large language models with traditional survey software to conduct automated qualitative interviews while maintaining data security and reproducibility. Unlike proprietary solutions, the system runs on locally hosted models and enforces standardized question administration, addressing concerns about privacy and scientific rigor in AI-driven research.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

ARVO: Atlas of Reproducible Vulnerabilities for Open-Source Software

Researchers introduce ARVO, a large-scale dataset of over 6,100 reproducible vulnerabilities from open-source software projects, addressing a critical gap in security research by prioritizing reproducibility alongside scale and diversity. The dataset achieves 81% successful vulnerability reproduction and 89.4% patch identification accuracy, enabling automated analysis and direct vulnerability interaction capabilities absent in existing datasets.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

DEMM-Bench: A Cross-Regime Benchmark for Agent-Runtime Governance-Evidence Sufficiency

DEMM-Bench introduces a benchmark framework for evaluating whether evidence records in agent-runtime systems sufficiently answer governance questions about specific decisions. Using the Decision Evidence Maturity Model, researchers tested 64 cases across eight evidence regimes and found that existing baselines overclaim sufficiency in 50-75% of cases, while a property-level scorer achieved 56.25% accuracy with zero overclaims.
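As a quick sanity check on the figures above, the reported accuracy can be converted back into a case count. The sketch below does only that arithmetic; the variable names are illustrative, and it assumes the 56.25% was computed over all 64 cases, which the summary does not state explicitly.

```python
from fractions import Fraction

total_cases = 64                     # cases reported in the summary
accuracy = Fraction("56.25") / 100   # reported property-level scorer accuracy

# Assumption: accuracy is measured over all 64 cases.
correct_cases = accuracy * total_cases
print(correct_cases)
```

Under that assumption the figure corresponds to 36 of the 64 cases.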

AI · Neutral · arXiv – CS AI · Jun 11 · 5/10

From Explicit Elements to Implicit Intent: A Predefined Library for Auditable Behavioral Inference

SemantiClean is a modular framework that extracts semantic signals from e-commerce session data to predict purchase intent and customer behavior while prioritizing auditability and reproducibility over raw predictive accuracy. The system uses a predefined library of 24 behavioral elements organized across four layers and implements safeguards against signal inflation, representing a shift toward transparent, governance-focused AI systems over conventional black-box optimizers.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Preregistration for Experiments with AI Agents

Researchers propose extending preregistration practices from human subjects research to AI agent experiments, addressing methodological vulnerabilities introduced by the ease of iterating on model selection, prompts, and experimental settings. The paper catalogs researcher degrees of freedom that make p-hacking and selective reporting easy to carry out in AI experiments yet difficult to detect, and calls for journals and conferences to adopt standardized preregistration templates.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Small Experiments, Cheaper Decisions: A Case Study in Staged Promotion for Micro-Pretraining

Researchers present a staged-promotion protocol for efficiently screening machine learning configurations during micro-pretraining, using fixed budget increments across heterogeneous hardware to reduce experimental costs while mitigating the risk of selecting configurations that perform well only at tiny scales. The study demonstrates that early-stage rankings are unstable across hardware types, but a frozen promotion rule successfully identified a consistent top performer while reducing total GPU-hours from 432 to 169.2.
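To put the reported saving in relative terms, the two GPU-hour totals can be compared directly. This is a minimal sketch of that arithmetic using only the two figures quoted above; the variable names are illustrative, and it assumes both totals cover the same set of candidate configurations.

```python
baseline_gpu_hours = 432.0  # starting total quoted in the summary
staged_gpu_hours = 169.2    # total quoted for the staged-promotion protocol

# Assumption: both totals refer to the same set of configurations.
saved = baseline_gpu_hours - staged_gpu_hours
reduction = saved / baseline_gpu_hours
print(f"saved {saved:.1f} GPU-hours ({reduction:.1%} reduction)")
```

On these numbers the protocol saves 262.8 GPU-hours, a reduction of roughly 60.8%.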

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

DuoBench: A Reproducible Benchmark for Bimanual Manipulation in Simulation and the Real World

Researchers introduce DuoBench, a comprehensive benchmarking framework for evaluating bimanual robotic manipulation policies on the FR3 Duo platform. The framework includes eleven tasks implemented in simulation and real-world settings, with reproducible recipes and human-teleoperated datasets that reveal significant challenges in current dual-arm AI policies, particularly in coordination and sim-to-real transfer.

AI · Bearish · arXiv – CS AI · Jun 10 · 6/10

A complementary study on PlanGPT: Evaluation with defined Performance Metrics and comparison with a planner

A complementary study of PlanGPT, an LLM-based automated planning system, challenges its effectiveness by re-evaluating its performance against traditional planners using metrics like plan cost and generation time. The research questions whether planning with large language models is truly beneficial, finding that PlanGPT performs no better than basic greedy search strategies.