Researchers propose extending preregistration practices from human subjects research to AI agent experiments, addressing methodological vulnerabilities that arise because iterating on model selection, prompts, and experimental settings is so easy. The paper catalogs researcher degrees of freedom that make p-hacking and selective reporting easier to carry out in AI experiments and harder to detect, and calls for journals and conferences to adopt standardized preregistration templates.
As large language models become increasingly autonomous in making real-world decisions—from financial transactions to policy recommendations—the scientific rigor of experiments studying their behavior has become critical. This arXiv paper identifies a fundamental credibility problem in an emerging research paradigm: AI agent experiments inherit methodological vulnerabilities from human subjects research but amplify them through technological affordances. The low computational cost of running multiple experimental iterations creates perverse incentives for researchers to exploit degrees of freedom in model selection, prompt engineering, hyperparameter tuning, and outcome-contingent redesign without detection. Unlike human subjects research, where institutional review boards and practical constraints enforce discipline, AI experiments operate in a largely unregulated space with minimal reporting norms.
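To make the incentive concrete, here is a minimal sketch of how the chance of a spurious "significant" result grows with the number of experimental variants tried. It is an illustration, not a calculation from the paper. It assumes every variant tests a true null effect and that the tests are independent, which real prompt or model variants rarely are; the function name and the 0.05 threshold are likewise assumptions.

```python
def familywise_false_positive_rate(num_variants: int, alpha: float = 0.05) -> float:
    """Chance that at least one of `num_variants` null tests yields p < alpha.

    Assumes independent tests (an idealization): each test avoids a false
    positive with probability (1 - alpha), so all of them do with
    probability (1 - alpha) ** num_variants.
    """
    if num_variants < 1:
        raise ValueError("num_variants must be at least 1")
    return 1.0 - (1.0 - alpha) ** num_variants


if __name__ == "__main__":
    # Each "variant" stands for one model / prompt / setting combination tried.
    for k in (1, 5, 20, 100):
        rate = familywise_false_positive_rate(k)
        print(f"{k:>3} variants tried: {rate:.3f}")
```

Under these assumptions, trying 20 variants and reporting only the best one yields a false positive about 64% of the time, even though each individual test nominally holds the rate at 5%. Committing to one variant before running anything is what keeps the nominal rate honest.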
The proposal directly addresses growing concerns about reproducibility and publication bias in AI research. As organizations deploy AI agents in consequential domains—negotiating contracts, allocating resources, making hiring decisions—understanding their actual behavioral patterns matters beyond academic curiosity. The absence of preregistration standards creates asymmetric information advantages for well-funded labs that can afford extensive experimentation and selective reporting. The paper's tailored preregistration template and calls for adoption by major venues represent a practical governance intervention in an undisciplined field.
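As a rough illustration of what a machine-readable preregistration could look like, the sketch below freezes a planned configuration and reports any field that differs in the experiment actually run. This is not the paper's template. The field names, the `Preregistration` class, and the SHA-256 fingerprint are all hypothetical choices, based only on the degrees of freedom the article mentions (model, prompt, hyperparameters, outcome).

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class Preregistration:
    """Hypothetical record of choices fixed before any data is collected."""
    model: str
    prompt_template: str
    temperature: float
    num_trials: int
    primary_outcome: str

    def fingerprint(self) -> str:
        # Hash a canonical JSON form so the plan can be timestamped or
        # published before the experiment and verified afterwards.
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deviations(planned: Preregistration, executed: Preregistration) -> dict:
    """Return {field: (planned_value, executed_value)} for every changed field."""
    before, after = asdict(planned), asdict(executed)
    return {name: (before[name], after[name])
            for name in before if before[name] != after[name]}


if __name__ == "__main__":
    # All values below are placeholders, not settings from the paper.
    planned = Preregistration(
        model="example-model-v1",
        prompt_template="You are a negotiator. Offer: {offer}",
        temperature=0.0,
        num_trials=200,
        primary_outcome="acceptance_rate",
    )
    executed = replace(planned, temperature=0.7)
    print(planned.fingerprint())
    print(deviations(planned, executed))
```

A changed field is not necessarily misconduct. The point of such a record is that deviations are disclosed and justified rather than invisible to readers and reviewers.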
For the broader AI ecosystem, the proposal signals increasing pressure toward research accountability. Adoption could raise barriers to entry for underfunded researchers and potentially slow publication velocity. However, preregistration would help establish the credibility foundations necessary for AI systems to earn institutional trust as decision-makers. Success depends on whether conferences like NeurIPS and top-tier journals enforce these standards, which makes adoption a coordination problem across the research community.
- AI agent experiments introduce new researcher degrees of freedom: low iteration costs make p-hacking and selective reporting easy to carry out, and both remain difficult to detect.
- Preregistration practices from human subjects research should be extended to AI experiments to improve methodological credibility and reproducibility.
- The absence of standardized reporting norms in AI research creates asymmetric advantages for well-resourced labs and incentivizes publication bias.
- Widespread adoption requires coordination among conferences, journals, and funding agencies to establish preregistration as standard practice.
- As AI agents increasingly make consequential decisions on behalf of organizations, rigorous experimental validation of their behavior becomes a governance priority.