Beyond Simpson's Paradox: A Cascade of Confounders in AI Agent Pull-Request Co-Authorship
A rigorous analysis of AI coding agents reveals that apparent benefits of human co-authorship in pull requests disappear under proper statistical controls, demonstrating how Simpson's Paradox and confounding variables can mask true causal relationships in AI agent research.
This research exposes fundamental methodological challenges in evaluating AI agent performance through observational data. The initial finding, that human co-authored PRs merge less often than autonomous ones (53.8% vs 79.8%), appears counterintuitive until stratified analysis shows that it results entirely from agent composition bias. Codex dominates the dataset while rarely using co-authorship, which inflates the aggregate merge rate for autonomous PRs.
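The composition effect described above can be reproduced with a toy calculation. The following is a minimal sketch, not the study's data or code: the agent names `agent_a` and `agent_b` and all counts are hypothetical, chosen only so that one agent dominates the pool while rarely co-authoring.

```python
from collections import defaultdict

# Illustrative counts only (NOT from the study):
# (agent, co_authored, n_prs, n_merged)
ROWS = [
    ("agent_a", False, 1000, 850),  # dominant agent, rarely co-authored
    ("agent_a", True, 20, 18),
    ("agent_b", False, 100, 40),    # smaller agent, often co-authored
    ("agent_b", True, 200, 100),
]


def merge_rates(rows, by_agent):
    """Return merge rate per (group, co_authored) key.

    With by_agent=False every row is pooled into one group, which is
    the aggregate comparison; with by_agent=True rates are stratified.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [n_prs, n_merged]
    for agent, co_authored, n_prs, n_merged in rows:
        key = (agent if by_agent else "pooled", co_authored)
        totals[key][0] += n_prs
        totals[key][1] += n_merged
    return {key: merged / n for key, (n, merged) in totals.items()}


pooled = merge_rates(ROWS, by_agent=False)
stratified = merge_rates(ROWS, by_agent=True)

for key, rate in sorted(pooled.items()) + sorted(stratified.items()):
    print(key, f"{rate:.1%}")
```

In this toy data, co-authored PRs merge less often in the pooled comparison even though they merge more often within each agent, because the pooled autonomous group is mostly the high-merge-rate agent.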
The cascade of confounding reveals the deeper issue: within-repository controls shrink Devin's apparent 33.5 percentage point advantage to a statistically insignificant 1.6 points. Copilot's effect collapses in a similar way once PR structure variables are introduced. This pattern suggests that developers tend to co-author PRs with AI agents when tackling genuinely complex problems that naturally face lower merge rates, not that co-authorship itself reduces success.
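A within-repository comparison can be sketched as follows. This is a simplified stand-in for the repository controls mentioned above, since the summary does not state the exact model used: it averages per-repository merge-rate gaps, weighted by PR count. The repository names and counts are hypothetical.

```python
from collections import defaultdict

# Illustrative counts only (NOT from the study):
# (repo, co_authored, n_prs, n_merged)
REPO_ROWS = [
    ("repo_x", True, 80, 72),   # lenient repo, mostly co-authored PRs
    ("repo_x", False, 20, 18),
    ("repo_y", True, 20, 8),    # strict repo, mostly autonomous PRs
    ("repo_y", False, 80, 32),
]


def naive_gap(rows):
    """Co-authored minus autonomous merge rate, ignoring repository."""
    n = {True: 0, False: 0}
    merged = {True: 0, False: 0}
    for _repo, co_authored, n_prs, n_merged in rows:
        n[co_authored] += n_prs
        merged[co_authored] += n_merged
    return merged[True] / n[True] - merged[False] / n[False]


def within_repo_gap(rows):
    """PR-weighted mean of per-repository gaps.

    Repositories lacking either group are skipped, because they
    carry no within-repository contrast.
    """
    by_repo = defaultdict(dict)
    for repo, co_authored, n_prs, n_merged in rows:
        by_repo[repo][co_authored] = (n_prs, n_merged)
    weighted_sum, total_weight = 0.0, 0
    for groups in by_repo.values():
        if True not in groups or False not in groups:
            continue
        (n_co, m_co), (n_solo, m_solo) = groups[True], groups[False]
        if n_co == 0 or n_solo == 0:
            continue
        weight = n_co + n_solo
        weighted_sum += weight * (m_co / n_co - m_solo / n_solo)
        total_weight += weight
    return weighted_sum / total_weight if total_weight else float("nan")


gap_naive = naive_gap(REPO_ROWS)
gap_within = within_repo_gap(REPO_ROWS)
print(f"naive gap: {gap_naive:+.3f}, within-repo gap: {gap_within:+.3f}")
```

In this toy data the naive gap is large only because co-authored PRs cluster in the lenient repository; once each PR is compared with others from the same repository, the gap disappears.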
For the AI development community, this demonstrates that cross-sectional analyses of agent performance can systematically mislead without proper causal reasoning. The findings underscore that agent selection bias, repository complexity, and task difficulty operate as powerful confounders that surface-level statistics cannot distinguish from genuine treatment effects. Organizations evaluating AI coding tools must account for these dynamics when interpreting performance metrics.
Looking forward, researchers should use randomized assignment or instrumental variable approaches where possible, and should always stratify reported statistics by agent type and task characteristics. This work sets out standards for empirical AI evaluation and highlights the risks of drawing product conclusions from aggregate observational patterns.
- Simpson's Paradox obscures the true relationship between co-authorship and PR merge rates across pooled AI agents
- Within-agent analysis reverses conclusions for most agents, with effects vanishing entirely under repository and commit controls
- Agent composition bias drives aggregate findings, as Codex dominates datasets while avoiding co-authorship mechanisms
- Causal claims about AI agent performance require stratification and multivariate controls, not reliance on cross-sectional correlations
- Developers likely co-author with AI agents on inherently difficult tasks, creating selection artifacts rather than causal benefits