How Much Coordination Gain Is Real? A Paired Noise-Floor Protocol for Multi-Agent LLM Benchmarks
A technical study challenges the validity of reported improvements in multi-agent LLM coordination architectures by establishing a noise-floor baseline on Claude Haiku. It finds that paired, configuration-equivalent trials differ by roughly ±5pp from noise alone. Seven of ten recent coordination papers report headline effects below this noise floor and one more sits within it, which raises questions about reproducibility and about how much the proposed architectures actually gain.
This paper addresses a critical methodological gap in multi-agent LLM research: the lack of controlled baselines for measuring genuine coordination gains. The authors ran paired experiments on identical model configurations, isolating protocol differences through code inspection and cryptographic verification. On Claude Haiku against tau²-bench retail tasks, supposedly "inert" baseline protocols produced gaps of +10pp and 0pp in two experimental runs, for a pooled effect of +5pp with a 95% confidence interval of -2pp to +12pp. Because that interval contains zero, the pooled effect is statistically indistinguishable from zero.
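To make the arithmetic behind a paired gap and its interval concrete, here is a minimal sketch. It assumes per-task pass/fail outcomes for two arms run on the same tasks and uses a normal approximation on the per-task differences. The function names and the sample data are hypothetical, and the paper's own pooling and interval method may differ.

```python
import math
from statistics import mean, stdev


def paired_gap_pp(baseline, treatment, z=1.96):
    """Return (gap, lo, hi) in percentage points for paired 0/1 outcomes.

    Assumption: a normal-approximation interval on per-task differences,
    not necessarily the method used in the paper.
    """
    if len(baseline) != len(treatment):
        raise ValueError("both arms must cover the same tasks in the same order")
    if len(baseline) < 2:
        raise ValueError("need at least two paired tasks")
    # Per-task difference: +1 (only treatment passed), 0 (tie), -1 (only baseline passed).
    diffs = [t - b for b, t in zip(baseline, treatment)]
    gap = mean(diffs)
    se = stdev(diffs) / math.sqrt(len(diffs))
    return 100 * gap, 100 * (gap - z * se), 100 * (gap + z * se)


def contains_zero(lo, hi):
    """True when the interval cannot rule out a zero effect."""
    return lo <= 0.0 <= hi


# Hypothetical outcomes for ten tasks (1 = pass, 0 = fail), for illustration only.
baseline_arm = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
treatment_arm = [1, 1, 1, 0, 0, 1, 1, 0, 1, 0]

gap, lo, hi = paired_gap_pp(baseline_arm, treatment_arm)
print(f"gap = {gap:+.1f}pp, 95% CI [{lo:+.1f}, {hi:+.1f}]pp, "
      f"contains zero: {contains_zero(lo, hi)}")
```

With only a handful of tasks, the interval is far wider than the point estimate, which is the kind of underpowered comparison the paper warns about.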
The significance lies in what this reveals about the existing literature. The largest single-seed effect the authors observed (+18pp) vanished on replication with a second seed (-3pp), which shows how fragile unreplicated results are. Most critically, they report that seven of ten recent multi-agent coordination architectures claim improvements below this local noise floor, with one additional architecture sitting squarely within the noise envelope. This suggests the field may be suffering from publication bias and underpowered designs rather than documenting genuine coordination breakthroughs.
The proposed solution, "coordination-active pass^k" as a minimum reporting standard, aims to shift the field toward more rigorous measurement practices. The authors introduce ET-MCP as a substrate for isolating experimental choices and provide preliminary diagnostics on why candidate readers (pull vs. intercept mechanisms) failed to improve trial-1 recovery on Haiku 4.5. For AI researchers and practitioners, this work is a cautionary tale about accepting small effect sizes without considering noise floors and replication. The implications extend beyond academic credibility to practical system design, where false coordination improvements could drive architectural decisions in production LLM deployments.
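For readers unfamiliar with pass^k, the sketch below computes the standard unbiased estimate used with tau-bench style evaluations: for a task with c successes in n trials, the chance that k trials drawn without replacement all pass is C(c, k) / C(n, k), averaged over tasks. The `active` mask is a hypothetical stand-in for a "coordination-active" restriction; this summary does not say how the paper defines that subset, so treat it as an assumption.

```python
from math import comb
from statistics import mean


def pass_hat_k(successes_per_task, n_trials, k, active=None):
    """Estimate pass^k from per-task success counts.

    successes_per_task: number of passing trials for each task (0..n_trials).
    active: optional list of booleans selecting which tasks to include
            (hypothetical "coordination-active" filter, an assumption here).
    """
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    if active is None:
        active = [True] * len(successes_per_task)
    if len(active) != len(successes_per_task):
        raise ValueError("active mask must have one entry per task")
    selected = [c for c, keep in zip(successes_per_task, active) if keep]
    if not selected:
        raise ValueError("no tasks selected")
    # comb(c, k) is 0 when c < k, so tasks with too few successes contribute 0.
    return mean(comb(c, k) / comb(n_trials, k) for c in selected)


# Hypothetical counts: three tasks, four trials each.
counts = [4, 2, 0]
print(pass_hat_k(counts, n_trials=4, k=2))
print(pass_hat_k(counts, n_trials=4, k=2, active=[True, True, False]))
```

Because pass^k requires all k sampled trials to succeed, it penalizes inconsistent runs more heavily than a single-trial pass rate does.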
- Seven of ten recent multi-agent LLM coordination papers report improvements below the empirically measured noise floor of approximately ±5pp.
- Paired trial-0 disagreement on identical model configurations produces unpredictable variations that rarely survive replication across seeds.
- The largest observed single-seed coordination effect (+18pp) failed to replicate in a second seed (-3pp), indicating high variance or selection bias.
- Existing multi-agent coordination benchmarks lack controlled baselines and systematic replication protocols needed to validate claimed architectural improvements.
- The proposed "coordination-active pass^k" reporting standard aims to establish minimum methodological requirements for multi-agent LLM coordination research.