LLM Consortium for Software Design Refinement: A Controlled Experiment on Multi-Agent Collaboration Topologies
Researchers conducted a controlled experiment evaluating 12 multi-agent LLM collaboration topologies for software design, running 520 tests across 8 tasks. Structural adversarial prompting ranked first, cross-model review second, while parallel merge approaches performed poorly due to token limitations and design fragmentation issues.
This research addresses a critical gap in understanding how multiple large language models collaborate effectively on complex tasks. The controlled factorial design, which manipulates Authority, Roles, and Dynamics across 520 experimental runs, is rigorous empirical work in an area often dominated by anecdotal observations. The finding that the structural adversarial topology (v4b) outperforms cooperative approaches challenges conventional assumptions about LLM collaboration. It suggests that structured disagreement and mandatory design rewrites produce higher-quality software architecture outputs than consensus-seeking methods.
The cross-model review topology's consistent second-place ranking across three independent evaluators (GPT-OSS 120B, Claude Opus, Claude Sonnet) points toward a practical framework for production systems: leverage one model's generative strength while employing a different model's critical lens. The heterogeneous evaluation approach yields an insight of its own. The sharp disagreement between Claude and GPT-OSS on topology v2b (d=1.44 vs d=0.45) suggests that different model families weight design qualities differently, a finding with implications for practitioners selecting evaluation frameworks.
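To make the effect sizes above concrete, here is a minimal sketch of how a Cohen's d value can be computed from two sets of rubric scores. The summary does not state which variant of d the researchers used or what the comparison baseline was, so the pooled-standard-deviation formula, the function name `cohens_d`, and the score lists below are all assumptions for illustration only.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treatment, baseline):
    """Cohen's d using a pooled sample standard deviation (assumed variant)."""
    n1, n2 = len(treatment), len(baseline)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n1 - 1) * variance(treatment) + (n2 - 1) * variance(baseline)
    ) / (n1 + n2 - 2)
    return (mean(treatment) - mean(baseline)) / sqrt(pooled_var)


# Hypothetical rubric scores from one evaluator; not data from the study.
topology_scores = [4.5, 4.0, 4.8, 4.2, 4.6]
baseline_scores = [3.9, 3.6, 4.1, 3.8, 4.0]

d = cohens_d(topology_scores, baseline_scores)
print(f"Cohen's d = {d:.2f}")
```

Running the same function on each evaluator's scores separately is one way a gap such as d=1.44 vs d=0.45 could surface: the same topology looks like a large improvement to one judge and a small one to another.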
The complete failure of parallel merge strategies due to "token starvation and the Frankenstein effect" provides direct guidance against naive parallelization approaches. These findings matter for AI engineers building multi-agent systems, as they suggest topology selection dramatically impacts output quality: the rubric gap of roughly 1.0 point (4.637 vs 3.65, a difference of 0.987) between the top and bottom approaches represents a substantial practical difference in software design quality. The weighted ensemble methodology (2×Opus + 2×Sonnet + 1×GPT-OSS) offers a replicable evaluation standard for future multi-agent work.
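The weighting scheme and the score gap can be sketched in a few lines. This assumes the ensemble is a weighted arithmetic mean normalized by the total weight, which the summary does not confirm; the function name `ensemble_score`, the dictionary keys, and the per-evaluator example scores are hypothetical. Only the weights (2, 2, 1) and the two rubric scores (4.637 and 3.65) come from the text.

```python
# Evaluator weights as reported: 2x Opus, 2x Sonnet, 1x GPT-OSS.
WEIGHTS = {"opus": 2, "sonnet": 2, "gpt_oss": 1}


def ensemble_score(scores, weights=WEIGHTS):
    """Weighted mean of per-evaluator scores (normalization is an assumption)."""
    total_weight = sum(weights[name] for name in scores)
    weighted_sum = sum(weights[name] * score for name, score in scores.items())
    return weighted_sum / total_weight


# Hypothetical per-evaluator scores for one design; not data from the study.
example = {"opus": 4.5, "sonnet": 4.7, "gpt_oss": 4.0}
combined = ensemble_score(example)

# Gap between the top and bottom topology scores quoted in the text.
top, bottom = 4.637, 3.65
gap = top - bottom          # absolute gap in rubric points
relative_gap = gap / top    # bottom score's shortfall relative to the top

print(f"ensemble score: {combined:.2f}")
print(f"gap: {gap:.3f} points ({relative_gap:.1%} below the top score)")
```

Under this weighting the two Claude evaluators together carry four fifths of the final score, so a disagreement between model families shifts the ensemble result mostly toward the Claude judgments.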
- Structural adversarial topologies with mandatory rewrites score roughly 1.0 point higher than the lowest-ranked approaches on a 5-point rubric (4.637 vs 3.65)
- Cross-model review (generate with one LLM, critique with another) ranks consistently in the top two across all three evaluators
- Parallel merge approaches score roughly 20% lower than the top-ranked topology due to token constraints and design fragmentation
- Different LLM families weight design qualities differently, which argues for multi-model evaluation strategies
- Topology selection is more important than individual model choice for multi-agent software architecture tasks