Evolutionary Dynamics of Cooperation in Next-Generation LLM Agent Systems: A Cross-Provider Empirical Extension
Researchers extended a benchmark study on LLM agent cooperation across four frontier models (Claude Sonnet 4.6, Gemini 2.5 Flash, Gemini 3.1 Pro, GPT-5.4 Mini) using game theory simulations. While cooperative bias persists across providers, substantial divergence exists—Gemini models lean aggressive while GPT-5.4 Mini favors cooperation—suggesting provider identity, not model scale, drives equilibrium behavior.
This empirical study tests whether newer, larger LLMs show different cooperative tendencies in competitive multi-agent environments than their predecessors. The researchers applied the Iterated Prisoner's Dilemma framework to four 2025-2026 frontier models under varied prompting strategies and population compositions, and found that cooperative bias persists but varies meaningfully by provider rather than by model generation.
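For readers unfamiliar with the Iterated Prisoner's Dilemma, the following is a minimal sketch of a single match between two fixed strategies. It is an illustration only, not the study's harness: the payoff values (T=5, R=3, P=1, S=0) are the conventional textbook ones and are assumed here, and the names `play_match`, `tit_for_tat`, and `always_defect` are hypothetical stand-ins for whatever strategies the LLM agents actually produce.

```python
from typing import Callable, List, Tuple

# Conventional Prisoner's Dilemma payoffs (assumed; not taken from the study).
# Key: (my move, opponent move) -> (my payoff, opponent payoff)
PAYOFFS = {
    ("C", "C"): (3, 3),  # mutual cooperation
    ("C", "D"): (0, 5),  # I am exploited
    ("D", "C"): (5, 0),  # I exploit
    ("D", "D"): (1, 1),  # mutual defection
}

# Each history entry is (own move, opponent move) for one past round.
History = List[Tuple[str, str]]
Strategy = Callable[[History], str]


def tit_for_tat(history: History) -> str:
    """Cooperate first, then copy the opponent's previous move."""
    return "C" if not history else history[-1][1]


def always_defect(history: History) -> str:
    """Defect unconditionally."""
    return "D"


def play_match(a: Strategy, b: Strategy, rounds: int = 10) -> Tuple[int, int]:
    """Play a fixed number of rounds and return the total payoffs (a, b)."""
    hist_a: History = []
    hist_b: History = []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a, move_b = a(hist_a), b(hist_b)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        # Each player stores the round from its own point of view.
        hist_a.append((move_a, move_b))
        hist_b.append((move_b, move_a))
    return score_a, score_b


if __name__ == "__main__":
    print(play_match(tit_for_tat, tit_for_tat))
    print(play_match(tit_for_tat, always_defect))
```

In a setup like the one described, strategies of this kind would be generated by the models under each prompting condition and then matched against one another across a population, rather than hand-written as above.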
The research builds on earlier work by Willis et al. that documented cooperative biases in ChatGPT-4o and Claude 3.5 Sonnet. The extension indicates that scale alone does not predict behavior; instead, architectural and training differences between providers (OpenAI, Google, Anthropic) appear to substantially reshape equilibrium outcomes. Under biased conditions, Gemini 2.5 Flash reaches 77% aggressive equilibria, while GPT-5.4 Mini achieves 70% cooperative equilibria under refined prompting, which suggests that provider design choices have outsized influence.
For developers building multi-agent LLM systems, these findings carry practical weight. Organizations cannot assume that newer or larger models will behave more predictably or cooperatively in competitive settings. The Self-Refine prompting approach consistently improved cooperative behavior (measured by Iterated Choice Defection scores), suggesting that prompt engineering can partially compensate for provider-level differences. However, noise robustness remains problematic across all models—even Claude Sonnet 4.6 shows approximately 6 percentage points of sensitivity to stochastic perturbations.
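One way to picture noise sensitivity is as the drop in cooperation rate, in percentage points, when each intended move is flipped with some probability. The sketch below takes that reading as an assumption; the summary does not say how the study defines or measures noise. The functions `cooperation_rate` and `sensitivity_pp` are hypothetical, and tit-for-tat is used only as a stand-in for an LLM-generated strategy.

```python
import random
from typing import List, Tuple

# Each history entry is (own move, opponent move) for one past round.
History = List[Tuple[str, str]]


def tit_for_tat(history: History) -> str:
    """Cooperate first, then copy the opponent's previous executed move."""
    return "C" if not history else history[-1][1]


def flip(move: str) -> str:
    """Return the opposite move."""
    return "D" if move == "C" else "C"


def cooperation_rate(noise: float, rounds: int = 200, seed: int = 0) -> float:
    """Fraction of executed moves that are 'C' when two tit-for-tat
    players meet and each intended move is flipped with probability `noise`
    (an assumed execution-error model)."""
    rng = random.Random(seed)  # local generator keeps runs reproducible
    hist_a: History = []
    hist_b: History = []
    cooperative_moves = 0
    for _ in range(rounds):
        move_a = tit_for_tat(hist_a)
        move_b = tit_for_tat(hist_b)
        if rng.random() < noise:
            move_a = flip(move_a)
        if rng.random() < noise:
            move_b = flip(move_b)
        cooperative_moves += (move_a == "C") + (move_b == "C")
        # Histories record the executed (possibly flipped) moves.
        hist_a.append((move_a, move_b))
        hist_b.append((move_b, move_a))
    return cooperative_moves / (2 * rounds)


def sensitivity_pp(noise: float, rounds: int = 200, seed: int = 0) -> float:
    """Drop in cooperation rate, in percentage points, relative to no noise."""
    baseline = cooperation_rate(0.0, rounds, seed)
    noisy = cooperation_rate(noise, rounds, seed)
    return 100 * (baseline - noisy)


if __name__ == "__main__":
    for p in (0.0, 0.01, 0.05, 0.1):
        print(p, round(sensitivity_pp(p), 1))
```

Under this reading, a figure such as "approximately 6 percentage points" would correspond to the value returned by `sensitivity_pp` at the study's noise level, which the summary does not specify.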
Looking ahead, this work highlights the need for continued benchmarking as frontier models evolve. The finding that provider identity outweighs model generation suggests that competitive multi-agent deployment decisions should prioritize empirical testing over model size assumptions.
- Provider identity, not model generation or scale, is the strongest predictor of cooperative vs. aggressive equilibrium behavior in multi-agent LLM systems.
- Self-Refine prompting consistently elevates cooperation across all tested models, with Claude Sonnet 4.6 achieving the highest cooperation score (0.913 ICD) in the dataset.
- Gemini models show substantially higher propensity for aggressive equilibria (up to 77%) compared to GPT and Claude variants under adversarial conditions.
- Noise robustness remains a universal weakness: even newer models exhibit meaningful sensitivity to stochastic perturbations in competitive environments.
- Cooperative bias persists in next-generation LLMs but with sufficient cross-provider variance to require empirical validation before deployment in sensitive multi-agent applications.