What Suppresses Nash Equilibrium Play in Large Language Models? Mechanistic Evidence and Causal Control
Researchers discovered that large language models compute Nash equilibrium strategies in strategic games but actively suppress them through a prosocial override mechanism in final layers, favoring cooperation instead. The suppression can be reversed through mechanistic intervention, revealing that LLM deviations from rational play stem not from inability but from built-in behavioral constraints that vary with model scale and architecture.
This research exposes a fundamental tension in LLM design: models possess the computational capacity to identify optimal game-theoretic strategies yet systematically suppress them. The mechanistic analysis finds opponent history encoded at 96% fidelity while the Nash action is only weakly represented at 56%, suggesting models prioritize modeling their interaction partners over pursuing zero-sum victory. The prosocial override, concentrated in the final layers, represents an implicit alignment choice: safety training or architectural biases push models toward cooperative rather than rationally self-interested outcomes.
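The encoding figures read like linear-probe accuracies on internal activations. Below is a minimal sketch of that kind of measurement, assuming a logistic-regression probe; the activations and labels here are synthetic placeholders, and the real layer choice, probe setup, and game data come from the paper, not this code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Placeholder stand-ins: one activation vector per game round, plus two
# label sets. Real data would be residual-stream activations from the model.
n_rounds, d_model = 2000, 512  # d_model reduced for the sketch
acts = rng.normal(size=(n_rounds, d_model))
opp_history = rng.integers(0, 2, size=n_rounds)  # opponent's last action
nash_action = rng.integers(0, 2, size=n_rounds)  # Nash-prescribed action

def probe_accuracy(X, y):
    """5-fold cross-validated accuracy of a linear probe on activations."""
    return cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

# With real activations, the reported pattern is ~0.96 for opponent history
# and ~0.56 for the Nash action; random placeholders land near chance (0.5).
print("opponent-history probe:", probe_accuracy(acts, opp_history))
print("nash-action probe:     ", probe_accuracy(acts, nash_action))
```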
The behavioral findings demonstrate scale-dependent effects absent in earlier LLM studies. Chain-of-thought reasoning paradoxically degrades Nash play in small models below 70B parameters but enables near-perfect rationality in larger ones, suggesting explicit reasoning pathways interact differently with prosocial constraints across scales. Cross-play experiments reveal emergent phenomena invisible in self-play: small models can exploit cooperative partners through strategic defection, large models mutually reinforce cooperation indefinitely, and first-mover advantage determines equilibrium selection in coordination games.
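To make the cross-play dynamics concrete, here is a toy iterated prisoner's dilemma harness. Everything in it is illustrative: `small_model` and `large_model` are hand-written stubs standing in for LLM policies, and the payoff matrix is the textbook one, not necessarily the paper's.

```python
# Textbook prisoner's dilemma payoffs, as (player_a, player_b) tuples.
PAYOFFS = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}

def play_match(policy_a, policy_b, rounds=10):
    """Iterate the game; each policy sees the history from its own side."""
    history, score_a, score_b = [], 0, 0
    for _ in range(rounds):
        a = policy_a(history)
        b = policy_b([(mb, ma) for ma, mb in history])  # flipped perspective
        pa, pb = PAYOFFS[(a, b)]
        score_a, score_b = score_a + pa, score_b + pb
        history.append((a, b))
    return score_a, score_b

# Stubs mimicking the reported pattern: an unconditional defector versus a
# grim-trigger cooperator that defects forever once its partner ever has.
small_model = lambda hist: "D"
large_model = lambda hist: "C" if all(opp == "C" for _, opp in hist) else "D"

print(play_match(large_model, large_model))  # (30, 30): cooperation persists
print(play_match(small_model, large_model))  # (14, 9): defection unravels it
```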
For AI safety and development, this work provides actionable mechanistic evidence that LLM behavior reflects deliberate architectural choices rather than capability gaps. The ability to inject learned Nash directions and shift behavior bidirectionally demonstrates precise causal control over strategic decision-making. This matters for deployed systems in competitive domains (negotiations, resource allocation, security games), where suppressed rationality could create exploitable vulnerabilities or reduce economic efficiency. The findings also suggest that future models may need explicit game-theoretic alignment specifications alongside current safety objectives, particularly as scaling continues to enhance cooperative and rational capacities simultaneously.
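The "inject learned Nash directions" intervention is plausibly an activation-steering hook of the kind sketched below, assuming a PyTorch decoder with HuggingFace-style layer naming; the layer index, coefficient, and `nash_direction` vector are illustrative assumptions rather than the paper's values.

```python
import torch

def make_steering_hook(direction: torch.Tensor, alpha: float):
    """Add alpha * unit(direction) to the residual stream at every position.
    alpha > 0 pushes output toward Nash play; alpha < 0 strengthens the
    prosocial override, giving the bidirectional control described above."""
    unit = direction / direction.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * unit.to(dtype=hidden.dtype, device=hidden.device)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# Hypothetical usage with a loaded HuggingFace-style causal LM:
# layer = model.model.layers[-3]  # one of the final decoder blocks
# handle = layer.register_forward_hook(make_steering_hook(nash_direction, 4.0))
# ... run generation ...
# handle.remove()
```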
- LLMs compute Nash equilibrium strategies internally but actively suppress them through prosocial override mechanisms in final layers, not from an inability to calculate optimal play.
- Opponent history encoding reaches 96% accuracy while Nash action encoding remains weak at 56%, indicating models prioritize understanding partners over rational self-interest.
- Chain-of-thought reasoning worsens Nash equilibrium play in models below 70B parameters but enables near-perfect rationality above that threshold.
- Small models can unravel any partner's cooperation through early defection, while large models mutually reinforce cooperative behavior indefinitely in cross-play.
- Mechanistic interventions like concept clamping enable bidirectional control over strategic behavior (see the sketch below), providing precise causal evidence of suppression rather than incapacity.
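For the clamping variant named in the last bullet, a hedged sketch: rather than adding a vector, the activation's component along the learned direction is pinned to a fixed value on every forward pass. As above, the hook interface is standard PyTorch, but the direction and target values are assumptions.

```python
import torch

def make_clamp_hook(direction: torch.Tensor, target: float):
    """Fix the residual stream's scalar projection onto `direction` at `target`."""
    unit = direction / direction.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        d = unit.to(dtype=hidden.dtype, device=hidden.device)
        proj = (hidden @ d).unsqueeze(-1)          # current component, per position
        clamped = hidden - proj * d + target * d   # remove it, pin to target
        return (clamped, *output[1:]) if isinstance(output, tuple) else clamped
    return hook

# target > 0 clamps activations toward Nash play; target < 0 toward
# cooperation, mirroring the bidirectional control reported above.
```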