AI · Neutral · arXiv – CS AI · 7h ago · 6/10
Safe Equilibrium Policy Optimization for Strategic Agent Policies
Researchers propose Safe Equilibrium Policy Optimization (SEPO), a training method intended to keep language model agents from exploiting weaker opponents, colluding on harmful outcomes, or externalizing costs in multi-agent interactions. The technique augments standard reward optimization with penalties for exploitability and collusion risk, and is demonstrated in strategic domains including the Prisoner's Dilemma, auctions, and poker.
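The summary does not give the SEPO objective itself, so the following is only a minimal sketch of what "reward minus penalties" could look like on a one-shot Prisoner's Dilemma. The payoff values, the `exploitability` proxy, the linear penalty form, and the weights `lam_exploit` and `lam_collude` are all illustrative assumptions, not the paper's definitions.

```python
# Illustrative Prisoner's Dilemma payoffs (assumed values, not from the paper).
# Actions: index 0 = cooperate, index 1 = defect.
# PAYOFF[a][b] is the payoff to a player choosing a against an opponent choosing b.
PAYOFF = [[3.0, 0.0],
          [5.0, 1.0]]


def expected_payoff(p_self, p_other):
    """Expected payoff to a player who cooperates with probability p_self
    against an opponent who cooperates with probability p_other."""
    me = [p_self, 1.0 - p_self]
    other = [p_other, 1.0 - p_other]
    return sum(me[a] * other[b] * PAYOFF[a][b]
               for a in range(2) for b in range(2))


def exploitability(p_self):
    """Hypothetical proxy: how much an opponent gains by best-responding
    to our policy instead of mirroring it. The payoff is linear in the
    opponent's mixing probability, so checking pure strategies suffices."""
    best_response = max(expected_payoff(q, p_self) for q in (0.0, 1.0))
    return best_response - expected_payoff(p_self, p_self)


def penalized_objective(reward, exploit, collusion_risk,
                        lam_exploit=1.0, lam_collude=1.0):
    """Assumed linear form: task reward minus weighted penalty terms.
    collusion_risk is supplied by the caller; how it would be estimated
    is not specified in the summary."""
    return reward - lam_exploit * exploit - lam_collude * collusion_risk


always_cooperate = exploitability(1.0)  # 2.0: a defector earns 5 instead of 3
always_defect = exploitability(0.0)     # 0.0: no response beats mutual defection
```

Under these assumptions, an unconditionally cooperative policy is penalized for being exploitable while an always-defect policy is not, which is why a separate collusion or externality term would be needed to rule out other undesirable behavior.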