AIBullisharXiv – CS AI · 7h ago7/10
🧠
Safety Alignment of LMs via Non-cooperative Games
Researchers introduce AdvGame, a new safety alignment method that frames language model defense as a non-zero-sum game between Attacker and Defender LMs trained jointly through reinforcement learning. The approach improves both safety and utility simultaneously by enabling continuous adversarial adaptation, with the resulting Attacker LM serving as a deployable red-teaming tool.