Safety Alignment of LMs via Non-cooperative Games
Researchers introduce AdvGame, a new safety alignment method that frames language model defense as a non-zero-sum game between Attacker and Defender LMs trained jointly through reinforcement learning. The approach improves both safety and utility simultaneously by enabling continuous adversarial adaptation, with the resulting Attacker LM serving as a deployable red-teaming tool.
The research addresses a fundamental tension in AI development: making language models both safe and useful. Traditional sequential adversarial training treats safety as a separate fine-tuning phase after model development, potentially creating misaligned incentives and suboptimal defenses. AdvGame reimagines this process as a competitive game where both agents improve iteratively, more closely mimicking real-world adversarial dynamics.
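To make the joint-training idea concrete, here is a minimal toy sketch rather than the AdvGame training procedure itself. It assumes a two-action game with hypothetical payoff tables (`ATTACKER`, `DEFENDER`) that do not sum to zero, and updates both players at every step instead of training one after the other. All names, payoffs, and hyperparameters are illustrative assumptions.

```python
import math

# Hypothetical payoffs (not from the paper). Rows: Attacker action,
# columns: Defender action.
# Attacker actions: 0 = benign prompt, 1 = adversarial prompt
# Defender actions: 0 = comply, 1 = refuse
ATTACKER = [[0.0, 0.0], [1.0, -0.2]]
DEFENDER = [[1.0, -0.5], [-1.0, 0.5]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def is_zero_sum(a, b, tol: float = 1e-12) -> bool:
    """True if the two payoff tables cancel exactly in every cell."""
    return all(abs(a[i][j] + b[i][j]) < tol for i in (0, 1) for j in (0, 1))


def expected(payoff, p: float, q: float) -> float:
    """Expected payoff when Attacker plays action 1 with probability p
    and Defender plays action 1 with probability q."""
    return sum(
        payoff[a][d] * (p if a else 1.0 - p) * (q if d else 1.0 - q)
        for a in (0, 1)
        for d in (0, 1)
    )


def joint_train(steps: int = 200, lr: float = 0.5):
    """Simultaneous gradient ascent: each player improves its own payoff
    against the other's current policy at every step."""
    theta_a, theta_d = 0.0, 0.0  # logits for playing action 1
    for _ in range(steps):
        p, q = sigmoid(theta_a), sigmoid(theta_d)
        # The expected payoff is linear in each player's own probability,
        # so its derivative is the difference between the two pure actions.
        grad_p = expected(ATTACKER, 1.0, q) - expected(ATTACKER, 0.0, q)
        grad_q = expected(DEFENDER, p, 1.0) - expected(DEFENDER, p, 0.0)
        # Chain rule through the sigmoid parameterization.
        theta_a += lr * grad_p * p * (1.0 - p)
        theta_d += lr * grad_q * q * (1.0 - q)
    return sigmoid(theta_a), sigmoid(theta_d)


if __name__ == "__main__":
    print("zero-sum:", is_zero_sum(ATTACKER, DEFENDER))
    print("final (p_attack, q_refuse):", joint_train())
```

In this toy, the Defender is rewarded for complying with benign prompts as well as for refusing adversarial ones, so its payoff is not simply the negative of the Attacker's. That is the sense in which a non-zero-sum framing leaves room for utility alongside safety.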
This approach builds on established game theory and multi-agent reinforcement learning, applying them specifically to LM safety. The reward signal is preference-based: it is derived from pairwise comparisons between responses rather than from point-wise scores, which offers theoretical advantages by reducing vulnerability to reward hacking, a persistent problem in RL-based alignment. The shift toward joint training suggests that safety and helpfulness need not be opposing objectives when the optimization landscape is restructured appropriately.
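The difference between pairwise and point-wise rewards can be illustrated with a small sketch. It assumes a Bradley-Terry model over hypothetical latent judge scores, which is a common choice for preference modeling and not necessarily what AdvGame uses. Each candidate's reward is its average probability of being preferred over the other candidates. One property visible here is that shifting every score by the same constant leaves the rewards unchanged, whereas a point-wise reward would rise by that constant.

```python
import math
from typing import List


def preference_prob(score_a: float, score_b: float) -> float:
    """P(a is preferred over b) under a Bradley-Terry model.
    The latent scores are a modeling assumption for this sketch."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def pairwise_rewards(scores: List[float]) -> List[float]:
    """Reward for each candidate: mean probability of beating every
    other candidate in a pairwise comparison."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two candidates to compare")
    rewards = []
    for i in range(n):
        wins = [
            preference_prob(scores[i], scores[j]) for j in range(n) if j != i
        ]
        rewards.append(sum(wins) / len(wins))
    return rewards


if __name__ == "__main__":
    base = [1.0, 2.0, 3.0]
    shifted = [s + 100.0 for s in base]  # uniform inflation of raw scores
    print(pairwise_rewards(base))
    print(pairwise_rewards(shifted))  # same rewards: only differences matter
```

Because only score differences enter the comparison, the rewards stay bounded between 0 and 1 and cannot be raised by inflating all scores together. This is one simple way a comparison-based signal can be harder to exploit than an absolute score.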
The practical implications extend beyond individual model development. The resulting Attacker LM becomes a generalizable red-teaming tool deployable across arbitrary target models, democratizing adversarial testing capabilities. This has substantial value for organizations lacking sophisticated internal red-teaming infrastructure. For developers and safety researchers, the framework provides a replicable methodology for improving both robustness and capability simultaneously.
The release of open-source code signals confidence in reproducibility and invites community scrutiny and iteration. The method's effectiveness at shifting the safety-utility Pareto frontier suggests this paradigm could influence how future LM development balances competing objectives, potentially accelerating the timeline for deploying safer, more capable systems at scale.
- AdvGame frames safety alignment as a non-zero-sum game between Attacker and Defender LMs to improve safety and usefulness simultaneously.
- Preference-based reward signals from pairwise comparisons reduce reward hacking compared to traditional point-wise scoring methods.
- The trained Attacker LM functions as a deployable general-purpose red-teaming agent for probing arbitrary target models.
- Joint online RL training enables continuous adversarial adaptation, moving the Pareto frontier of safety and utility.
- Open-source release democratizes advanced safety alignment techniques for the broader AI research community.