The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play
Researchers propose Anchored Bipolicy Self-Play, a safety training method that addresses fundamental limitations of parameter-shared self-play red teaming by training distinct LoRA adapters for the attacker and defender roles. The approach is reported to be 100x more parameter-efficient and to improve safety robustness across multiple language model scales without sacrificing reasoning ability.
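To make the parameter-efficiency idea concrete, the following is a minimal sketch of how the trainable-parameter count of two role-specific LoRA adapters compares with fully fine-tuning a single weight matrix. It relies only on the standard LoRA factorization (a rank-r update B @ A). The layer size, the rank, and all function names are illustrative assumptions, not values or code from the work itself, and the resulting ratio is not meant to reproduce the reported 100x figure.

```python
def full_param_count(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one LoRA adapter on that matrix.

    LoRA learns B (d_out x rank) and A (rank x d_in), so the count is
    rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


def bipolicy_param_count(d_in: int, d_out: int, rank: int, n_roles: int = 2) -> int:
    """Trainable parameters with one adapter per role (attacker, defender).

    Hypothetical helper: assumes both roles use the same rank.
    """
    return n_roles * lora_param_count(d_in, d_out, rank)


if __name__ == "__main__":
    # Assumed sizes for illustration only.
    d_in, d_out, rank = 4096, 4096, 8

    full = full_param_count(d_in, d_out)
    two_adapters = bipolicy_param_count(d_in, d_out, rank)

    print(f"full fine-tuning:  {full} parameters")
    print(f"two LoRA adapters: {two_adapters} parameters")
    print(f"reduction factor:  {full / two_adapters:.0f}x")
```

Under these assumed sizes, the two adapters together train far fewer parameters than a full update of the same matrix; the actual savings depend on the model, the layers adapted, and the rank chosen.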