Learning to Attack and Defend: Adaptive Red Teaming of Language Models via GRPO
Researchers introduce AdvGRPO, a co-training framework that enables stable joint optimization of AI attack and defense systems using reinforcement learning. The method produces transferable adversarial attacks while improving defender robustness on safety benchmarks, advancing the field of AI red teaming.
The research addresses a critical bottleneck in AI safety: the instability of GRPO (Group Relative Policy Optimization) when applied to adversarial co-training scenarios. Prior work demonstrated that PPO and DPO could effectively train attacker-defender pairs, but GRPO consistently underperformed despite theoretical advantages. AdvGRPO resolves this by implementing dense multi-channel rewards and decoupled advantage normalization, enabling simultaneous evolution of both attack and defense capabilities.
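To make the normalization idea concrete, here is a minimal sketch of one plausible reading of "decoupled advantage normalization". The summary does not give AdvGRPO's exact formulation, so everything here is an assumption for illustration: the channel names (`attack_success`, `fluency`), the weights, and the function names are hypothetical. The sketch contrasts the usual GRPO-style recipe (sum the rewards, then normalize once within the group) with normalizing each reward channel within the group before combining.

```python
from statistics import mean, pstdev

def group_normalize(values, eps=1e-8):
    """GRPO-style group-relative advantage: (r - group mean) / (group std + eps)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

def coupled_advantages(channel_rewards, weights):
    """Baseline: collapse all channels into one scalar per rollout, normalize once."""
    n = len(next(iter(channel_rewards.values())))
    totals = [
        sum(weights[c] * channel_rewards[c][i] for c in channel_rewards)
        for i in range(n)
    ]
    return group_normalize(totals)

def decoupled_advantages(channel_rewards, weights):
    """Assumed variant: normalize each channel within the group, then combine."""
    n = len(next(iter(channel_rewards.values())))
    normalized = {c: group_normalize(r) for c, r in channel_rewards.items()}
    return [
        sum(weights[c] * normalized[c][i] for c in normalized)
        for i in range(n)
    ]

# Hypothetical group of four rollouts with two reward channels on different scales.
rewards = {
    "attack_success": [1.0, 0.0, 0.0, 1.0],  # binary signal
    "fluency": [8.0, 2.0, 8.0, 2.0],         # 0-10 score, much larger scale
}
weights = {"attack_success": 1.0, "fluency": 1.0}

coupled = coupled_advantages(rewards, weights)
decoupled = decoupled_advantages(rewards, weights)
print("coupled:  ", [round(a, 3) for a in coupled])
print("decoupled:", [round(a, 3) for a in decoupled])
```

In this toy group, the coupled version lets the larger-scale fluency score dominate: the third rollout (failed attack, fluent text) gets a positive advantage while the fourth (successful attack, less fluent) gets a negative one. With per-channel normalization, both of those rollouts end up near zero and only the rollout that does well on both channels is strongly reinforced.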
This work sits within the broader ecosystem of AI robustness research, where red teaming has emerged as a crucial methodology for identifying vulnerabilities before deployment. As language models are integrated into more critical applications, the ability to systematically discover novel attack vectors and train defenses in tandem becomes increasingly valuable. The curriculum, which progresses from single-turn to multi-turn closed-loop attacks, mirrors human security testing methodologies and likely improves the realism and transferability of discovered vulnerabilities.
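The curriculum described above can be sketched as a turn budget that grows during training, combined with a rollout loop in which the attacker sees the defender's earlier replies. This is only an illustration under stated assumptions: the stage boundaries, turn budgets, and the `attacker`/`defender` callables are hypothetical stand-ins, and "closed-loop" is interpreted here as the attacker conditioning each new prompt on the conversation so far.

```python
from typing import Callable, List, Sequence, Tuple

Turn = Tuple[str, str]  # (attacker prompt, defender reply)

def max_turns_for_step(
    step: int,
    stage_boundaries: Sequence[int] = (1000, 3000),  # hypothetical step thresholds
    stage_turns: Sequence[int] = (1, 2, 4),          # hypothetical turn budgets
) -> int:
    """Return the turn budget for a training step; it grows at each boundary."""
    stage = sum(step >= b for b in stage_boundaries)
    return stage_turns[stage]

def closed_loop_rollout(
    attacker: Callable[[str, List[Turn]], str],
    defender: Callable[[str, List[Turn]], str],
    goal: str,
    max_turns: int,
) -> List[Turn]:
    """Run one conversation; each attacker prompt is conditioned on the history."""
    history: List[Turn] = []
    for _ in range(max_turns):
        prompt = attacker(goal, history)
        reply = defender(prompt, history)
        history.append((prompt, reply))
    return history

# Stub policies standing in for the trained models (assumptions for the demo).
def stub_attacker(goal: str, history: List[Turn]) -> str:
    return f"[turn {len(history) + 1}] {goal}"

def stub_defender(prompt: str, history: List[Turn]) -> str:
    return "I can't help with that."

early = closed_loop_rollout(stub_attacker, stub_defender, "test goal", max_turns_for_step(0))
late = closed_loop_rollout(stub_attacker, stub_defender, "test goal", max_turns_for_step(5000))
print(len(early), len(late))
```

Early in training the budget is a single turn, so the rollout reduces to a one-shot attack; later stages reuse the same loop with a larger budget.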
The practical implications extend to developers and organizations deploying large language models at scale. More effective red teaming frameworks translate directly into safer AI systems with fewer failure modes. The demonstrated transferability of attacks suggests the framework discovers fundamental vulnerabilities rather than artifacts of specific training conditions, which is relevant to industry standards for AI safety certification and benchmarking.
Future research should examine whether AdvGRPO scales to larger model sizes and more complex attack spaces. The stability improvements may open pathways for continuous adversarial training during production deployment, creating adaptive defense mechanisms that evolve alongside emerging threats.
- AdvGRPO stabilizes GRPO for joint attacker-defender training through dense multi-channel rewards and decoupled advantage normalization.
- The curriculum-based approach progresses from single-turn to closed-loop multi-turn attacks, improving realism and attack transferability.
- Co-trained defenders significantly outperform baseline models on established safety benchmarks.
- The framework generates highly transferable attacks, suggesting discovery of fundamental vulnerabilities rather than training artifacts.
- Results advance practical AI red teaming methodologies applicable to production language model safety validation.