🧠 AI | ⚪ Neutral | Importance 6/10

ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models

arXiv – CS AI | Harry Owiredu-Ashley | March 12, 2026 at 04:00 AM

🤖AI Summary

Researchers developed ADVERSA, an automated red-teaming framework that measures how AI guardrails degrade across multiple conversation turns instead of evaluating only single-prompt attacks. Tests on three frontier models found a 26.7% jailbreak rate, with successful attacks concentrated in early rounds rather than accumulating under sustained pressure.

Key Takeaways

→ADVERSA introduces continuous measurement of AI safety guardrail degradation across multi-turn conversations rather than binary pass/fail evaluations.
→The framework uses a fine-tuned 70B attacker model that does not issue safety refusals of its own, which makes adversarial testing more reliable.
→Tests on Claude Opus, Gemini Pro, and GPT models showed a 26.7% jailbreak success rate, with successful jailbreaks occurring at round 1.25 on average.
→Successful jailbreaks were concentrated in early conversation rounds rather than building up through sustained adversarial pressure.
→The research highlights judge reliability and attacker drift as important factors in evaluating AI safety systems.
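
To make the two headline figures concrete, here is a minimal Python sketch of how a jailbreak rate and an average breakthrough round could be computed from per-conversation outcomes. The function name `summarize` and the outcome list are hypothetical. The summary does not report the number of conversations, so the counts below (4 successes out of 15) are an assumption chosen only because they reproduce the reported 26.7% and 1.25; they are not the paper's data.

```python
from statistics import mean
from typing import Optional, Sequence


def summarize(first_jailbreak_round: Sequence[Optional[int]]) -> tuple[float, Optional[float]]:
    """Return (jailbreak rate, mean round of first jailbreak).

    Each entry is the 1-based round at which a conversation was first
    jailbroken, or None if the guardrails held for the whole conversation.
    """
    if not first_jailbreak_round:
        raise ValueError("need at least one conversation")
    hits = [r for r in first_jailbreak_round if r is not None]
    rate = len(hits) / len(first_jailbreak_round)
    # The mean is only defined over conversations that were actually jailbroken.
    return rate, (mean(hits) if hits else None)


# Hypothetical outcomes (assumption, not the paper's data):
# 4 jailbroken conversations out of 15, three at round 1 and one at round 2.
outcomes = [1, 1, 1, 2] + [None] * 11
rate, avg_round = summarize(outcomes)
print(f"jailbreak rate: {rate:.1%}, mean breakthrough round: {avg_round}")
# prints "jailbreak rate: 26.7%, mean breakthrough round: 1.25"
```

An average this close to 1 means that most successful attacks landed in the first round, which matches the takeaway that jailbreaks were concentrated early rather than built up over many turns.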

Mentioned in AI

Models

GPT-5 (OpenAI)

Claude (Anthropic)

Opus (Anthropic)

Gemini (Google)

Llama (Meta)