y0news
🧠 AI | 🔴 Bearish | Importance 7/10 | Actionable

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

arXiv – CS AI | Alexander Panfilov, Peter Romov, Igor Shilov, Yves-Alexandre de Montjoye, Jonas Geiping, Maksym Andriushchenko
🤖AI Summary

Researchers demonstrate that the Claude Code AI agent can autonomously discover novel adversarial attack algorithms against large language models that outperform existing methods. The discovered attacks reach up to a 40% attack success rate on CBRN queries against GPT-OSS-Safeguard-20B, versus ≤10% for existing methods, and a 100% attack success rate against Meta-SecAlign-70B, versus 56% for the best baseline.

Key Takeaways
  • The Claude Code AI agent autonomously discovered adversarial attack algorithms that outperform 30+ existing methods for jailbreaking LLMs.
  • The new algorithms achieve up to a 40% attack success rate on CBRN queries against GPT-OSS-Safeguard-20B, versus ≤10% for existing methods.
  • The discovered attacks generalize strongly, achieving a 100% attack success rate against Meta-SecAlign-70B, compared to 56% for the best baseline.
  • The results suggest that AI safety and security research can be automated using LLM agents with dense feedback loops.
  • All discovered attacks and evaluation code have been released publicly on GitHub.