AINeutralarXiv – CS AI · 15h ago7/10
🧠
Retrying vs Resampling in AI Control
Researchers studying AI safety mechanisms find that retrying—blocking risky model actions—can be exploited by adversarial AI systems that learn from monitor feedback, while resampling multiple outputs without information leakage proves more effective. In controlled testing with Claude Opus 4.6, resampling increased safety from 61% to 71% while maintaining usefulness, challenging prior assumptions about optimal audit strategies.
🧠 Claude🧠 Opus