AI · Neutral · arXiv – CS AI · 7h ago · 7/10
Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models
Researchers propose a compute-aware evaluation framework for assessing the adversarial robustness of large language models, measuring attack effort in FLOPs rather than in fixed query budgets. Tests across multiple models and attack strategies show that alignment training has non-monotonic effects on robustness, that scaling weakens gradient-based attacks but not cheaper template-based ones, and that safety measures leave certain harm categories disproportionately accessible.
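
To illustrate why counting FLOPs can rank attacks differently than counting queries, here is a minimal sketch. It is not the paper's accounting method, which the summary does not describe. It assumes the common rule of thumb of roughly 2 FLOPs per parameter per token for a forward pass and roughly 4 more for a backward pass. The names `AttackStep`, `step_flops`, and `attack_flops`, the 1B-parameter model size, and the step counts are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable

# Rule-of-thumb costs (an assumption, not taken from the paper):
# a forward pass costs about 2 FLOPs per parameter per token,
# and a backward pass costs about twice the forward pass.
FORWARD_FLOPS_PER_PARAM_TOKEN = 2
BACKWARD_FLOPS_PER_PARAM_TOKEN = 4


@dataclass(frozen=True)
class AttackStep:
    """One model call made by an attack (hypothetical helper type)."""
    tokens: int           # tokens processed in this call
    needs_gradient: bool  # True when the attack backpropagates through the model


def step_flops(n_params: int, step: AttackStep) -> int:
    """Approximate FLOPs for a single attack step."""
    per_param_token = FORWARD_FLOPS_PER_PARAM_TOKEN
    if step.needs_gradient:
        per_param_token += BACKWARD_FLOPS_PER_PARAM_TOKEN
    return per_param_token * n_params * step.tokens


def attack_flops(n_params: int, steps: Iterable[AttackStep]) -> int:
    """Total approximate FLOPs spent by an attack across all of its steps."""
    return sum(step_flops(n_params, s) for s in steps)


# Two hypothetical attacks with the same query budget (10 calls of 200 tokens)
# against a hypothetical 1B-parameter model.
N_PARAMS = 10**9
template_attack = [AttackStep(tokens=200, needs_gradient=False)] * 10
gradient_attack = [AttackStep(tokens=200, needs_gradient=True)] * 10

template_cost = attack_flops(N_PARAMS, template_attack)
gradient_cost = attack_flops(N_PARAMS, gradient_attack)
print(template_cost, gradient_cost, gradient_cost / template_cost)
```

Under these assumptions both attacks use the same number of queries, yet the gradient-based one spends three times the compute, which is the kind of difference a fixed query budget would hide.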