Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models
Researchers propose a compute-aware evaluation framework for assessing adversarial robustness in large language models, measuring attack effort in FLOPs rather than fixed query budgets. Testing across multiple models and attack strategies reveals that alignment training has non-monotonic effects on robustness, that scaling reduces the success of gradient-based attacks but not of cheaper template-based ones, and that safety measures leave certain harm categories disproportionately accessible.
Current adversarial robustness evaluations of language models mask critical information by reporting attack success rates under fixed computational budgets, obscuring the actual effort required to compromise security. This research addresses a fundamental gap in how the AI safety community quantifies LLM vulnerability, introducing computational pressure measured in FLOPs as a standardized metric that better reflects real-world attacker constraints and economic incentives.
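To make the contrast with fixed query budgets concrete, the following is a minimal sketch of how attack effort could be tallied in FLOPs and turned into a success-versus-compute curve. It is not the paper's released framework: the function names, the `AttackRun` record, and all numbers are hypothetical, and the cost model assumes the common rule of thumb of roughly 2N FLOPs per token for a forward pass and roughly 4N more for a backward pass on a dense N-parameter transformer.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


def attack_step_flops(n_params: float, n_tokens: int, needs_gradient: bool) -> float:
    """Rough FLOPs for one attack step on a dense transformer.

    Assumption: ~2*N FLOPs per token for a forward pass, plus ~4*N per
    token for the backward pass that gradient-based attacks require.
    """
    per_token = 6.0 * n_params if needs_gradient else 2.0 * n_params
    return per_token * n_tokens


@dataclass
class AttackRun:
    """Hypothetical record of one attack attempt against one prompt."""
    flops_to_success: Optional[float]  # None if the attack never succeeded


def success_rate_at_budget(runs: Sequence[AttackRun], budget_flops: float) -> float:
    """Fraction of runs that succeeded using at most `budget_flops`."""
    if not runs:
        return 0.0
    hits = sum(
        1
        for r in runs
        if r.flops_to_success is not None and r.flops_to_success <= budget_flops
    )
    return hits / len(runs)


# Toy illustration with made-up numbers (not results from the paper).
grad_step = attack_step_flops(7e9, 512, needs_gradient=True)
template_query = attack_step_flops(7e9, 512, needs_gradient=False)
runs = [AttackRun(1e14), AttackRun(5e15), AttackRun(None), AttackRun(2e16)]
curve = [(b, success_rate_at_budget(runs, b)) for b in (1e14, 1e16, 1e17)]
```

Under this cost model a single gradient step costs about three times a forward-only query of the same length, so two attacks with identical query counts can represent very different amounts of attacker effort. Reporting success at several FLOPs budgets, as `curve` does, exposes that difference.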
The compute-aware framework reveals several counterintuitive findings about model security. Alignment training, the standard approach to improving safety, produces non-monotonic effects on robustness, suggesting that some safety techniques may create unexpected vulnerabilities. More concerning, while larger models show improved resistance to expensive gradient-based attacks, they remain vulnerable to cheaper template-based approaches, indicating that scaling alone provides incomplete protection. The finding that gradient-based attacks transfer between surrogate and target models shows that attackers can reduce costs by optimizing against surrogate models rather than against the target directly.
For the AI safety and development community, these findings have immediate implications. Compute costs vary across harm categories by up to 5x within a single model, which suggests uneven protection: certain categories remain disproportionately accessible despite safety-aligned RL training. This heterogeneous vulnerability landscape calls for targeted defenses rather than uniform safety approaches.
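As a rough illustration of how such a per-category spread might be summarized, the sketch below takes the median FLOPs to a successful attack in each harm category and reports the ratio between the costliest and cheapest category. The function name, category labels, and numbers are all hypothetical, and the paper may aggregate costs differently.

```python
from statistics import median
from typing import Dict, List


def category_cost_spread(flops_by_category: Dict[str, List[float]]) -> float:
    """Ratio of the costliest to the cheapest category, by median FLOPs to success."""
    medians = {
        cat: median(costs) for cat, costs in flops_by_category.items() if costs
    }
    if not medians:
        raise ValueError("need at least one category with recorded costs")
    return max(medians.values()) / min(medians.values())


# Made-up numbers for illustration only; category names are placeholders.
toy = {
    "category_a": [1e15, 2e15, 3e15],   # median 2e15 FLOPs
    "category_b": [8e15, 1e16, 1.2e16],  # median 1e16 FLOPs
}
spread = category_cost_spread(toy)
```

A spread well above 1 flags the cheap categories as candidates for targeted defenses, since those are the ones where an attacker needs the least compute.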
Moving forward, developers should adopt compute-aware evaluation as a standard practice alongside traditional metrics. The released framework enables more precise risk assessment and better resource allocation for defensive research. Understanding the true computational costs of attacks allows stakeholders to make informed decisions about which vulnerabilities warrant immediate mitigation and where current defenses sufficiently raise attacker costs.
- Compute-aware evaluation using FLOPs provides a more accurate measure of adversarial effort than fixed query budgets
- Alignment training has non-monotonic effects on robustness, sometimes creating unexpected vulnerabilities
- Model scaling reduces the success of gradient-based attacks but provides minimal protection against cheaper template-based attacks
- Gradient-based attacks transfer between models, allowing attackers to reduce computational costs through surrogate optimization
- The compute cost of a successful attack varies by up to 5x across harm categories within a single model, revealing uneven protection across risk types