Efficient Adversarial Attacks on High-dimensional Offline Bandits
Researchers demonstrate that offline bandit algorithms—used to evaluate machine learning models like image generators and LLMs—are vulnerable to adversarial attacks on their reward models. The study reveals that in high-dimensional settings, attackers can achieve near-perfect success rates with imperceptibly small perturbations to publicly available reward model weights, creating a critical security gap in AI evaluation systems.
This research exposes a fundamental vulnerability in how machine learning systems are currently evaluated at scale. Bandit algorithms have become the de facto standard for efficiently ranking candidates without exhaustive testing, particularly for expensive-to-evaluate generative models. The shift toward offline evaluation using logged data and public reward models—distributed freely on platforms like Hugging Face—has accelerated adoption but introduced an overlooked attack surface: the reward model itself can be weaponized before the bandit ever runs.
The paper's main theoretical result is counterintuitive: in high-dimensional settings, the attack becomes easier, not harder, as input dimensionality increases. This directly threatens modern image and language model evaluation, where inputs routinely span millions of dimensions. An attacker with access to publicly available reward model weights can craft minimal perturbations that completely hijack which candidates the bandit selects as optimal.
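To build intuition for why higher dimensionality can help an attacker, here is a toy sketch. It is not the paper's construction. It assumes a linear reward model `r(x) = w @ x`, random Gaussian inputs, and an attacker who changes `w` by the smallest L2 amount needed to swap the ranking of two candidates. The function names and the `margin` parameter are hypothetical, chosen only for this illustration.

```python
import numpy as np


def min_flip_perturbation(w, x_best, x_target, margin=1e-6):
    """Smallest L2 change to w that ranks x_target above x_best.

    Toy setting (an assumption, not the paper's construction):
    the reward model is linear, r(x) = w @ x.
    """
    diff = x_best - x_target
    gap = float(w @ diff)          # current reward advantage of x_best
    if gap <= 0:
        return np.zeros_like(w)    # x_target already ranks at least as high
    # Move w against diff just far enough to push the gap slightly below zero.
    return -(gap * (1.0 + margin) / float(diff @ diff)) * diff


def mean_relative_size(d, trials=200, seed=0):
    """Average ||delta|| / ||w|| over random unit-norm w and Gaussian inputs."""
    rng = np.random.default_rng(seed)
    sizes = []
    for _ in range(trials):
        w = rng.standard_normal(d)
        w /= np.linalg.norm(w)
        x_a, x_b = rng.standard_normal(d), rng.standard_normal(d)
        if w @ x_a < w @ x_b:      # make x_a the currently preferred candidate
            x_a, x_b = x_b, x_a
        delta = min_flip_perturbation(w, x_a, x_b)
        sizes.append(np.linalg.norm(delta) / np.linalg.norm(w))
    return float(np.mean(sizes))


if __name__ == "__main__":
    for d in (10, 1_000, 100_000):
        print(d, mean_relative_size(d))
```

In this toy model the reward gap between two candidates stays roughly constant as `d` grows, while the distance between their inputs grows like the square root of `d`. The relative weight change needed to flip the ranking therefore shrinks roughly as `1 / sqrt(d)`. Real reward models are nonlinear, so this only illustrates the direction of the effect.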
For the AI development community, this has immediate practical implications. Companies relying on bandit-based evaluation for model selection face supply-chain risks from compromised reward models. The vulnerability extends to any system where evaluation feedback determines production decisions—from image generation quality assessments to language model safety scoring. Unlike adversarial attacks on training data, poisoning the reward model leaves no obvious traces in logs or data artifacts.
Defensive strategies require at least one of the following: cryptographic verification of reward model integrity, diversification across multiple independent evaluators, or architectural changes to the bandit algorithms themselves. The research demonstrates that naive defenses fail; organizations must implement targeted countermeasures. As AI evaluation infrastructure becomes increasingly centralized and automated, this vulnerability could affect deployment decisions across the entire ecosystem.
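One simple form of integrity verification is to compare a downloaded weights file against a digest obtained through a trusted channel. The sketch below uses only the Python standard library. The helper names and the idea of pinning a SHA-256 digest are assumptions made for illustration; the source does not describe a specific verification scheme.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_reward_model(path, expected_sha256):
    """Return True only if the file's digest matches the pinned (trusted) digest.

    `expected_sha256` is assumed to come from a source the attacker cannot
    modify, for example a digest recorded when the model was first audited.
    """
    actual = sha256_of_file(path)
    return hmac.compare_digest(actual, expected_sha256.strip().lower())
```

A check like this detects any change to the file after the digest was recorded, however small the perturbation. It does not help if the weights were already tampered with when the digest was taken.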
- Adversarial attacks on public reward models can manipulate offline bandit evaluation with imperceptible perturbations, achieving near-perfect attack success rates.
- High-dimensional inputs create a counterintuitive vulnerability where attack difficulty decreases as dimensionality increases, making image and text evaluation especially susceptible.
- Reward models distributed publicly on platforms like Hugging Face represent an exploitable attack surface that existing security practices don't address.
- Compromised reward models can silently redirect AI system selection decisions without detectable traces in training data or evaluation logs.
- Organizations using bandit algorithms for model evaluation need cryptographic verification or multi-evaluator redundancy to defend against this threat class.
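Multi-evaluator redundancy can be sketched as aggregating scores from several independent reward models with a statistic that a single compromised model cannot dominate. The median used here is an assumption made for illustration. The source does not say which aggregation rule, if any, holds up against the attack, and `robust_ranking` is a hypothetical helper.

```python
import numpy as np


def robust_ranking(score_matrix):
    """Rank candidates by the per-candidate median across evaluators.

    score_matrix has shape (n_evaluators, n_candidates). With an odd number
    of evaluators, the median stays within the range of the honest scores
    as long as fewer than half of the evaluators are compromised.
    """
    scores = np.asarray(score_matrix, dtype=float)
    consensus = np.median(scores, axis=0)
    order = np.argsort(-consensus, kind="stable")  # best candidate first
    return order, consensus


# Hypothetical scores: the third evaluator has been tampered with to
# promote candidate 2 and demote candidate 0.
scores = [
    [0.9, 0.5, 0.1],
    [0.8, 0.6, 0.2],
    [0.1, 0.5, 5.0],
]
order, consensus = robust_ranking(scores)
mean_winner = int(np.argmax(np.mean(scores, axis=0)))
print("median ranking:", order.tolist(), "mean-based winner:", mean_winner)
```

In this example, averaging would select the promoted candidate, while the median keeps the ranking that the two untampered evaluators agree on. This only helps when the evaluators really are independent and a majority are uncompromised.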