AI · Bearish · arXiv – CS AI · 9h ago · 7/10
Efficient Adversarial Attacks on High-dimensional Offline Bandits
Researchers demonstrate that offline bandit algorithms, which are used to evaluate machine learning models such as image generators and LLMs, are vulnerable to adversarial attacks on their reward models. In high-dimensional settings, attackers can reach near-perfect success rates with imperceptibly small perturbations to publicly available reward model weights, exposing a critical security gap in AI evaluation systems.
🏢 Hugging Face
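
To give a feel for why dimensionality matters here, the following is a minimal toy sketch and not the method from the paper. It assumes a linear reward model with Gaussian weights, two arms with Gaussian feature vectors, and an attacker who applies the smallest L2 change to the weights that flips which arm scores higher. All function names (`min_flip_perturbation`, `relative_attack_size`, `average_attack_size`) are hypothetical and exist only for this illustration.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def min_flip_perturbation(w, x_best, x_target, margin=1e-3):
    """Smallest L2 change to w that makes x_target outscore x_best by `margin`.

    Toy assumption: the reward is linear, reward(x) = w . x.
    """
    diff = [t - b for t, b in zip(x_target, x_best)]
    gap = -dot(w, diff)  # how far the target arm currently trails
    if gap < 0:
        return [0.0] * len(w)  # target already wins, nothing to do
    # Move w along diff just far enough to close the gap plus the margin.
    scale = (gap + margin) / dot(diff, diff)
    return [scale * d for d in diff]


def relative_attack_size(dim, rng):
    """Return (||delta|| / ||w||, whether the arm ranking flipped)."""
    w = [rng.gauss(0, 1) for _ in range(dim)]
    arms = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(2)]
    scores = [dot(w, x) for x in arms]
    best = 0 if scores[0] >= scores[1] else 1
    target = 1 - best
    delta = min_flip_perturbation(w, arms[best], arms[target])
    w_adv = [a + b for a, b in zip(w, delta)]
    flipped = dot(w_adv, arms[target]) > dot(w_adv, arms[best])
    return norm(delta) / norm(w), flipped


def average_attack_size(dim, trials=20, seed=0):
    """Average relative perturbation needed to flip the winner in `dim` dimensions."""
    rng = random.Random(seed)
    results = [relative_attack_size(dim, rng) for _ in range(trials)]
    assert all(flipped for _, flipped in results)
    return sum(size for size, _ in results) / trials


if __name__ == "__main__":
    for dim in (10, 100, 1000):
        print(dim, average_attack_size(dim))
```

In this Gaussian toy setup, the relative perturbation equals the absolute cosine between the weight vector and the difference of the two arm features (plus a negligible margin term), so it shrinks roughly in proportion to 1/sqrt(dim). That is only a geometric intuition for the high-dimensional claim in the summary; the attack, reward models, and bandit algorithms studied in the paper may differ substantially.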