AI · Bearish · arXiv – CS AI · 15h ago · 7/10
Turning Bias into Bugs: Bandit-Guided Style Manipulation Attacks on LLM Judges
Researchers demonstrate BITE, a black-box adversarial attack framework that exploits stylistic biases in LLM judges to artificially inflate evaluation scores while preserving semantic meaning. The attack achieves success rates above 65% across diverse LLM judges and tasks, exposing fundamental vulnerabilities in the use of language models as objective evaluators.