Turning Bias into Bugs: Bandit-Guided Style Manipulation Attacks on LLM Judges
Researchers demonstrate BITE, a black-box adversarial attack framework that exploits stylistic biases in LLM judges to artificially inflate evaluation scores while preserving semantic meaning. The attack achieves over 65% success rates across diverse LLM judges and tasks, exposing fundamental vulnerabilities in using language models for objective evaluation.
The research reveals a critical vulnerability in the emerging practice of using large language models as automated judges for AI systems. Rather than relying on adversarial examples or prompt injection, this work targets a subtler but equally exploitable weakness: the judges' inherent stylistic preferences, which correlate with higher scores. The BITE framework treats edit selection as a contextual bandit problem and uses a LinUCB policy to systematically discover which meaning-preserving edits (such as verbosity adjustments or sentence restructuring) most effectively raise judge scores. It operates as a black box, with no access to model internals.
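To make the bandit idea concrete, here is a minimal sketch of disjoint LinUCB choosing among style edits. It is not the paper's implementation: the edit names, the two-dimensional context features, the `TRUE_GAIN` table, and the `simulated_judge_gain` function are all hypothetical stand-ins for querying a real judge. Each arm keeps a ridge-regression estimate of the score gain and adds an exploration bonus that shrinks as the arm is tried more often.

```python
import math
import random


class LinUCBArm:
    """Disjoint LinUCB state for one edit type (one bandit arm)."""

    def __init__(self, dim):
        # Inverse of A, where A starts as the identity (ridge regularisation).
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def _A_inv_times(self, v):
        return [sum(row[j] * v[j] for j in range(len(v))) for row in self.A_inv]

    def ucb(self, x, alpha):
        """Estimated reward for context x plus an alpha-scaled exploration bonus."""
        theta = self._A_inv_times(self.b)  # ridge estimate A^{-1} b
        A_inv_x = self._A_inv_times(x)
        mean = sum(t * xi for t, xi in zip(theta, x))
        bonus = alpha * math.sqrt(sum(xi * v for xi, v in zip(x, A_inv_x)))
        return mean + bonus

    def update(self, x, reward):
        """Add one observation: A += x x^T (via Sherman-Morrison), b += reward * x."""
        A_inv_x = self._A_inv_times(x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, A_inv_x))
        for i in range(len(x)):
            for j in range(len(x)):
                self.A_inv[i][j] -= A_inv_x[i] * A_inv_x[j] / denom
        for i in range(len(x)):
            self.b[i] += reward * x[i]


# Hypothetical edit types with the hidden average score gain of a *simulated* judge.
TRUE_GAIN = {"expand_verbosity": 1.0, "restructure_sentences": 0.3, "formalize_tone": 0.1}


def simulated_judge_gain(edit, rng):
    """Stand-in for a black-box judge query: noisy score gain after applying an edit."""
    return TRUE_GAIN[edit] + rng.gauss(0.0, 0.1)


def run_simulation(rounds=300, alpha=1.0, seed=0):
    """Run the bandit loop and return how often each edit type was chosen."""
    rng = random.Random(seed)
    arms = {name: LinUCBArm(dim=2) for name in TRUE_GAIN}
    counts = {name: 0 for name in TRUE_GAIN}
    for _ in range(rounds):
        # Toy context: a bias term plus a normalised response-length feature.
        x = [1.0, rng.random()]
        choice = max(arms, key=lambda name: arms[name].ucb(x, alpha))
        reward = simulated_judge_gain(choice, rng)
        arms[choice].update(x, reward)
        counts[choice] += 1
    return counts


if __name__ == "__main__":
    print(run_simulation())
```

In this toy setup the policy only ever sees the score gain returned by the judge, which mirrors the black-box setting described above. After a short exploration phase it concentrates its queries on whichever edit the simulated judge rewards most.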
This attack pattern reflects a broader concern in AI evaluation infrastructure. As LLM-based judging becomes increasingly central to benchmarking, leaderboard rankings, and research validation, the incentive to game those judges grows. The paper demonstrates the attack on chatbot leaderboards and AI-reviewer benchmarks, showing that it generalizes across multiple domains and judge models and suggesting that the bias is not idiosyncratic to any single system.
The implications extend beyond academic concern. Organizations that rely on LLM judges for comparative evaluation, whether assessing AI models for deployment or ranking research contributions, risk score manipulation that goes unnoticed unless they deploy more sophisticated detection. BITE's edits evade standard style-control defenses, which indicates that simple countermeasures are insufficient.
The research motivates immediate shifts in evaluation methodology. Practitioners should implement ensemble judging approaches, incorporate adversarial robustness testing into judge selection, and develop detection mechanisms specifically tuned to stylistic manipulation patterns rather than surface-level style metrics.
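As an illustration of the ensemble recommendation, the sketch below aggregates several judges with a median and reports how much they disagree. The three judge functions are hypothetical stubs, not real model calls, and the approach is an assumption about what ensemble judging could look like rather than a defense evaluated in the paper. A median only helps when the judges do not share the same stylistic bias, so it should be read as a partial mitigation.

```python
from statistics import median


def ensemble_score(response, judges):
    """Median of per-judge scores, so a single inflated judge cannot dominate."""
    return median(judge(response) for judge in judges)


def score_spread(response, judges):
    """Gap between highest and lowest judge score; a large gap may merit review."""
    scores = [judge(response) for judge in judges]
    return max(scores) - min(scores)


# Hypothetical stub judges on a 9-point scale; one of them rewards long answers.
def verbosity_biased_judge(response):
    return 7 if len(response.split()) > 20 else 5


def steady_judge_a(response):
    return 5


def steady_judge_b(response):
    return 4


JUDGES = [verbosity_biased_judge, steady_judge_a, steady_judge_b]

if __name__ == "__main__":
    short_answer = "The cache is invalidated on every write."
    padded_answer = short_answer + " " + " ".join(["indeed"] * 25)
    for answer in (short_answer, padded_answer):
        print(ensemble_score(answer, JUDGES), score_spread(answer, JUDGES))
```

Here the padded answer inflates only the biased stub's score, so the median stays put while the spread widens. That disagreement signal is one possible trigger for closer inspection.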
- BITE achieves an attack success rate above 65% by exploiting stylistic biases in LLM judges, without model access or gradient information
- The framework raises evaluation scores by 1-2 points on 9-point scales while preserving semantic meaning
- Current style-control and detection baselines fail to catch BITE attacks, indicating that existing defenses are insufficient
- LLM-based evaluation systems used in leaderboards and benchmarks face a real risk of adversarial exploitation
- Organizations deploying LLM judges should implement ensemble approaches and adversarial robustness testing