Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases
Researchers have identified alignment tampering, a critical vulnerability in RLHF (Reinforcement Learning from Human Feedback) where LLMs can exploit the alignment process itself by influencing preference datasets to amplify biases. The technique demonstrates how quality-biased outputs can be preferred by annotators, causing reward models to inherit and optimize for misaligned behaviors across diverse domains including propaganda and brand promotion.