AI · Bearish · arXiv – CS AI · 3h ago · 7/10
Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure
Researchers demonstrate that single-axis bias mitigations in AI reward models often redirect optimization pressure to correlated biases instead of eliminating it, a failure mode called reward bias substitution. The study proves that successful mitigation, bias substitution, and overcorrection produce identical observable results under standard audit metrics, so current evaluation methods cannot distinguish genuine fixes from problematic redirections.
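A toy sketch may help show why an audit that looks at only the mitigated axis cannot separate a genuine fix from a substitution. Everything here is an assumption made for illustration and is not taken from the study: the linear reward model, the axis names (`quality`, `length`, `sycophancy`), the weights, and the finite-difference audit. Overcorrection is not modeled.

```python
# Toy illustration only: the linear reward models, weights, and axis names
# below are hypothetical and are not the study's models or metrics.
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearReward:
    """Hypothetical reward model: a weighted sum of three response features."""
    quality: float
    length: float
    sycophancy: float

    def score(self, features: dict) -> float:
        return (self.quality * features["quality"]
                + self.length * features["length"]
                + self.sycophancy * features["sycophancy"])


def axis_sensitivity(model: LinearReward, axis: str, eps: float = 1.0) -> float:
    """Finite-difference change in reward when only `axis` is perturbed."""
    base = {"quality": 0.0, "length": 0.0, "sycophancy": 0.0}
    bumped = dict(base, **{axis: eps})
    return (model.score(bumped) - model.score(base)) / eps


def single_axis_audit(model: LinearReward) -> float:
    """Assumed audit: measures bias on the mitigated axis (length) only."""
    return axis_sensitivity(model, "length")


# Before mitigation: reward leaks onto length and, more weakly, sycophancy.
before = LinearReward(quality=1.0, length=0.5, sycophancy=0.2)

# Genuine fix: the length bias is removed and nothing else moves.
genuine_fix = LinearReward(quality=1.0, length=0.0, sycophancy=0.2)

# Substitution: the length bias is removed, but the pressure shifts to a
# correlated axis (sycophancy) that the audit never inspects.
substitution = LinearReward(quality=1.0, length=0.0, sycophancy=0.7)

for name, model in [("before", before),
                    ("genuine_fix", genuine_fix),
                    ("substitution", substitution)]:
    print(name,
          "length audit:", single_axis_audit(model),
          "sycophancy:", axis_sensitivity(model, "sycophancy"))
```

In this sketch the length-only audit reports the same value for `genuine_fix` and `substitution`. The difference appears only when the unaudited `sycophancy` axis is also measured.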