Debiasing Reward Models via Causally Motivated Inference-Time Intervention
Researchers propose a causally motivated method for reducing biases in the reward models used for LLM alignment: it identifies neurons whose activations correlate with spurious features, such as response length, and suppresses them at inference time. The technique reaches performance comparable to that of much larger models while editing fewer than 2% of neurons, and the results suggest that these biases are concentrated in early network layers.
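
To make the identify-and-suppress idea concrete, here is a minimal toy sketch. It is not the authors' procedure: it ranks neurons by plain Pearson correlation with response length rather than by any causal criterion, and it suppresses them by zeroing, which is an assumption. The function names (`pearson`, `select_biased_neurons`, `suppress`), the synthetic activations, and the choice of neurons 7 and 42 as length-tracking units are all hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def select_biased_neurons(activations, spurious, fraction=0.02):
    """Return sorted indices of the neurons most correlated with the spurious feature.

    activations: list of rows, one row of neuron activations per response.
    spurious: one value of the spurious feature (here, length) per response.
    fraction: share of neurons to flag (assumed budget, mirroring the "<2%" figure).
    """
    n_neurons = len(activations[0])
    scores = []
    for j in range(n_neurons):
        column = [row[j] for row in activations]
        scores.append(abs(pearson(column, spurious)))
    k = max(1, round(fraction * n_neurons))
    ranked = sorted(range(n_neurons), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def suppress(activations, neuron_idx):
    """Return a copy of the activations with the flagged neurons set to zero."""
    blocked = set(neuron_idx)
    return [[0.0 if j in blocked else a for j, a in enumerate(row)]
            for row in activations]


# Synthetic data: 200 responses, 100 neurons, two of which track response length.
random.seed(0)
lengths = [float(random.randint(10, 500)) for _ in range(200)]
acts = [[random.gauss(0.0, 1.0) for _ in range(100)] for _ in range(200)]
for row, length in zip(acts, lengths):
    row[7] += 0.05 * length    # hypothetical neuron that grows with length
    row[42] -= 0.05 * length   # hypothetical neuron that shrinks with length

biased = select_biased_neurons(acts, lengths, fraction=0.02)
edited = suppress(acts, biased)
```

In a real reward model the intervention would be applied to hidden activations during the forward pass rather than to a stored matrix, and the selection criterion would follow the paper's causal analysis; the sketch only illustrates how a small neuron budget can target a length-correlated signal while leaving other activations untouched.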