AI · Bullish · arXiv – CS AI · 18h ago · 7/10
Reward Shaping for (Inference-Time) Alignment: A Stackelberg Game Perspective
Researchers propose a Stackelberg game framework for optimizing reward models in large language model alignment, addressing the suboptimality of standard KL-regularized reward optimization. They show that a simple reward shaping scheme improves inference-time alignment by reducing base policy bias while mitigating the risk of reward hacking, and they report win rates of 66% or higher against baselines.
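For readers unfamiliar with the baseline being criticized, here is a minimal sketch of standard KL-regularized reward optimization over a finite set of candidate responses. It uses the well-known closed form pi*(y) ∝ pi_base(y) · exp(r(y) / beta). It is not the paper's reward shaping scheme, and the function name, the toy probabilities, and the rewards are all hypothetical values chosen for illustration.

```python
import math


def kl_regularized_policy(base_probs, rewards, beta):
    """Closed-form maximizer of E_pi[r] - beta * KL(pi || base) over a finite set.

    Assumes every base probability is strictly positive and beta > 0.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Work in log space and subtract the max logit for numerical stability.
    logits = [math.log(p) + r / beta for p, r in zip(base_probs, rewards)]
    max_logit = max(logits)
    weights = [math.exp(logit - max_logit) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical toy example with three candidate responses.
base_probs = [0.7, 0.2, 0.1]  # the base policy strongly prefers response 0
rewards = [0.0, 1.0, 1.0]     # responses 1 and 2 earn the same reward
policy = kl_regularized_policy(base_probs, rewards, beta=1.0)
```

Responses 1 and 2 receive identical rewards, yet the resulting policy still makes response 1 twice as likely as response 2, because the base policy's 0.2 to 0.1 preference carries through unchanged. This is one simple way to see the kind of base policy bias that the summary says reward shaping is meant to reduce.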