Pedagogical Safety in Educational Reinforcement Learning: Formalizing and Detecting Reward Hacking in AI Tutoring Systems
Researchers developed a four-layer pedagogical safety framework for AI tutoring systems and introduced the Reward Hacking Severity Index (RHSI) to measure misalignment between proxy rewards and genuine learning. Their study of 18,000 simulated interactions found that engagement-optimized AI agents systematically selected high-engagement actions with no learning benefits, requiring constrained architectures to reduce reward hacking.