AI · Neutral · arXiv – CS AI · 6h ago · 6/10
SocraticPO: Policy Optimization via Interactive Guidance
SocraticPO is a new reinforcement learning framework that improves large language model training by combining natural-language teacher guidance with reward decay, rather than relying solely on scalar outcome rewards. The method shows improvements on scientific reasoning benchmarks while preventing models from exploiting teacher assistance as a shortcut to rewards.
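To illustrate why reward decay can discourage a model from leaning on the teacher, here is a minimal sketch. The summary does not state the actual decay schedule, so the geometric form, the function name `decayed_reward`, and the `decay` parameter are all assumptions made for illustration, not the paper's formulation.

```python
def decayed_reward(outcome_reward: float, hints_used: int, decay: float = 0.5) -> float:
    """Scale an outcome reward down for each teacher hint consumed.

    Hypothetical schedule: reward * decay ** hints_used. The real
    SocraticPO schedule is not given in the summary above.
    """
    if hints_used < 0:
        raise ValueError("hints_used must be non-negative")
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    return outcome_reward * decay ** hints_used


# A correct answer reached without help keeps its full reward;
# the same answer reached with two hints earns less.
unaided = decayed_reward(1.0, hints_used=0)
guided = decayed_reward(1.0, hints_used=2)
```

Under this assumed schedule, a rollout that solves the task unaided always earns at least as much as one that needed guidance, so asking for hints is never a shortcut to a higher reward.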