AINeutralarXiv – CS AI · 8h ago6/10
🧠
Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States
Researchers introduce POISE, a reinforcement learning method that uses a language model's internal hidden states to estimate baseline values for policy optimization, eliminating the computational overhead of separate critic models. The approach demonstrates comparable performance to existing methods while requiring significantly less compute, enabling more efficient training of large reasoning models.