Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States
Researchers introduce POISE, a reinforcement learning method that uses a language model's internal hidden states to estimate baseline values for policy optimization, eliminating the computational overhead of separate critic models. The approach demonstrates comparable performance to existing methods while requiring significantly less compute, enabling more efficient training of large reasoning models.
POISE addresses a fundamental inefficiency in current reinforcement learning approaches for large language models. Existing methods like PPO require maintaining a separate critic model at the same scale as the policy model, effectively doubling computational requirements. GRPO reduces this burden but requires multiple rollouts per prompt, which limits how many distinct prompts fit within a fixed budget and can destabilize training. The core innovation lies in extracting value signals from the policy model's existing internal computations rather than adding external infrastructure.
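A minimal sketch of that general idea, not the authors' implementation, is shown below: a tiny probe head maps summaries of the actor's own hidden states (plus a few scalar statistics) to a scalar baseline, taking the place of a full critic network. Class names, parameter names, and feature choices here are illustrative assumptions.

```python
# Illustrative sketch (not the POISE code): a lightweight probe that turns the
# actor's own hidden-state features into a scalar baseline value.
import torch
import torch.nn as nn

class ValueProbe(nn.Module):
    """Tiny head predicting expected reward from the actor's internal features."""

    def __init__(self, hidden_dim: int, n_scalar_feats: int = 2):
        super().__init__()
        # Input: a hidden-state summary plus scalar statistics (e.g. token entropy).
        self.head = nn.Linear(hidden_dim + n_scalar_feats, 1)

    def forward(self, hidden_summary: torch.Tensor, scalar_feats: torch.Tensor) -> torch.Tensor:
        # hidden_summary: (batch, hidden_dim); scalar_feats: (batch, n_scalar_feats)
        return self.head(torch.cat([hidden_summary, scalar_feats], dim=-1)).squeeze(-1)

# The probe is fit by regression against observed rewards (e.g. MSE), and the
# policy update then uses advantage = reward - probe(features) as its baseline.
```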
This research builds on the growing recognition that language models encode rich information in their hidden states beyond what token generation alone requires. By training a lightweight probe to predict expected rewards from these internal signals, including hidden states at different points in the trajectory and token-entropy statistics, the authors obtain efficient baseline estimation. A cross-rollout construction keeps the policy gradient unbiased even though the baseline is conditioned on trajectory features; without it, a baseline that depends on a trajectory's own sampled tokens would correlate with that trajectory's learning signal and bias it.
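To make the unbiasedness point concrete, here is one hedged illustration of how a cross-rollout pairing could work, under the assumption (not taken from the paper) of two rollouts per prompt: each rollout's baseline is the probe's prediction computed from the other rollout's features, so the baseline never depends on the trajectory it is subtracted from.

```python
# Hedged illustration only: the exact POISE construction may differ.
# Assumes two rollouts per prompt; each rollout is scored against a baseline
# predicted from the *other* rollout's features, keeping baseline and trajectory
# independent so the policy gradient stays unbiased.
import torch

def cross_rollout_advantages(rewards: torch.Tensor, probe_values: torch.Tensor) -> torch.Tensor:
    """rewards, probe_values: (2, batch) tensors for paired rollouts of the same prompts."""
    swapped_baselines = torch.flip(probe_values, dims=[0])  # swap the two rollouts' predictions
    return rewards - swapped_baselines
```

The essential property is only the independence of baseline and trajectory; any pairing or leave-one-out scheme that preserves it serves the same purpose.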
The practical implications are substantial for AI development. Reducing compute overhead during training enables higher prompt diversity within fixed budgets, which the authors demonstrate stabilizes learning and improves gradient estimation. Empirical results on math reasoning benchmarks show POISE matches more expensive approaches such as DAPO while consuming less compute. The value estimator's generalization across different verifiable tasks suggests the method transfers well beyond specific problem domains.
Looking forward, this work exemplifies the trend toward extracting maximum value from existing model computations rather than adding external components. As reasoning models scale larger, efficient baseline estimation becomes increasingly critical for practical training. The technique's applicability to various LLM scales and task types indicates it could become a standard component in reinforcement learning pipelines for language models.
- POISE uses the policy model's internal states to estimate baselines, eliminating the need for separate critic models.
- The method enables higher prompt diversity during training within fixed compute budgets, improving gradient stability.
- Performance matches existing approaches like DAPO while requiring substantially less computational overhead.
- Cross-rollout construction preserves gradient unbiasedness despite using trajectory-conditioned feature predictions.
- The value estimator generalizes across multiple verifiable reasoning tasks and different model scales.