Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning
Researchers propose the Generative Actor-Critic (GenAC), a new approach to value modeling in reinforcement learning for large language models that replaces one-shot scalar value predictions with chain-of-thought reasoning. The method addresses a longstanding credit-assignment challenge, improving value approximation and downstream RL performance over existing value-based and value-free baselines.
The research tackles a fundamental problem in applying reinforcement learning to large language models: how to accurately estimate value functions for credit assignment. Traditional actor-critic methods rely on learned value models to guide policy optimization, but this approach has fallen out of favor in modern LLM RL because discriminative critics prove unreliable and fail to scale predictably. The paper's core insight centers on representation complexity theory, which suggests that one-shot scalar value prediction may be inherently limited in expressiveness, explaining why conventional critics plateau rather than improve with scale.
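To make the credit-assignment role of the critic concrete, consider generalized advantage estimation (GAE), a standard actor-critic technique (used here only as a generic illustration, not as this paper's method): the critic's per-step value estimates are turned into advantages that weight the policy update, so any systematic error in the critic distorts every update. A minimal sketch in plain Python, with the value estimates as hypothetical inputs:

```python
# Generalized advantage estimation (GAE): a standard way a critic's
# value estimates feed credit assignment in actor-critic RL.
# `values` holds V(s_0)..V(s_T) from some critic; an inaccurate
# critic biases every advantage below.

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """rewards: r_0..r_{T-1}; values: V(s_0)..V(s_T) (length T+1)."""
    assert len(values) == len(rewards) + 1
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # TD error: how much the critic mispredicts the one-step return
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# With an uninformative critic (all-zero values) and a single terminal
# reward, the full return is propagated back to every step:
print(gae_advantages([0.0, 0.0, 1.0], [0.0] * 4, gamma=1.0, lam=1.0))
# → [1.0, 1.0, 1.0]
```

An informative critic instead concentrates advantage on the steps that actually changed the expected outcome, which is exactly the signal the paper argues scalar critics struggle to provide reliably.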
The proposed Generative Actor-Critic addresses this limitation by replacing scalar value prediction with a generative model capable of chain-of-thought reasoning before producing estimates. This approach mirrors recent successes in using reasoning-based methods for complex LLM tasks. The introduction of In-Context Conditioning ensures the critic remains calibrated to the actor's evolving policy during training, addressing the stability issues that plague discriminative approaches.
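The paper's exact interface is not reproduced here, but the idea can be sketched: instead of regressing a scalar from hidden states, the critic generates reasoning text and the value is parsed from the end of that generation, while the prompt carries recent actor rollouts so the estimate tracks the current policy. The prompt format, the `VALUE:` convention, and all function names below are illustrative assumptions, not the paper's implementation:

```python
import re

# Hypothetical sketch of a generative-critic interface. The prompt
# layout, the "VALUE:" marker, and these function names are
# assumptions for illustration only.

def build_critic_prompt(question, partial_response, recent_rollouts):
    """Assemble a prompt that conditions the critic on recent actor
    rollouts (a stand-in for the paper's In-Context Conditioning)."""
    examples = "\n---\n".join(recent_rollouts)
    return (
        f"Recent policy rollouts:\n{examples}\n---\n"
        f"Question: {question}\n"
        f"Partial response: {partial_response}\n"
        "Reason step by step, then end with 'VALUE: <number in [0, 1]>'."
    )

def parse_value(generation, default=0.5):
    """Extract the scalar estimate from the critic's chain of thought."""
    m = re.search(r"VALUE:\s*(-?\d+(?:\.\d+)?)", generation)
    return float(m.group(1)) if m else default

# A real system would sample `generation` from the critic model; here a
# mocked generation shows the final scalar extraction.
mock = "Step 2 of the response drops a term, so success is unlikely. VALUE: 0.2"
print(parse_value(mock))  # → 0.2
```

The design point is that the scalar arrives only after a generated reasoning trace, letting the critic spend computation per estimate, whereas a discriminative head must commit in a single forward pass.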
The work carries significant implications for LLM RL development. Stronger value modeling directly improves credit assignment, a bottleneck that constrains policy learning efficiency and sample efficiency. Better credit assignment translates to faster convergence, more reliable learning, and stronger final policies. For practitioners developing LLM-based agents, this suggests that revisiting classical RL foundations with modern generative modeling techniques can unlock performance gains. The research also validates a broader principle: architectural choices matter as much as scale in achieving reliable neural function approximation.
- Generative critics that reason via chain of thought outperform traditional one-shot scalar value prediction in LLM RL
- Conventional discriminative value models fail to scale reliably due to representation-complexity limitations
- In-Context Conditioning keeps the critic calibrated to the actor's evolving policy, improving training stability
- Stronger value modeling improves credit assignment and downstream RL performance compared with value-based and value-free baselines
- Combining classical RL foundations with modern generative techniques yields practical improvements in LLM agent development