AINeutralarXiv – CS AI · 14h ago6/10
🧠
Hista and Numca: Estimate State Value Effectively for LLM Reinforcement Learning
Researchers introduce Hista and Numca, two novel techniques for improving state value estimation in large language model reinforcement learning. The work identifies a critical gap where standard RL approaches like PPO fail to accurately estimate state values, proposing solutions that leverage numerical spans and hidden state representations to enhance training stability and performance.