Modularized Reinforcement Learning on LLMs: From MDP Creation to Exploration and Learning
A comprehensive survey maps reinforcement learning algorithm design decisions across three stages (MDP creation, exploration strategies, and learning approaches) and identifies research gaps in LLM training, where value-based methods and off-policy techniques remain underexplored despite proven effectiveness in classical RL.
This arXiv survey provides a systematic framework for understanding reinforcement learning in large language models by decomposing RL algorithm design into modular components. The research identifies a critical asymmetry in the field: current LLM post-training relies heavily on policy gradient methods (PPO, GRPO) while ignoring entire categories of established RL techniques that have demonstrated value in other domains.
The taxonomy spans MDP formulation choices including reward functions, state/action spaces, and discount factors that directly impact training efficiency. Exploration strategies range from simple temperature sampling to sophisticated tree search and curriculum learning approaches. The learning dimension analysis contrasts model-free versus model-based methods, value versus policy orientations, and on-policy versus off-policy paradigms alongside credit assignment mechanisms.
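To make the effect of one MDP design choice concrete, the following minimal sketch computes discounted returns for a response that is scored only at its final token. The four-token episode, the reward values, and the function name `discounted_returns` are illustrative assumptions, not taken from the survey.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Compute G_t = r_t + gamma * G_{t+1} for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical 4-token response with a sparse terminal reward:
# only the last token receives a score.
rewards = [0.0, 0.0, 0.0, 1.0]

undiscounted = discounted_returns(rewards, gamma=1.0)
discounted = discounted_returns(rewards, gamma=0.5)

print(undiscounted)
print(discounted)
```

With `gamma=1.0` every token receives the same return, while a smaller discount factor shrinks the learning signal for early tokens. This is one way the choice of discount factor can interact with a sparse reward function.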
These gaps represent concrete optimization opportunities. Value-based methods and bootstrapping approaches have produced state-of-the-art results in other reinforcement learning domains yet remain largely absent from the LLM training literature. Off-policy actor-critic techniques could improve sample efficiency, a critical concern given the computational cost of LLM training at scale.
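The difference between Monte Carlo and bootstrapped credit assignment can be sketched as follows. This is a generic TD(0) illustration under stated assumptions: the three-step episode and the critic estimates in `values` are hypothetical numbers, and the survey does not prescribe this particular formulation.

```python
from typing import List


def monte_carlo_targets(rewards: List[float], gamma: float) -> List[float]:
    """Monte Carlo target: the full discounted return observed from step t."""
    targets = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        targets[t] = running
    return targets


def td0_targets(rewards: List[float], values: List[float], gamma: float) -> List[float]:
    """Bootstrapped TD(0) target: r_t + gamma * V(s_{t+1}), with V(terminal) = 0."""
    targets = []
    for t, reward in enumerate(rewards):
        # Past the last state there is nothing left to estimate.
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        targets.append(reward + gamma * next_value)
    return targets


# Hypothetical 3-step episode with a terminal reward of 1.0.
rewards = [0.0, 0.0, 1.0]
# Hypothetical value estimates V(s_0), V(s_1), V(s_2) from a learned critic.
values = [0.2, 0.4, 0.9]

mc = monte_carlo_targets(rewards, gamma=1.0)
td = td0_targets(rewards, values, gamma=1.0)

print(mc)
print(td)
```

The Monte Carlo targets assign the same outcome to every step, whereas the bootstrapped targets differ per step because each one leans on the critic's estimate of the next state. That per-step signal is what makes bootstrapping a credit assignment mechanism, at the cost of depending on the quality of the value estimates.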
For AI researchers and practitioners, this framework makes it possible to identify unexplored combinations that might yield performance improvements or training efficiency gains. The survey creates a shared vocabulary between the classical RL and LLM communities, facilitating knowledge transfer. Developers working on LLM optimization pipelines gain a structured decision-making tool for selecting RL components based on their specific training objectives rather than following established defaults.
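One way to picture the modular view is to enumerate the combinations of the three learning axes named above. The sketch below is a simplification: the axis labels and the `COMMON_DEFAULT` tuple (standing in for PPO/GRPO-style training) are assumptions made for illustration, and the survey's actual taxonomy has more dimensions than these three.

```python
from itertools import product
from typing import Dict, List, Tuple

# Three of the learning axes discussed in the text (labels are illustrative).
AXES: Dict[str, Tuple[str, str]] = {
    "model": ("model-free", "model-based"),
    "objective": ("policy-based", "value-based"),
    "data": ("on-policy", "off-policy"),
}

# Assumed cell for PPO/GRPO-style policy gradient training.
COMMON_DEFAULT = ("model-free", "policy-based", "on-policy")


def enumerate_design_space(axes: Dict[str, Tuple[str, ...]]) -> List[Dict[str, str]]:
    """Return every combination of one option per axis as a dict."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*axes.values())]


design_space = enumerate_design_space(AXES)

# Every combination other than the assumed default cell.
less_explored = [c for c in design_space if tuple(c.values()) != COMMON_DEFAULT]

for combo in less_explored:
    print(combo)
```

Listing the cells this way does not say which combinations are worth pursuing; it only makes the candidates explicit so they can be checked against the literature one by one.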
- Current LLM training concentrates on policy gradient methods while overlooking value-based and off-policy approaches proven effective in classical RL.
- MDP design choices, including reward functions and state/action spaces, significantly impact LLM training but lack systematic exploration in the literature.
- Bootstrapping-based credit assignment and off-policy actor-critic training represent high-potential unexplored areas for LLM optimization.
- The survey provides a modular taxonomy enabling researchers to systematically identify and test missing RL algorithm combinations.
- Transfer of established RL techniques to LLM training could improve sample efficiency and training performance.