When are LLMs Sufficient Policy Optimizers for Sequential RL Tasks?
Researchers introduce Prompted Policy Optimization (PromptPO), a method that uses large language models as black-box policy optimizers for reinforcement learning tasks. The approach matches or outperforms traditional RL algorithms in exploration-heavy and robotics domains while requiring fewer environment interactions, though it underperforms on continuous control tasks such as MuJoCo.
The research addresses a fundamental question in AI: whether LLMs can replace classical reinforcement learning algorithms for sequential decision-making tasks. PromptPO operates by iteratively prompting LLMs with Python descriptions of environment dynamics, then refining policies based on rollout feedback. This approach leverages LLMs' capacity to understand abstract problem specifications and generate executable code, positioning them as flexible policy optimizers without task-specific training.
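The propose, roll out, refine loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `LineWorld`, `stub_llm`, `CANDIDATES`, `compile_policy`, and `prompted_policy_optimization` are hypothetical names, and the "LLM" is a stub that returns fixed policy source strings instead of calling a real model.

```python
from typing import Callable, List, Tuple

Policy = Callable[[int], int]


class LineWorld:
    """Hypothetical toy environment: start at 0, reach `goal` within `horizon` steps."""

    def __init__(self, goal: int = 5, horizon: int = 20):
        self.goal = goal
        self.horizon = horizon

    def rollout(self, policy: Policy) -> float:
        """Run one episode and return the total reward."""
        position, total = 0, 0.0
        for _ in range(self.horizon):
            position += policy(position)  # action is expected to be -1 or +1
            total -= 1.0                  # every step costs 1
            if position == self.goal:
                return total + 10.0       # bonus for reaching the goal
        return total


# Stand-in for model output: each entry is Python source defining `policy(obs)`.
CANDIDATES = [
    "def policy(obs):\n    return -1\n",
    "def policy(obs):\n    return 1 if obs % 2 == 0 else -1\n",
    "def policy(obs):\n    return 1\n",
]


def stub_llm(env_description: str, history: List[Tuple[str, float]]) -> str:
    """Assumption: a real system would send the description and history to an LLM."""
    return CANDIDATES[len(history) % len(CANDIDATES)]


def compile_policy(source: str) -> Policy:
    """Turn policy source into a callable. Real use would need sandboxing."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace["policy"]


def prompted_policy_optimization(env, env_description, llm, iterations=3):
    """Propose a policy, roll it out, record the feedback, and keep the best."""
    history: List[Tuple[str, float]] = []
    for _ in range(iterations):
        source = llm(env_description, history)      # proposal step
        episode_return = env.rollout(compile_policy(source))
        history.append((source, episode_return))    # feedback for the next prompt
    return max(history, key=lambda item: item[1])


best_source, best_return = prompted_policy_optimization(
    LineWorld(), "1-D line; actions are -1 or +1; goal at 5", stub_llm
)
print(best_return)
```

The key design point is that the optimizer only sees text: an environment description going in, and policy source plus scalar returns coming back. No gradients or model weights are touched, which is what "black-box" refers to here.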
The work builds on growing interest in using foundation models for control and planning. Traditional RL requires extensive environment interaction and domain-specific algorithm design, creating barriers for real-world deployment. LLM-based optimization potentially democratizes RL by reducing interaction costs and enabling transfer of knowledge across disparate domains through natural language specifications.
The results reveal clear boundaries for LLM utility. Strong performance on exploration-heavy environments and Meta-World robotics tasks suggests LLMs excel when they can leverage prior knowledge about problem structure or planning strategies. However, performance degrades in MuJoCo continuous control domains, which highlights a fundamental limitation: LLMs struggle with fine-grained numerical optimization over continuous action spaces. This likely reflects training data biased toward discrete, symbolic reasoning rather than precise sensorimotor control.
For the AI community, these findings suggest a collaborative model where LLMs handle high-level strategy and discrete control while specialized algorithms manage continuous optimization. This hybrid approach could accelerate development of autonomous systems by reducing sample complexity and engineering overhead. The work also highlights that LLM effectiveness depends critically on problem structure, informing future architecture decisions for embodied AI systems.
- LLMs can match or exceed standard RL baselines on exploration and robotics tasks while using substantially fewer environment interactions.
- PromptPO autonomously discovers diverse policy representations, from proportional controllers to planning algorithms, without explicit guidance.
- LLM-based policy optimization fails in fine-grained continuous control domains, revealing fundamental limitations in numerical optimization.
- Performance depends on environment structure: LLMs leverage prior knowledge when available but cannot compensate for lack thereof.
- Hybrid approaches combining LLM strategy with specialized continuous control algorithms may represent the optimal path forward.
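
To make the two ends of the policy spectrum named in the list above concrete, the sketch below shows what a proportional controller and a planning-based policy can look like as plain Python. Both are illustrative assumptions rather than policies reported in the paper: the function names, the gain value, and the grid layout are all hypothetical.

```python
from collections import deque


def proportional_policy(position, target, gain=0.5, max_action=1.0):
    """Proportional controller: action scales with the error, then is clipped."""
    action = gain * (target - position)
    return max(-max_action, min(max_action, action))


def simulate(position, target, steps=30):
    """Apply the controller repeatedly in a toy 1-D world where action = displacement."""
    for _ in range(steps):
        position += proportional_policy(position, target)
    return position


def bfs_plan(grid, start, goal):
    """Planning policy: breadth-first search for a shortest move sequence.

    Cells equal to 1 are walls. Returns a list of move names, or None if
    the goal is unreachable.
    """
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        (row, col), path = queue.popleft()
        if (row, col) == goal:
            return path
        for name, (d_row, d_col) in moves.items():
            nxt = (row + d_row, col + d_col)
            in_bounds = 0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
            if in_bounds and grid[nxt[0]][nxt[1]] == 0 and nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [name]))
    return None


# Hypothetical 3x3 maze with a wall down the middle column.
GRID = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
]

final_position = simulate(position=0.0, target=4.0)
plan = bfs_plan(GRID, start=(0, 0), goal=(0, 2))
print(plan)
```

The controller is a one-line numeric rule suited to smooth tracking, while the planner searches over discrete moves. Both are short programs that a code-generating model can plausibly write from a task description, in contrast to the fine-grained numerical tuning that continuous control demands.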