POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles
Researchers introduce POETS, a novel framework that optimizes large language models through compute-efficient policy ensembles while quantifying uncertainty. By leveraging KL-regularized Thompson sampling and shared backbone architectures with independent LoRA branches, POETS achieves superior sample efficiency in scientific discovery tasks while reducing computational overhead compared to traditional ensemble methods.
POETS represents a meaningful advance at the intersection of reinforcement learning and large language model optimization, addressing a fundamental challenge in sequential decision-making: balancing exploration against exploitation under computational constraints. The framework sidesteps the expensive nested loop of training separate uncertainty-aware reward models and policies by directly training policy ensembles that encode epistemic uncertainty through KL-regularized objectives. This architectural choice matters in practice because ensemble methods, while theoretically powerful, typically demand prohibitive computational resources when applied to LLMs.
The technical contribution stems from a key insight: policies trained with KL regularization implicitly encode underlying reward functions. By matching these implicitly encoded functions against bootstrapped online data, POETS achieves theoretical guarantees—specifically cumulative regret bounds of O(√Tγ_T)—without sacrificing computational efficiency. The use of shared pre-trained backbones with independent Low-Rank Adaptation branches enables meaningful ensemble diversity while maintaining memory efficiency.
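The implicit-reward insight has a simple closed form in a toy discrete setting: the KL-regularized optimal policy tilts the reference policy by the exponentiated reward, so the reward can be read back off the log-ratio of the two policies, up to an additive constant. The sketch below (our own illustration with hypothetical numbers, not the authors' code) demonstrates this round trip:

```python
import math

beta = 0.5                              # KL-regularization strength
pi_ref = [0.25, 0.25, 0.25, 0.25]       # uniform reference policy
reward = [1.0, 0.2, -0.5, 0.0]          # hypothetical per-action rewards

# KL-regularized optimum: pi(y) is proportional to pi_ref(y) * exp(r(y) / beta).
unnorm = [p * math.exp(r / beta) for p, r in zip(pi_ref, reward)]
Z = sum(unnorm)
pi = [u / Z for u in unnorm]

# The reward is recoverable from the policy up to the constant beta * log(Z):
# beta * log(pi / pi_ref) = r - beta * log(Z) for every action.
recovered = [beta * math.log(p / q) for p, q in zip(pi, pi_ref)]
```

This is why matching the implicitly encoded reward against bootstrapped data suffices: no separate reward model ever needs to be trained.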
Empirical validation across protein search and quantum circuit design demonstrates state-of-the-art sample efficiency, suggesting practical applications in domains where data acquisition proves expensive. The framework's particular strength in off-policy and small dataset regimes indicates utility for real-world scenarios where collecting large training corpora remains infeasible. For AI researchers and practitioners building optimization systems, POETS provides a computationally tractable path toward uncertainty-aware decision-making at scale. The work bridges academic theory and practical implementation, offering both theoretical rigor and empirical validation that positions it as a methodologically sound contribution to the field.
- POETS achieves uncertainty quantification in LLM optimization without expensive nested training loops through implicit reward function encoding.
- The framework uses shared backbone architectures with independent LoRA branches to enable efficient ensemble methods on large language models.
- Theoretical analysis proves POETS conducts KL-regularized Thompson sampling with strong cumulative regret bounds of O(√Tγ_T).
- Empirical results demonstrate state-of-the-art sample efficiency in scientific discovery domains including protein search and quantum circuit design.
- The approach shows particular robustness in off-policy settings and small dataset regimes, relevant to real-world deployment constraints.
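The ensemble Thompson-sampling loop these points describe can be sketched in a toy, non-LLM setting. This is a minimal sketch under our own assumptions, not the paper's implementation: a fixed feature map stands in for the shared frozen backbone, small independent linear heads stand in for the LoRA branches, and bootstrapped updates give each member its own view of the data.

```python
import random

random.seed(0)
ACTIONS = [0.0, 0.25, 0.5, 0.75, 1.0]       # toy 1-D design space

def true_reward(a):                          # unknown noisy objective (hypothetical)
    return -(a - 0.75) ** 2 + random.gauss(0.0, 0.05)

def features(a):                             # stands in for the shared backbone
    return [1.0, a, a * a]

K = 5                                        # ensemble size ("LoRA branches")
weights = [[random.gauss(0.0, 0.1) for _ in range(3)] for _ in range(K)]

def predict(w, a):
    return sum(wi * fi for wi, fi in zip(w, features(a)))

for t in range(200):
    k = random.randrange(K)                  # Thompson step: sample one member
    a = max(ACTIONS, key=lambda x: predict(weights[k], x))
    r = true_reward(a)
    phi = features(a)
    for w in weights:                        # bootstrapped update: each member
        if random.random() < 0.5:            # sees the new point with prob 1/2
            err = predict(w, a) - r
            for i in range(3):
                w[i] -= 0.1 * err * phi[i]   # SGD step on squared error

best = max(ACTIONS, key=lambda x: sum(predict(w, x) for w in weights))
```

The design choice mirrored here is that only the small heads differ across members, so ensemble diversity (and hence epistemic uncertainty) comes at a fraction of the memory cost of replicating the backbone.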