AIBullisharXiv – CS AI · 10h ago6/10
🧠
HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control
Researchers introduce HTPO, a novel reinforcement learning algorithm that optimizes Large Language Models by assigning different learning objectives to different tokens based on their functional roles in reasoning tasks. The method achieves significant performance improvements on challenging benchmarks like AIME, demonstrating that granular token-level control can better balance exploration and exploitation in AI training.