AI · Neutral · arXiv - CS AI · Feb 27
Soft Sequence Policy Optimization
Researchers introduce Soft Sequence Policy Optimization (SSPO), a new reinforcement learning method for training Large Language Models that improves upon existing policy optimization approaches. The technique uses soft gating functions and sequence-level importance sampling to enhance training stability and performance in mathematical reasoning tasks.
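To make the two ingredients named above more concrete, here is a minimal sketch in plain Python. It is not the paper's formulation: the length-normalized sequence ratio, the Gaussian-shaped gate, the `temperature` parameter, and all function names are assumptions chosen purely for illustration.

```python
import math


def sequence_ratio(new_logps, old_logps):
    """Sequence-level importance ratio between the new and old policy.

    Assumption: the ratio is length-normalized, i.e. the exponential of the
    mean per-token log-probability difference over the whole response.
    """
    if not new_logps or len(new_logps) != len(old_logps):
        raise ValueError("log-prob lists must be non-empty and equal length")
    mean_diff = sum(n - o for n, o in zip(new_logps, old_logps)) / len(new_logps)
    return math.exp(mean_diff)


def soft_gate(ratio, temperature=0.5):
    """Hypothetical soft gate: equals 1 at ratio == 1 and decays smoothly
    as the ratio drifts away, instead of cutting off at a hard clip bound.
    """
    return math.exp(-(math.log(ratio) ** 2) / (2.0 * temperature ** 2))


def gated_objective(new_logps, old_logps, advantage, temperature=0.5):
    """Per-sequence surrogate: gate * ratio * advantage (illustrative only)."""
    ratio = sequence_ratio(new_logps, old_logps)
    return soft_gate(ratio, temperature) * ratio * advantage


if __name__ == "__main__":
    old = [-1.2, -0.7, -2.1]   # token log-probs under the sampling policy
    new = [-1.0, -0.6, -2.0]   # token log-probs under the current policy
    print(gated_objective(new, old, advantage=1.0))
```

The point of the smooth gate in this sketch is that sequences far from the sampling policy are down-weighted gradually rather than dropped at a fixed threshold; whether SSPO uses this particular shape is not stated in the summary.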