AI · Bullish · arXiv – CS AI · 6h ago · 7/10
Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning
Researchers propose Dropout-GRPO, a method that addresses a fundamental limitation in training latent-reasoning language models by introducing structured stochasticity through dropout masks. The technique enables Group Relative Policy Optimization to work effectively with continuous hidden states rather than discrete tokens, improving performance on mathematical reasoning tasks.
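To make the idea concrete, here is a minimal toy sketch, not the paper's implementation. It assumes that each rollout in a group draws one dropout mask and reuses it across all latent steps, so an otherwise deterministic continuous rollout produces different outcomes within the group. Rewards are then normalized into group-relative advantages (reward minus group mean, divided by group standard deviation), which is the standard GRPO step. All function names, the tanh update, the toy reward, and the use of the population standard deviation are illustrative assumptions.

```python
import math
import random
import statistics


def sample_dropout_mask(dim, p, rng):
    """Inverted-dropout mask: zero a unit with probability p, else scale by 1/(1-p)."""
    keep = 1.0 - p
    return [(1.0 / keep) if rng.random() < keep else 0.0 for _ in range(dim)]


def latent_rollout(hidden, mask, steps=3):
    """Toy deterministic latent update (hypothetical): mask, then squash with tanh.

    The same mask is reused at every step, so the mask is the only
    source of randomness in the rollout (an assumption of this sketch).
    """
    h = list(hidden)
    for _ in range(steps):
        h = [math.tanh(x * m) for x, m in zip(h, mask)]
    return h


def toy_reward(final_hidden):
    """Placeholder reward (hypothetical): sum of the final hidden state."""
    return sum(final_hidden)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: (r - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; a choice made for this sketch
    return [(r - mean) / (std + eps) for r in rewards]


def dropout_group(hidden, group_size, p, seed=0):
    """Sample one mask per group member and return (masks, rewards, advantages)."""
    rng = random.Random(seed)
    masks = [sample_dropout_mask(len(hidden), p, rng) for _ in range(group_size)]
    rewards = [toy_reward(latent_rollout(hidden, m)) for m in masks]
    return masks, rewards, group_relative_advantages(rewards)


# Example: one prompt's initial hidden state, a group of 8 masked rollouts.
masks, rewards, advantages = dropout_group([0.5, -0.2, 0.9, 0.1], group_size=8, p=0.25)
```

In a real training loop the advantages would weight a policy-gradient term over the sampled masks or latent trajectories. How the paper defines that likelihood is not stated in this summary, so it is left out of the sketch.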