
Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning

arXiv – CS AI | Wooil Jung | June 10, 2026 at 04:00 AM

🤖AI Summary

Researchers propose Dropout-GRPO, a method that makes reinforcement-learning post-training workable for latent-reasoning language models by injecting structured stochasticity through dropout masks. The technique lets Group Relative Policy Optimization (GRPO) operate on continuous hidden states rather than discrete tokens, and it improves pass@1 on the GSM8K mathematical reasoning benchmark.

Analysis

The paper targets a specific obstacle in reinforcement-learning post-training. Latent-reasoning architectures like Coconut reason through continuous hidden states rather than explicit reasoning tokens, offering potential efficiency gains. However, this design creates a training problem: because the latent phase is deterministic given fixed parameters, multiple rollouts during reinforcement learning produce identical trajectories. Identical trajectories earn identical rewards, so every rollout's advantage relative to its group is zero and Group Relative Policy Optimization (GRPO) receives no learning signal during post-training.
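The collapse described above can be illustrated with a minimal sketch. It assumes the common GRPO formulation in which each rollout's reward is standardized against the mean and standard deviation of its group; the function name, the `eps` stabilizer, and the toy reward values are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts (GRPO-style).

    Assumption: population standard deviation plus a small eps to avoid
    division by zero; the paper's exact normalization may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Deterministic latent phase: all rollouts in the group are the same
# trajectory, so they all receive the same reward.
identical = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# Diverse rollouts: some succeed and some fail, so rewards differ.
diverse = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

print(identical)  # every advantage is exactly 0.0: no learning signal
print(diverse)    # positive for successes, negative for failures
```

With identical rewards the numerator is zero for every rollout, so the policy gradient weighted by these advantages vanishes regardless of how the group is normalized.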

The proposed solution uses dropout in an unconventional way. Dropout is normally a training-time regularizer that is switched off at inference; here the researchers keep it active during rollout generation and apply a single frozen Bernoulli mask across all latent recurrence steps within each rollout. Because each rollout in a group draws its own mask, the rollouts diverge and trajectory diversity is restored. The paper frames each dropout-masked rollout as a posterior sample from a variational distribution, connecting the method to Bayesian model averaging. This theoretical grounding matters because it is offered as the justification for why the method works and for the validity of the resulting gradients.
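The frozen-mask idea can be sketched on a toy recurrence. Everything here is a hypothetical stand-in rather than the paper's implementation: the `tanh` update, the 4-dimensional state, the fixed weights, the choice to mask the hidden state itself, and the inverted-dropout scaling are all assumptions made for illustration.

```python
import math
import random


def sample_frozen_mask(dim, p_drop, rng):
    """Draw one Bernoulli keep/drop mask, scaled by 1/keep (inverted dropout)."""
    keep = 1.0 - p_drop
    return [1.0 / keep if rng.random() < keep else 0.0 for _ in range(dim)]


def latent_rollout(h0, weights, mask, steps):
    """Toy latent recurrence h <- tanh(W (mask * h)).

    The same mask is reused at every step, i.e. it is frozen per rollout.
    """
    h = list(h0)
    for _ in range(steps):
        masked = [m * x for m, x in zip(mask, h)]
        h = [math.tanh(sum(w * x for w, x in zip(row, masked))) for row in weights]
    return h


rng = random.Random(0)
dim, steps, group_size = 4, 3, 4
weights = [
    [0.5, -1.0, 0.3, 0.2],
    [1.0, 0.25, -0.4, 0.6],
    [-0.7, 0.8, 0.1, -0.3],
    [0.2, -0.5, 0.9, 0.4],
]
h0 = [0.5, -0.3, 0.8, 0.1]

# Without dropout, every rollout in the group is the same trajectory.
no_dropout = [latent_rollout(h0, weights, [1.0] * dim, steps) for _ in range(group_size)]

# With one frozen mask per rollout, rollouts that draw different masks
# follow different latent trajectories.
masks = [sample_frozen_mask(dim, 0.5, rng) for _ in range(group_size)]
with_dropout = [latent_rollout(h0, weights, m, steps) for m in masks]
```

Freezing the mask means each rollout stays deterministic given its mask, so the only source of randomness in the latent phase is the mask draw itself. That is the property that lets a rollout be read as a sample from a distribution over masked sub-networks.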

Empirical results on GSM8K show a modest but measurable gain: the baseline Coconut model achieves 27.29% pass@1, while Dropout-GRPO reaches 29.01%, an improvement of 1.72 percentage points (about 6.3% relative). The result comes from a single benchmark, so it indicates that the approach is viable rather than establishing how far it generalizes. For the AI research community, the work addresses a barrier to applying reinforcement-learning post-training to latent-reasoning architectures. Companies and researchers pursuing parameter-efficient models gain a candidate method for such post-training. The contribution matters because it pairs a theoretical account with a practical implementation, supporting broader exploration of continuous-reasoning model designs that could eventually offer computational advantages over traditional discrete-token approaches.

Key Takeaways

→Dropout-GRPO solves the trajectory collapse problem in latent-reasoning model training by introducing structured variational stochasticity
→The method treats dropout-masked rollouts as Bayesian posterior samples, providing theoretical justification beyond empirical results
→GSM8K performance improves from 27.29% to 29.01% pass@1, demonstrating practical viability for mathematical reasoning tasks
→The approach enables Group Relative Policy Optimization to work with continuous hidden states rather than discrete tokens
→This technique makes latent-reasoning LLMs practical candidates for reinforcement-learning post-training