#gradient-estimation News & Analysis

10 articles tagged with #gradient-estimation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

10 articles

AIBullisharXiv – CS AI · Jun 57/10

🧠

OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation

OrderGrad introduces a family of gradient estimators that optimize order-statistic objectives rather than expected returns, enabling policy-gradient methods to directly target risk-sensitive metrics like Value-at-Risk, Conditional Value-at-Risk, and best-of-K outcomes. The method works as a plug-and-play reward transformation compatible with standard reinforcement learning algorithms, with applications demonstrated in LLM post-training and other domains.

AIBullisharXiv – CS AI · Jun 27/10

🧠

ProbMoE: Differentiable Probabilistic Routing for Mixture-of-Experts

Researchers introduce ProbMoE, a probabilistic routing framework that solves a fundamental challenge in training Mixture-of-Experts models by replacing discrete, non-differentiable top-k routing with a differentiable probabilistic approach. The method achieves comparable or improved performance while enabling dynamic expert allocation and better expert utilization across various benchmarks.

AIBullisharXiv – CS AI · Mar 57/10

🧠

Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization

Researchers introduce Dynamic Pruning Policy Optimization (DPPO), a new framework that accelerates AI language model training by 2.37x while maintaining accuracy. The method addresses computational bottlenecks in Group Relative Policy Optimization through unbiased gradient estimation and improved data efficiency.

AINeutralarXiv – CS AI · Jun 26/10

🧠

Score Function Gradient Estimation to Widen the Applicability of Decision-Focused Learning

Researchers propose a new decision-focused learning method using score function gradient estimation and stochastic smoothing to train machine learning models that directly optimize for task performance rather than prediction accuracy. The approach removes restrictive assumptions about problem structure, extending applicability to nonlinear objectives, constrained optimization, and two-stage stochastic problems.

AIBullisharXiv – CS AI · May 286/10

🧠

ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation

Researchers introduce ProRL, a reinforcement learning framework designed to improve proactive recommender systems that guide users toward target items through sequential recommendations. The approach addresses fundamental gradient estimation problems in policy learning by implementing stepwise reward centering and position-specific advantage estimation, demonstrating superior performance on real-world datasets.

AINeutralarXiv – CS AI · May 276/10

🧠

GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training

Researchers introduce GAC, a noise-aware adaptive controller that optimizes the mixing of supervised fine-tuning and reinforcement learning during AI model post-training. By dynamically adjusting mixing weights based on gradient variance and signal disagreement, GAC outperforms fixed schedules across math, code, science, and logic tasks with minimal computational overhead.

AINeutralarXiv – CS AI · May 126/10

🧠

Revisiting Mixture Policies in Entropy-Regularized Actor-Critic

Researchers propose a marginalized reparameterization (MRP) estimator to enable practical use of mixture policies in reinforcement learning, addressing a long-standing gap between theoretical potential and practical implementation. By reducing variance compared to likelihood-ratio methods, MRP mixture policies achieve performance parity with standard Gaussian policies while offering greater flexibility in continuous action spaces.

🏢 Google

AIBullisharXiv – CS AI · May 116/10

🧠

Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective

Researchers propose CTPO (Cumulative Token Policy Optimization), a new approach to reinforcement learning for large language models that addresses the bias-variance tradeoff in importance sampling ratios. By using cumulative token-level ratios with position-adaptive clipping, CTPO achieves superior performance on mathematical reasoning benchmarks compared to existing methods like PPO and GRPO.

AIBullisharXiv – CS AI · Apr 136/10

🧠

On Divergence Measures for Training GFlowNets

Researchers propose improved divergence measures for training Generative Flow Networks (GFlowNets), comparing Renyi-α, Tsallis-α, and KL divergences to enhance statistical efficiency. The work introduces control variates that reduce gradient variance and achieve faster convergence than existing methods, bridging GFlowNets training with generalized variational inference frameworks.

AINeutralarXiv – CS AI · Mar 175/10

🧠

Align Forward, Adapt Backward: Closing the Discretization Gap in Logic Gate Networks

Researchers propose CAGE (Confidence-Adaptive Gradient Estimation) to solve the training-inference mismatch problem in neural networks that use soft mixtures during training but hard selection during inference. The method achieves over 98% accuracy on MNIST with zero selection gap, significantly outperforming existing approaches like Gumbel-ST which suffers accuracy collapse.