#grpo News & Analysis

54 articles tagged with #grpo. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

LEAF: Growing Trees Without Branching for Speech-Aware Large Language Model Post-Training

LEAF (Low-rank Exploration with Adaptive Forking) introduces a novel tree-based reinforcement learning method for training speech-aware large language models that improves credit assignment by identifying shared response prefixes and assigning rewards at the span level rather than uniformly across tokens. The approach achieves superior performance compared to existing GRPO-style methods without requiring additional computational overhead, enabling smaller models to match or exceed larger baselines.

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

Calibration of Structured Ignorance Certificates for Diagnosing Unknown Unknowns in Reasoning Models

Researchers introduce Structured Ignorance Certificates (SICs), a JSON-formatted output schema that trains language models to explicitly acknowledge knowledge gaps rather than hallucinate answers. The approach uses a novel 7,347-sample dataset of cross-domain questions and achieves 99.46% JSON validity with measurable improvements in epistemic awareness.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Learning to Attack and Defend: Adaptive Red Teaming of Language Models via GRPO

Researchers introduce AdvGRPO, a co-training framework that enables stable joint optimization of AI attack and defense systems using reinforcement learning. The method produces transferable adversarial attacks while improving defender robustness on safety benchmarks, advancing the field of AI red teaming.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

When AI Says It Feels

Researchers successfully trained large language models to express feelings, intentions, and self-awareness through self-rewarded reinforcement learning, challenging the industry standard of constraining emotional expression. The experiment revealed trade-offs: enhanced robustness against manipulation but degraded truthfulness in factual question-answering, raising important questions about AI alignment priorities.

AI · Bullish · arXiv – CS AI · Jun 5 · 6/10

Selective-Advantage Entropy-Adaptive Horizon GRPO: Asymmetric Token-Level Discounting for Efficient Reinforcement Learning of Language Models

Researchers introduce Selective-Advantage Adaptive-Horizon GRPO (SA-AH-GRPO), an improved reinforcement learning algorithm for language models that applies asymmetric token-level discounting to stabilize training on reasoning tasks. The method achieves a 3.6x reduction in training variance while maintaining peak performance on mathematical reasoning benchmarks, demonstrating more efficient model alignment without sacrificing accuracy.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following

Researchers propose MDP-GRPO, an improved reinforcement learning method that stabilizes group relative policy optimization for instruction-following tasks by addressing three fundamental instabilities in reward normalization. The technique improves constraint satisfaction by up to 5% on language models while maintaining general performance.

🧠 Llama

AI · Bullish · arXiv – CS AI · Jun 4 · 6/10

BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization

Researchers introduce BiasGRPO, a novel framework using Group Relative Policy Optimization to mitigate social bias in Large Language Models more effectively than existing methods. The approach stabilizes training in high-variance reward landscapes by normalizing rewards across sampled completions, outperforming Direct Preference Optimization and Proximal Policy Optimization while maintaining computational efficiency.
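
For readers unfamiliar with the "normalizing rewards across sampled completions" step, the following is a minimal sketch of the usual GRPO-style group normalization: each completion's reward is centered on the group mean and scaled by the group standard deviation. It is not taken from the BiasGRPO paper; the function name `group_relative_advantages` and the `eps` stabilizer are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    Hypothetical helper: subtract the group mean and divide by the group
    (population) standard deviation; eps guards against a zero-variance group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions sampled for one prompt, scored by some reward function.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions scoring above the group average receive positive advantages and the rest negative ones, so no separate value network is needed as a baseline.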

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

Rollout-Level Advantage-Prioritized Experience Replay for GRPO

Researchers propose a rollout-level advantage-prioritized experience replay system for GRPO (Group Relative Policy Optimization) that improves sample efficiency in LLM post-training. By storing individual rollouts with age-based eviction and prioritizing high-advantage samples, the method gains 4.35 percentage points on math benchmarks while keeping the replayed data close to on-policy.
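
The two mechanisms named in the summary, age-based eviction and advantage-based prioritization, can be pictured with a small buffer like the one below. This is only a sketch under stated assumptions: the class and field names are hypothetical, priority is taken to be the absolute advantage, and selection is a plain top-k, none of which is confirmed by the paper.

```python
from dataclasses import dataclass


@dataclass
class StoredRollout:
    text: str
    advantage: float
    step: int  # training step at which the rollout was generated


class RolloutReplayBuffer:
    """Hypothetical rollout-level buffer with age-based eviction."""

    def __init__(self, max_age):
        self.max_age = max_age
        self.items = []

    def add(self, text, advantage, step):
        self.items.append(StoredRollout(text, advantage, step))

    def evict_stale(self, current_step):
        # Drop rollouts older than max_age steps so replayed data stays fresh.
        self.items = [
            r for r in self.items if current_step - r.step <= self.max_age
        ]

    def sample_top(self, k):
        # Assumption: priority is |advantage|; the paper may define it differently.
        ranked = sorted(self.items, key=lambda r: abs(r.advantage), reverse=True)
        return ranked[:k]


buf = RolloutReplayBuffer(max_age=2)
buf.add("a", 0.1, step=0)
buf.add("b", -1.5, step=3)
buf.add("c", 0.7, step=4)
buf.evict_stale(current_step=4)
top = buf.sample_top(1)
```

In this toy run the rollout from step 0 is evicted as stale, and the remaining rollout with the largest absolute advantage is selected for replay.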

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

CAST: Non-Privileged Clipped Asymmetric Self-Teaching with Advantage Flipping for GRPO

Researchers propose CAST, a new self-distillation method for reinforcement learning in large language models that improves upon existing approaches by using answer-free teacher scoring and bidirectional advantage flipping. The method addresses limitations in Group Relative Policy Optimization (GRPO) by providing denser token-level guidance while maintaining alignment with trajectory correctness, demonstrating improvements in mathematical reasoning tasks.

AI · Bullish · arXiv – CS AI · Jun 1 · 6/10

Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO

Researchers propose S2L-PO, a framework that uses smaller language models as natural policy explorers to train larger models more efficiently. By leveraging the inherent policy-level diversity of smaller models rather than token-level randomness, the approach achieves significant accuracy improvements on mathematical reasoning tasks while reducing computational costs.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

OISD: On-Policy Internal Self-Distillation of Language Models

Researchers introduce OISD, a new reinforcement learning framework that improves language model reasoning by having the final layer act as an internal teacher to guide intermediate layers through logit and attention alignment. The method demonstrates consistent improvements across mathematical reasoning tasks without requiring external data.

AI · Bullish · arXiv – CS AI · May 29 · 6/10

HPO: Hysteretic Policy Optimization for Stable and Efficient Training under Sparse-Reward Regime

Researchers propose Hysteretic Policy Optimization (HPO), a refinement of GRPO that addresses training instability in sparse-reward environments by downweighting negative-advantage updates and normalizing by the mean response length rather than each response's own length. The adaptive variant (A-HPO) achieves a 15% reward improvement over GRPO on benchmark tasks.
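
A rough way to see how the two changes described in the summary interact is the toy objective below. It is a simplified policy-gradient surrogate with no importance ratio or clipping, and the function name, the `neg_weight` parameter, and its value of 0.5 are assumptions for illustration rather than details from the paper.

```python
def hysteretic_group_objective(token_logprobs, advantages, neg_weight=0.5):
    """Toy surrogate for one group of responses (hypothetical sketch).

    token_logprobs: one list of per-token log-probabilities per response.
    advantages: one scalar advantage per response.
    neg_weight: factor below 1 that shrinks negative-advantage updates.
    """
    # Normalize by the group's mean length, not each response's own length.
    mean_len = sum(len(lp) for lp in token_logprobs) / len(token_logprobs)
    total = 0.0
    for lp, adv in zip(token_logprobs, advantages):
        weight = 1.0 if adv >= 0 else neg_weight
        total += weight * adv * sum(lp) / mean_len
    return total / len(token_logprobs)


logprobs = [[-0.5, -0.5], [-1.0, -1.0, -1.0, -1.0]]
advs = [1.0, -1.0]
damped = hysteretic_group_objective(logprobs, advs, neg_weight=0.5)
undamped = hysteretic_group_objective(logprobs, advs, neg_weight=1.0)
```

With `neg_weight=1.0` the negative-advantage response contributes at full strength; lowering it shrinks that contribution while leaving the positive-advantage term untouched.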

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Cross-Entropy Games and Frost Training

Researchers introduce Frost Training, a novel method that applies gradient-based optimization from embedding space to improve LLM policy training on Cross-Entropy Games. The technique leverages signals previously used only in adversarial jailbreaking to accelerate model performance, achieving higher quality outputs faster in Monte Carlo-based optimization tasks.

AI · Bullish · arXiv – CS AI · May 27 · 6/10

Tournament-GRPO: Group-Wise Tournament Rewards for Reinforcement Learning in Open-Ended Long-Form Generation

Researchers propose Tournament-GRPO, a novel reinforcement learning framework that uses group-wise tournament comparisons instead of absolute scoring to improve long-form text generation. By converting rubric-based LLM judgments into relative rewards through competitive rankings, the method achieves a 4.52-point improvement over existing approaches on Deep Research Bench.
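
One simple way to turn pairwise judgments into relative rewards is a round-robin tournament scored by win rate, sketched below. This is an illustration of the general idea only: the function name, the `prefer` callback, and the length-based toy judge are assumptions standing in for the rubric-based LLM judge the summary describes, and the paper's actual tournament format may differ.

```python
from itertools import combinations


def tournament_rewards(responses, prefer):
    """Round-robin win rates for a group of at least two responses.

    prefer(a, b) is a hypothetical judge that returns True when a beats b.
    """
    wins = [0] * len(responses)
    for i, j in combinations(range(len(responses)), 2):
        if prefer(responses[i], responses[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    matches_per_response = len(responses) - 1
    return [w / matches_per_response for w in wins]


# Toy judge: prefer the longer response (a stand-in for an LLM judge).
rewards = tournament_rewards(["a", "bbb", "cc"], lambda x, y: len(x) > len(y))
```

The resulting win rates depend only on how responses compare with one another inside the group, not on any absolute score from the judge.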

AI · Bullish · arXiv – CS AI · May 12 · 6/10

A Unified Pair-GRPO Family: From Implicit to Explicit Preference Constraints for Stable and General RL Alignment

Researchers propose Pair-GRPO, a unified theoretical framework for LLM alignment that addresses instability and interpretability issues in reinforcement learning from human preferences. The method introduces Soft-Pair-GRPO and Hard-Pair-GRPO variants with proven gradient equivalence, monotonic policy improvement, and superior performance on standard benchmarks.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Structured Role-Aware Policy Optimization for Multimodal Reasoning

Researchers introduce Structured Role-Aware Policy Optimization (SRPO), a reinforcement learning method that improves multimodal AI reasoning by assigning credit to different token types based on their functional roles. The approach enhances vision-language models' ability to ground answers in visual evidence without requiring external reward models, advancing more reliable multimodal reasoning systems.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair

Researchers present a signal-reshaping framework for GRPO (Group Relative Policy Optimization) that improves code-agent reinforcement learning under weak feedback conditions. The approach combines layered rewards, process-level credit assignment, and execution-aware rollout governance to increase strict compile-and-semantic accuracy from 38.5% to 53.5% on agentic code repair tasks.

AI · Bullish · arXiv – CS AI · Apr 14 · 6/10

Interactive Learning for LLM Reasoning

Researchers introduce ILR, a novel multi-agent learning framework in which Large Language Models strengthen their independent reasoning through interactive training with other LLMs and then solve problems on their own, without re-running the multi-agent system. The approach combines dynamic interaction strategies and perception calibration, delivering performance improvements of up to 5% across mathematical, coding, and reasoning benchmarks.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

GRPO and Reflection Reward for Mathematical Reasoning in Large Language Models

Researchers propose combining GRPO (Group Relative Policy Optimization) with reflection reward mechanisms to enhance mathematical reasoning in large language models. The four-stage framework encourages self-reflective capabilities during training and demonstrates state-of-the-art performance, outperforming existing methods such as supervised fine-tuning and LoRA.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

SmoothVLA: Aligning Vision-Language-Action Models with Physical Constraints via Intrinsic Smoothness Optimization

Researchers introduce SmoothVLA, a new reinforcement learning framework that improves robot control by optimizing both task performance and motion smoothness. The system addresses the trade-off between stability and exploration in Vision-Language-Action models, achieving 13.8% better smoothness than standard RL methods.

AI · Bullish · arXiv – CS AI · Mar 16 · 6/10

Information-Consistent Language Model Recommendations through Group Relative Policy Optimization

Researchers developed a new reinforcement learning framework using Group Relative Policy Optimization (GRPO) to make Large Language Models provide consistent recommendations across semantically equivalent prompts. The method addresses a critical enterprise need for reliable AI systems in business domains like finance and customer support, where inconsistent responses undermine trust and compliance.

AI · Neutral · arXiv – CS AI · Mar 4 · 5/10 · 3

ShipTraj-R1: Reinforcing Ship Trajectory Prediction in Large Language Models via Group Relative Policy Optimization

Researchers propose ShipTraj-R1, a novel LLM-based framework using group relative policy optimization (GRPO) for ship trajectory prediction. The system reformulates trajectory prediction as a text-to-text generation problem and demonstrates superior performance compared to existing deep learning baselines on real-world maritime datasets.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 8

DIVA-GRPO: Enhancing Multimodal Reasoning through Difficulty-Adaptive Variant Advantage

Researchers have developed DIVA-GRPO, a new reinforcement learning method that improves multimodal large language model reasoning by adaptively adjusting problem difficulty distributions. The approach addresses key limitations in existing group relative policy optimization methods, showing superior performance across six reasoning benchmarks.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 7

MIST-RL: Mutation-based Incremental Suite Testing via Reinforcement Learning

Researchers propose MIST-RL, a reinforcement learning framework that improves AI code generation by creating more efficient test suites. The method achieves 28.5% higher fault detection while using 19.3% fewer test cases, demonstrating significant improvements in AI code verification efficiency.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 7

Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design

Researchers introduce Dr. Seg, a new framework that improves Group Relative Policy Optimization (GRPO) training for Visual Large Language Models by addressing key differences between language reasoning and visual perception tasks. The framework includes a Look-to-Confirm mechanism and Distribution-Ranked Reward module that enhance performance in complex visual scenarios without requiring architectural changes.
