y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#grpo News & Analysis

29 articles tagged with #grpo. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

29 articles
AIBullisharXiv – CS AI · 3d ago7/10
🧠

GRPO is Secretly a Process Reward Model

Researchers demonstrate that Group Relative Policy Optimization (GRPO), a popular reinforcement learning algorithm using outcome rewards, mathematically functions as an implicit process reward model. The discovery enables algorithmic improvements (λ-GRPO) that enhance large language model performance on reasoning tasks without explicit process reward implementation or significant computational overhead.

AIBullisharXiv – CS AI · 5d ago7/10
🧠

Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning

Search-E1 introduces a simplified self-evolution method for search-augmented reasoning agents that achieves competitive performance through vanilla GRPO and self-distillation, without external supervision or complex auxiliary systems. The approach reaches 0.440 average EM on QA benchmarks with Qwen2.5-3B, demonstrating that elaborate post-training machinery may be unnecessary for effective agent development.

AIBullisharXiv – CS AI · May 117/10
🧠

Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

Researchers introduce rubric-grounded reinforcement learning, a framework that trains AI models using structured, multi-criterion rewards from an LLM judge rather than binary outcomes. Training Llama-3.1-8B on scientific documents achieved 71.7% normalized reward and demonstrated improved performance on multiple reasoning benchmarks, suggesting that document-grounded training signals can produce generalizable reasoning capabilities.

🧠 Llama
AIBullisharXiv – CS AI · May 97/10
🧠

Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration

Researchers propose Lorem Perturbation for Exploration (LoPE), a training technique that addresses the zero-advantage problem in reinforcement learning for large language models by prepending random Latin-based text to prompts, enabling broader reasoning exploration across 1.7B to 7B parameter models.

🏢 Perplexity
AIBullisharXiv – CS AI · Apr 137/10
🧠

Listener-Rewarded Thinking in VLMs for Image Preferences

Researchers introduce a listener-augmented reinforcement learning framework for training vision-language models to better align with human visual preferences. By using an independent frozen model to evaluate and validate reasoning chains, the approach achieves 67.4% accuracy on ImageReward benchmarks and demonstrates significant improvements in out-of-distribution generalization.

🏢 Hugging Face
AIBullisharXiv – CS AI · Mar 117/10
🧠

Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO

Researchers introduce Stepwise Guided Policy Optimization (SGPO), a new framework that improves upon Group Relative Policy Optimization (GRPO) by learning from incorrect reasoning responses in large language model training. SGPO addresses the limitation where GRPO fails to update policies when all responses in a group are incorrect, showing improved performance across multiple model sizes and reasoning benchmarks.

AIBullishSynced Review · Apr 247/105
🧠

Can GRPO be 10x Efficient? Kwai AI’s SRPO Suggests Yes with SRPO

Kwai AI has developed SRPO, a new reinforcement learning framework that reduces LLM post-training steps by 90% while achieving performance comparable to DeepSeek-R1 in mathematics and coding tasks. The two-stage approach with history resampling addresses efficiency limitations in existing GRPO methods.

AIBullisharXiv – CS AI · 10h ago6/10
🧠

Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO

Researchers propose S2L-PO, a framework that uses smaller language models as natural policy explorers to train larger models more efficiently. By leveraging the inherent policy-level diversity of smaller models rather than token-level randomness, the approach achieves significant accuracy improvements on mathematical reasoning tasks while reducing computational costs.

AINeutralarXiv – CS AI · 3d ago6/10
🧠

OISD: On-Policy Internal Self-Distillation of Language Models

Researchers introduce OISD, a new reinforcement learning framework that improves language model reasoning by having the final layer act as an internal teacher to guide intermediate layers through logit and attention alignment. The method demonstrates consistent improvements across mathematical reasoning tasks without requiring external data.

AIBullisharXiv – CS AI · 3d ago6/10
🧠

HPO: Hysteretic Policy Optimization for Stable and Efficient Training under Sparse-Reward Regime

Researchers propose Hysteretic Policy Optimization (HPO), a refinement to GRPO reinforcement learning that addresses training instability in sparse-reward environments by downweighting negative-advantage updates and normalizing by mean length rather than per-response length. The adaptive variant (A-HPO) achieves 15% reward improvement over GRPO on benchmark tasks.

AINeutralarXiv – CS AI · 4d ago6/10
🧠

Cross-Entropy Games and Frost Training

Researchers introduce Frost Training, a novel method that applies gradient-based optimization from embedding space to improve LLM policy training on Cross-Entropy Games. The technique leverages signals previously used only in adversarial jailbreaking to accelerate model performance, achieving higher quality outputs faster in Monte Carlo-based optimization tasks.

AIBullisharXiv – CS AI · 5d ago6/10
🧠

Tournament-GRPO: Group-Wise Tournament Rewards for Reinforcement Learning in Open-Ended Long-Form Generation

Researchers propose Tournament-GRPO, a novel reinforcement learning framework that uses group-wise tournament comparisons instead of absolute scoring to improve long-form text generation. By converting rubric-based LLM judgments into relative rewards through competitive rankings, the method achieves 4.52-point improvements over existing approaches on Deep Research Bench benchmarks.

AIBullisharXiv – CS AI · May 126/10
🧠

A Unified Pair-GRPO Family: From Implicit to Explicit Preference Constraints for Stable and General RL Alignment

Researchers propose Pair-GRPO, a unified theoretical framework for LLM alignment that addresses instability and interpretability issues in reinforcement learning from human preferences. The method introduces Soft-Pair-GRPO and Hard-Pair-GRPO variants with proven gradient equivalence, monotonic policy improvement, and superior performance on standard benchmarks.

AINeutralarXiv – CS AI · May 116/10
🧠

Structured Role-Aware Policy Optimization for Multimodal Reasoning

Researchers introduce Structured Role-Aware Policy Optimization (SRPO), a reinforcement learning method that improves multimodal AI reasoning by assigning credit to different token types based on their functional roles. The approach enhances vision-language models' ability to ground answers in visual evidence without requiring external reward models, advancing more reliable multimodal reasoning systems.

AINeutralarXiv – CS AI · May 116/10
🧠

Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair

Researchers present a signal-reshaping framework for GRPO (Group Relative Policy Optimization) that improves code-agent reinforcement learning under weak feedback conditions. The approach combines layered rewards, process-level credit assignment, and execution-aware rollout governance to increase strict compile-and-semantic accuracy from 38.5% to 53.5% on agentic code repair tasks.

AIBullisharXiv – CS AI · Apr 146/10
🧠

Interactive Learning for LLM Reasoning

Researchers introduce ILR, a novel multi-agent learning framework that enables Large Language Models to enhance their independent reasoning through interactive training with other LLMs, then solve problems autonomously without re-executing the multi-agent system. The approach combines dynamic interaction strategies and perception calibration, delivering up to 5% performance improvements across mathematical, coding, and reasoning benchmarks.

AIBullisharXiv – CS AI · Mar 176/10
🧠

GRPO and Reflection Reward for Mathematical Reasoning in Large Language Models

Researchers propose GRPO (Group Relative Policy Optimization) combined with reflection reward mechanisms to enhance mathematical reasoning in large language models. The four-stage framework encourages self-reflective capabilities during training and demonstrates state-of-the-art performance over existing methods like supervised fine-tuning and LoRA.

AIBullisharXiv – CS AI · Mar 166/10
🧠

Information-Consistent Language Model Recommendations through Group Relative Policy Optimization

Researchers developed a new reinforcement learning framework using Group Relative Policy Optimization (GRPO) to make Large Language Models provide consistent recommendations across semantically equivalent prompts. The method addresses a critical enterprise need for reliable AI systems in business domains like finance and customer support, where inconsistent responses undermine trust and compliance.

AINeutralarXiv – CS AI · Mar 37/108
🧠

DIVA-GRPO: Enhancing Multimodal Reasoning through Difficulty-Adaptive Variant Advantage

Researchers have developed DIVA-GRPO, a new reinforcement learning method that improves multimodal large language model reasoning by adaptively adjusting problem difficulty distributions. The approach addresses key limitations in existing group relative policy optimization methods, showing superior performance across six reasoning benchmarks.

AIBullisharXiv – CS AI · Mar 37/107
🧠

MIST-RL: Mutation-based Incremental Suite Testing via Reinforcement Learning

Researchers propose MIST-RL, a reinforcement learning framework that improves AI code generation by creating more efficient test suites. The method achieves 28.5% higher fault detection while using 19.3% fewer test cases, demonstrating significant improvements in AI code verification efficiency.

AIBullisharXiv – CS AI · Mar 36/107
🧠

Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design

Researchers introduce Dr. Seg, a new framework that improves Group Relative Policy Optimization (GRPO) training for Visual Large Language Models by addressing key differences between language reasoning and visual perception tasks. The framework includes a Look-to-Confirm mechanism and Distribution-Ranked Reward module that enhance performance in complex visual scenarios without requiring architectural changes.

Page 1 of 2Next →