#credit-assignment News & Analysis

51 articles tagged with #credit-assignment. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

Group-Graph Policy Optimization for Long-Horizon Agentic Reinforcement Learning

Researchers propose Group-Graph Policy Optimization (G2PO), a novel reinforcement learning algorithm that transforms linear interaction trajectories into state-transition graphs to improve credit assignment in long-horizon agentic tasks. The method demonstrates significant performance improvements on benchmark tasks like WebShop and ALFWorld, achieving up to 22.2% success rate gains over existing approaches.
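
The idea of turning linear trajectories into a state-transition graph can be illustrated with a small sketch. This is not the G2PO algorithm itself, whose details the summary does not give; it only assumes that states are hashable, that each trajectory carries a single final outcome, and that a state's value is the average outcome of the trajectories passing through it. The function names and the toy shopping states are hypothetical.

```python
from collections import defaultdict

def state_values_from_trajectories(trajectories):
    """Merge trajectories into one graph keyed by state.

    Each trajectory is (list_of_states, final_outcome). A state's value is
    the mean outcome over all trajectories that visited it (an assumption
    made for illustration, not the paper's estimator).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    edges = defaultdict(set)
    for states, outcome in trajectories:
        for s, s_next in zip(states, states[1:]):
            edges[s].add(s_next)          # record the observed transition
        for s in set(states):             # count each state once per trajectory
            totals[s] += outcome
            counts[s] += 1
    values = {s: totals[s] / counts[s] for s in counts}
    return values, edges

def step_credit(states, values):
    """Credit each step with the change in estimated value it produced."""
    return [values[b] - values[a] for a, b in zip(states, states[1:])]

# Toy data: three rollouts that share some states.
trajectories = [
    (["start", "search", "item_a", "buy_a"], 1.0),
    (["start", "search", "item_b", "buy_b"], 0.0),
    (["start", "browse", "item_b", "buy_b"], 0.0),
]
values, edges = state_values_from_trajectories(trajectories)
credits = step_credit(trajectories[0][0], values)
```

Because shared states pool evidence from several rollouts, the successful trajectory gets different credit at each step (most at the step that reaches `item_a`) instead of one outcome-level signal spread evenly over all steps.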

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

3SPO: State-Score-Supervised Policy Optimization for LLM Agents

Researchers introduce 3SPO (State-Score-Supervised Policy Optimization), a reinforcement learning algorithm that optimizes LLM agent policies at each step rather than after complete episodes, addressing credit assignment challenges in sparse-reward environments. Experiments demonstrate 22.6% improvement over existing methods on ALFWorld benchmarks with 2.4x more state exploration and 1.8x faster convergence.

AI · Bullish · arXiv – CS AI · Jun 4 · 7/10

Reinforcement Learning from Rich Feedback with Distributional DAgger

Researchers introduce DistIL, a distributional variant of the DAgger imitation learning algorithm that leverages rich feedback signals beyond binary correctness labels to improve AI reasoning models. The approach uses forward cross-entropy objectives to enable better credit assignment and demonstrates monotonic policy improvement guarantees, outperforming standard reinforcement learning methods across scientific reasoning, coding, and mathematical problem-solving tasks.

AI · Bullish · arXiv – CS AI · Jun 1 · 7/10

HiPER: Hierarchical Reinforcement Learning with Explicit Credit Assignment for Large Language Model Agents

Researchers introduce HiPER, a hierarchical reinforcement learning framework that separates high-level planning from low-level execution for training LLM agents. The approach uses hierarchical advantage estimation to improve credit assignment in sparse-reward environments, achieving state-of-the-art results on interactive benchmarks with significant gains on long-horizon tasks.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

Beyond Trajectory-Level Attribution: Graph-Based Credit Assignment for Agentic Reinforcement Learning

Researchers propose GraphGPO, a novel reinforcement learning method that improves credit assignment in agentic tasks by aggregating trajectories into a state-transition graph rather than relying on coarse-grained outcome-based attribution. This approach enables step-level credit recognition and achieves state-of-the-art performance on challenging benchmarks while significantly improving training efficiency.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

Credit Assignment with Resets in Language Model Reasoning

Researchers propose SRPO (Self-Reset Policy Optimization), a novel method that improves how language models learn from reasoning tasks by identifying and isolating problematic reasoning steps rather than treating entire solution trajectories uniformly. The technique uses the model itself to self-localize errors and reset to those points for resampling, outperforming standard approaches like GRPO without requiring external supervision.
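
For context, the uniform treatment that this summary contrasts against can be sketched as follows. The snippet shows the commonly described GRPO-style group-relative advantage: rewards for a group of sampled responses are normalized by the group mean and standard deviation, and every token of a response receives the same value. It is a simplified illustration, not SRPO, and the helper names, the `eps` term, and the toy rewards are assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each response's reward against its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)      # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

def broadcast_to_tokens(advantages, token_counts):
    """Give every token in a response the same advantage (uniform credit)."""
    return [[a] * n for a, n in zip(advantages, token_counts)]

# Toy group: four sampled answers to one prompt, two of them correct.
rewards = [1.0, 0.0, 0.0, 1.0]
token_counts = [3, 5, 2, 4]
advs = group_relative_advantages(rewards)
per_token = broadcast_to_tokens(advs, token_counts)
```

Every token in a failed response is penalized equally here, including tokens from reasoning steps that were correct, which is the limitation that step-localizing methods aim to address.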

AI · Bearish · arXiv – CS AI · May 12 · 7/10

Not All Turns Matter: Credit Assignment for Multi-Turn Jailbreaking

Researchers propose TRACE, a credit assignment framework that improves multi-turn jailbreak attacks on large language models by identifying which dialogue turns actually contribute to harmful outcomes. The method achieves 25% higher attack success rates than existing approaches and can be repurposed to strengthen AI safety defenses.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning

Researchers propose a novel reinforcement learning framework that automatically generates process-level supervision from outcome-only feedback, eliminating the need for costly external process supervision. This approach enables fine-grained credit assignment in reasoning tasks by having models identify and learn from their own failed trajectories.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR

Researchers propose Selective Eligibility Traces (S-trace), a new method for reinforcement learning that improves credit assignment in large language models by selectively identifying critical reasoning steps rather than uniformly crediting entire trajectories. The approach demonstrates performance gains of 0.49-3.16% across Qwen models while improving sample and token efficiency compared to existing critic-free algorithms.
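
As background, the classical eligibility-trace mechanism that the method's name refers to can be sketched with tabular TD(λ) and accumulating traces. This is the textbook algorithm, not S-trace; the summary does not describe how S-trace selects critical steps, so no selection rule is shown. The state names and hyperparameter values are illustrative assumptions.

```python
def td_lambda_episode(states, rewards, values, alpha=0.1, gamma=0.99, lam=0.9):
    """Run one episode of tabular TD(lambda) with accumulating traces.

    states has one more entry than rewards (it includes the terminal state).
    Returns an updated copy of the value table.
    """
    values = dict(values)
    traces = {s: 0.0 for s in values}
    for t, r in enumerate(rewards):
        s, s_next = states[t], states[t + 1]
        delta = r + gamma * values[s_next] - values[s]   # TD error
        traces[s] += 1.0                                 # mark the visited state
        for k in traces:
            values[k] += alpha * delta * traces[k]       # credit by trace size
            traces[k] *= gamma * lam                     # decay all traces
    return values

# Toy episode with a single reward at the end.
states = ["a", "b", "c", "end"]
rewards = [0.0, 0.0, 1.0]
initial = {s: 0.0 for s in states}
updated = td_lambda_episode(states, rewards, initial)
```

In this classical form every earlier state receives a share of the final reward that decays only with its distance from the reward, which is the uniform behaviour a selective variant would presumably modify.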

AI · Bullish · arXiv – CS AI · May 9 · 7/10

Milestone-Guided Policy Learning for Long-Horizon Language Agents

Researchers introduce BEACON, a milestone-guided policy learning framework that significantly improves training efficiency for long-horizon language agents by solving credit misattribution and sample inefficiency problems. The approach achieves 92.9% success rates on complex tasks—nearly double previous benchmarks—while improving sample utilization from 23.7% to 82.0%.

AI · Bullish · arXiv – CS AI · May 4 · 7/10

AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

Researchers present AEM (Adaptive Entropy Modulation), a new credit assignment method for reinforcement learning that improves how language model agents learn from sparse rewards without requiring dense supervision. The technique adaptively modulates entropy during training to balance exploration and exploitation, achieving a 1.4% improvement on the challenging SWE-bench-Verified benchmark across models ranging from 1.5B to 32B parameters.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning

Researchers propose Generative Actor-Critic (GenAC), a new approach to value modeling in large language model reinforcement learning that uses chain-of-thought reasoning instead of one-shot scalar predictions. The method addresses a longstanding challenge in credit assignment by improving value approximation and downstream RL performance compared to existing value-based and value-free baselines.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models

Researchers introduce Perception-Grounded Policy Optimization (PGPO), a novel fine-tuning framework that improves how large vision-language models learn from visual inputs by strategically allocating learning signals to vision-dependent tokens rather than treating all tokens equally. Testing on the Qwen2.5-VL series demonstrates an average 18.7% performance boost across multimodal reasoning benchmarks.

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

Hindsight Credit Assignment for Long-Horizon LLM Agents

Researchers introduce HCAPO, a new framework that uses hindsight credit assignment to improve the performance of Large Language Model agents on long-horizon tasks. The system uses LLMs as post-hoc critics to refine decision-making, achieving improvements of 7.7% on WebShop and 13.8% on ALFWorld over existing methods.

AI · Bullish · arXiv – CS AI · Jun 25 · 6/10

BiPACE: Bisimulation-Guided Policy Optimization with Action Counterfactual Estimation for LLM Agents

Researchers introduce BiPACE, a novel advantage estimation method for training large language model agents that improves upon existing group-based reinforcement learning approaches. The method addresses fundamental credit assignment problems by using bisimulation-guided clustering and action-conditioned baselines, achieving significant performance improvements on benchmark tasks without requiring additional critics or rollouts.

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

Learning with a Single Rollout via Monte Carlo Pass@k Critic

Researchers propose SR-PPO, a reinforcement learning method that trains language models using single rollouts and Monte Carlo Pass@k critics for token-level credit assignment. The approach reduces computational costs while improving reasoning performance on mathematical benchmarks like HMMT26 and AIME24 by using reachability-based advantage estimation instead of repeated sampling.
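
For readers unfamiliar with the quantity involved, the standard Pass@k metric can be computed from repeated samples as shown below. This is the widely used unbiased estimator over n samples with c correct ones, given only as background; how SR-PPO's critic estimates Pass@k from a single rollout is not described in the summary and is not shown here. The function name is an assumption.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn without
    replacement from n samples of which c are correct, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)

single_try = pass_at_k(10, 3, 1)   # 3 of 10 samples correct, one attempt
five_tries = pass_at_k(10, 3, 5)   # same pool, five attempts
```

The repeated sampling this estimator needs is the cost that a learned Pass@k critic is meant to avoid.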

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

Semantic Consistency Policy Optimization for Reinforcement Learning of LLM Agents

Researchers propose Semantic Consistency Policy Optimization (SCPO), a training method that improves how large language model agents learn from reinforcement learning by addressing a fundamental inconsistency: semantically similar intermediate steps receive contradictory credit signals based on whether their trajectory ultimately succeeds or fails. The approach recovers step-level credit from successful rollouts, achieving state-of-the-art performance on complex reasoning tasks like ALFWorld and WebShop.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

ARCO: Adaptive Rubric with Co-Evolution for Multi-Step LLM-Based Agents

ARCO introduces an adaptive rubric framework that enables large language model agents to receive step-level interpretable rewards during multi-step reasoning tasks. By jointly evolving the reward rubric and policy through co-training, the method achieves stronger performance on question-answering benchmarks while providing explainable feedback that clarifies why each step in a trajectory succeeds or fails.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Modularized Reinforcement Learning on LLMs: From MDP Creation to Exploration and Learning

A comprehensive survey maps reinforcement learning algorithm design decisions across three stages—MDP creation, exploration strategies, and learning approaches—revealing significant research gaps in LLM training where value-based methods and off-policy techniques remain underexplored despite proven effectiveness in classical RL.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Learning Process Rewards via Success Visitation Matching for Efficient RL

Researchers propose a novel reinforcement learning approach that converts sparse task rewards into dense process rewards by training a discriminator to identify successful episodes and incentivize policies to match their state-action visitations. The method demonstrates significantly faster training on robotic manipulation tasks without altering the optimal policy.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

HERO: Hindsight-Enhanced Reflection from Environment Observations for Agentic Self-Distillation

Researchers introduce HERO, a self-distillation framework for reinforcement learning agents that uses environment observations as feedback to improve multi-turn decision-making. The method addresses credit assignment problems in sequential tasks by converting observations into actionable diagnoses, outperforming existing approaches on benchmark tasks with limited training data.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

APPO: Agentic Procedural Policy Optimization

Researchers propose Agentic Procedural Policy Optimization (APPO), a new reinforcement learning method that improves how AI agents learn to use tools by identifying fine-grained decision points rather than relying on coarse tool-call boundaries. The approach achieves improvements of roughly 4 points across 13 benchmarks while maintaining efficiency and interpretability.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

SD-GRPO: Verifiable Segment Decomposition for Long-Form Vision-Language Generation

Researchers propose SD-GRPO, a new machine learning technique that improves how multimodal AI systems generate long-form responses by analyzing outputs in semantic segments rather than as a single unit. The method addresses a fundamental limitation in existing GRPO frameworks when applied to vision-language tasks, showing consistent performance improvements across controlled and real-world benchmarks.

AI · Bullish · arXiv – CS AI · Jun 10 · 6/10

Divide and Cooperate: Role-Decomposed Multi-Agent LLM Training with Cross-Agent Learning Signals

Researchers propose DAC (Divide and Cooperate), a multi-agent training framework that separates evidence retrieval and answer generation into two specialized agents with cross-agent learning signals. This approach addresses credit assignment problems in language models performing multi-step reasoning and achieves competitive performance using parameter-efficient LoRA modules, outperforming full fine-tuning baselines on QA benchmarks.

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

LEAF: Growing Trees Without Branching for Speech-Aware Large Language Model Post-Training

LEAF (Low-rank Exploration with Adaptive Forking) introduces a novel tree-based reinforcement learning method for training speech-aware large language models that improves credit assignment by identifying shared response prefixes and assigning rewards at the span level rather than uniformly across tokens. The approach achieves superior performance compared to existing GRPO-style methods without requiring additional computational overhead, enabling smaller models to match or exceed larger baselines.
