#credit-assignment News & Analysis

51 articles tagged with #credit-assignment. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

TAPO: Tool-Aware Policy Optimization via Credit Transfer for Multimodal Search Agents

Researchers propose TAPO (Tool-Aware Policy Optimization), a method that corrects credit misassignment in reinforcement learning for multimodal search agents. The technique improves training efficiency for tool-using AI systems and is reported to deliver consistent improvements across multiple benchmarks without additional annotations or computational overhead.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

RREDCoT: Segment-Level Reward Redistribution for Reasoning Models

Researchers introduce RREDCoT, a novel method for improving reasoning language models by redistributing rewards at the segment level during reinforcement learning training. The approach addresses the high variance problem inherent in current Chain-of-Thought optimization methods by using the model itself to estimate which parts of reasoning traces deserve higher rewards, without requiring expensive additional computation.
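As a rough illustration of what segment-level reward redistribution means in general, the sketch below splits one trajectory-level reward across segments in proportion to per-segment scores. It is not RREDCoT's algorithm: the function name `redistribute_reward` and the assumption that scores are non-negative numbers supplied by the caller (for example, estimated by the model itself) are hypothetical.

```python
def redistribute_reward(total_reward, segment_scores):
    """Split one trajectory-level reward across segments.

    Hypothetical sketch: each segment receives a share proportional to its
    score, so the shares always sum back to the original reward.
    """
    if not segment_scores:
        return []
    if any(score < 0 for score in segment_scores):
        raise ValueError("segment scores must be non-negative")
    total_score = sum(segment_scores)
    if total_score == 0:
        # No segment stands out: fall back to a uniform split.
        share = total_reward / len(segment_scores)
        return [share] * len(segment_scores)
    return [total_reward * score / total_score for score in segment_scores]


if __name__ == "__main__":
    # Three reasoning segments; the second is judged most important.
    print(redistribute_reward(1.0, [1, 2, 1]))
```

Keeping the shares summing to the original reward means the redistribution changes where credit lands within a trace, not how much reward the trace earns overall.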

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

Trace-Mediated Peak Bias: Bridging Temporal Credit Assignment and Cognitive Heuristics in Deep Reinforcement Learning

Researchers identify Trace-Mediated Peak Bias (TMPB), a systematic failure in deep reinforcement learning where agents irrationally prioritize high-magnitude reward spikes over trajectories with greater cumulative returns. This phenomenon mirrors the human Peak-End Rule cognitive bias and reveals how mathematical constraints in credit assignment systems naturally produce human-like value distortions, with adaptive optimizers offering a potential solution.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

ARCA: Adapter-Residual Credit Assignment When Token Signals Degenerate

Researchers propose ARCA, a new token-level credit assignment method for language model reinforcement learning that addresses degradation issues in parameter-efficient fine-tuning approaches like LoRA. By measuring where adapters actually modify hidden states rather than tracking output distribution shifts, ARCA provides non-degenerate credit signals competitive with existing baselines while requiring no additional learned components.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering

Researchers introduce SPADER, a reinforcement learning framework that enables large language models to discover multiple valid answers to complex questions through tool-augmented search. The system combines step-wise credit assignment with diversity-aware rewards to improve recall and F1 scores across multiple QA benchmarks.

AI · Bullish · arXiv – CS AI · Jun 1 · 6/10

Planner-Centric Reinforcement Learning for Deep Research with Structure-Aware Reward

Researchers introduce DecomposeR, a framework that trains language models to conduct deep research by explicitly representing plans as directed acyclic graphs rather than flat trajectories. The approach separates planning and execution into two distinct reinforcement learning stages, improving long-form answer generation by 5.1-8.0 points over comparable baselines on benchmark datasets.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Score Broadcast and Decorrelation: A General Framework for Broadcast-Based Credit Assignment

Researchers introduce Score Broadcast and Decorrelation (SBD), a theoretical framework that generalizes biologically plausible credit assignment mechanisms across diverse loss functions beyond MSE. The framework unifies error broadcast—an alternative to backpropagation that avoids weight transport—under a single orthogonality principle, with experimental validation showing improvements over existing broadcast approaches on image classification tasks.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

The Terminal Representation in Reinforcement Learning

Researchers introduce the Terminal Representation (TR), a novel approach to representation learning in reinforcement learning that encodes reward-weighted trajectories more efficiently than existing methods. The TR achieves comparable performance to established approaches like the Default Representation while reducing computational overhead and eliminating assumptions about symmetric transition dynamics.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Beyond Trajectory Rewards: Step-level Credit Assignment for Agentic Search via Graph Modeling

Researchers introduce Graph-Distance Contribution Reward (GDCR), a novel step-level credit assignment method for agentic search that evaluates individual agent actions by measuring progress toward answer nodes in knowledge graphs. Combined with Step Advantage Policy Optimization (SAPO), this approach improves upon trajectory-level reward systems that cannot assess the quality of intermediate steps, showing strong results across multiple benchmarks.
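The idea of rewarding a step by its progress toward answer nodes can be sketched with plain shortest-path distances. The code below is only an illustration under stated assumptions, not the paper's GDCR definition: the graph is an undirected adjacency dict, a step's reward is the drop in hop distance to the nearest answer node, and the names `distances_to_answers` and `step_rewards` are hypothetical.

```python
from collections import deque


def distances_to_answers(graph, answers):
    """Hop distance from each reachable node to its nearest answer node (BFS)."""
    dist = {node: 0 for node in answers}
    queue = deque(dist)
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def step_rewards(graph, path, answers):
    """Reward each move along `path` by how much closer it gets to an answer.

    Assumption for this sketch: nodes that cannot reach any answer are
    treated as being len(graph) hops away.
    """
    dist = distances_to_answers(graph, answers)
    far = len(graph)
    return [
        dist.get(prev, far) - dist.get(curr, far)
        for prev, curr in zip(path, path[1:])
    ]


if __name__ == "__main__":
    # Chain a - b - c - d, with d as the answer node.
    graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
    print(step_rewards(graph, ["a", "b", "a", "b", "c", "d"], ["d"]))
```

In this toy setup a step that moves toward the answer earns a positive reward and a backtracking step earns a negative one, which is the kind of intermediate signal a single trajectory-level reward cannot provide.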

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Unifying Temporal and Structural Credit Assignment in LLM-Based Multi-Agent Prompt Optimization

Researchers propose a novel method for optimizing multi-agent LLM systems by decomposing credit assignment into temporal and structural components, enabling more efficient prompt optimization through targeted refinement rather than global updates. The approach uses state-space bottleneck analysis and role-based policy isolation to identify and fix weak components in collaborative AI systems, reducing computational queries while improving reasoning performance across benchmarks.

AI · Bullish · arXiv – CS AI · May 29 · 6/10

Graph-Enhanced Policy Optimization in LLM Agent Training

Researchers present Graph-Enhanced Policy Optimization (GEPO), a new training framework for multi-step LLM agents that improves credit assignment by analyzing state-transition graphs and task relevance. The method achieves 1.1-3.8% performance gains across multiple benchmarks by differentiating the importance of individual steps and trajectories based on their structural and semantic roles.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning

Researchers introduce TRACER, a reinforcement learning framework that enables multiple large language models to collaborate effectively on reasoning tasks by learning when to speak and what to say through turn-level decision-making. The approach addresses key challenges in multi-agent AI systems including sparse rewards, computational inefficiency, and oscillating performance, demonstrating improvements across mathematical reasoning benchmarks.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning

StepOPSD is a reinforcement learning framework that improves credit assignment in multi-turn agent tasks by treating individual steps, rather than entire trajectories, as the unit of learning. The method achieves state-of-the-art results on benchmark tasks like ALFWorld and Search-QA, demonstrating that step-level preference distillation is particularly effective when trajectory rewards correlate poorly with individual decision quality.

AI · Bullish · arXiv – CS AI · May 12 · 6/10

MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs

Researchers introduce MemQ, a novel framework that applies Q-learning eligibility traces to episodic memory in large language model agents, enabling credit assignment across memory dependencies recorded in provenance DAGs. The approach achieves superior performance across six diverse benchmarks, with gains up to 5.7 percentage points on multi-step tasks requiring deep memory chains.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning

Researchers introduce PiCA (Pivot-Based Credit Assignment), a novel reinforcement learning mechanism that improves how LLM-based search agents learn from long sequences of actions. By identifying key pivot steps and anchoring rewards to final task outcomes, PiCA addresses critical challenges in credit assignment, delivering 15.2% performance gains on knowledge-intensive QA tasks.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Verifiable Process Rewards for Agentic Reasoning

Researchers introduce Verifiable Process Rewards (VPR), a framework that enhances reinforcement learning for large language models by providing dense, intermediate-level feedback during reasoning tasks rather than relying solely on sparse outcome-level rewards. The approach leverages symbolic, algorithmic, and probabilistic verification methods to improve credit assignment in long-horizon agentic reasoning, with theoretical and empirical validation across multiple benchmarks.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

EGL-SCA: Structural Credit Assignment for Co-Evolving Instructions and Tools in Graph Reasoning Agents

Researchers introduce EGL-SCA, a framework for graph reasoning agents that jointly optimizes both natural language instructions and computational tools through structural credit assignment. The system achieves 92.0% success rate on graph reasoning benchmarks by precisely routing failures to either prompt optimization or tool synthesis, outperforming isolated improvement approaches.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Structured Role-Aware Policy Optimization for Multimodal Reasoning

Researchers introduce Structured Role-Aware Policy Optimization (SRPO), a reinforcement learning method that improves multimodal AI reasoning by assigning credit to different token types based on their functional roles. The approach enhances vision-language models' ability to ground answers in visual evidence without requiring external reward models, advancing more reliable multimodal reasoning systems.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Learning CLI Agents with Structured Action Credit under Selective Observation

Researchers present a new approach to training CLI agents through reinforcement learning, introducing σ-Reveal for selective observation and A³ for credit assignment. The work addresses fundamental challenges in teaching AI systems to interact with command-line interfaces by leveraging structured action properties and proposing the ShellOps dataset for evaluation.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

In-Context Credit Assignment via the Core

Researchers propose a new mechanism for fairly distributing compensation among creators whose intellectual property appears in AI model context windows, using cooperative game theory's least core solution. The approach efficiently approximates fair value distribution while requiring significantly fewer computational resources than existing methods.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Exact Is Easier: Credit Assignment for Cooperative LLM Agents

Researchers present C3, a credit assignment method for cooperative multi-agent LLM systems that measures causal contribution exactly, without approximation, by exploiting deterministic interaction histories. The method outperforms existing baselines across six benchmarks while reducing training costs, and introduces the first method-agnostic auditing tools for evaluating multi-agent credit assignment quality.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

Skill1 is a unified reinforcement learning framework that enables language model agents to co-evolve three coupled capabilities (skill selection, utilization, and distillation) from a single task-outcome reward signal. Its reported improvements over existing baselines on complex tasks suggest progress in how AI agents build and reuse persistent skill libraries across diverse problem domains.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

Owen-Shapley Policy Optimization: A Principled RL Algorithm for Generative Search LLMs

Researchers introduce Owen-Shapley Policy Optimization (OSPO), a reinforcement learning algorithm that improves how language models learn from feedback by attributing credit to individual tokens rather than treating entire sequences as atomic units. The method addresses a fundamental training gap in generative AI systems used for recommendation tasks, showing measurable improvements on real e-commerce datasets.

AI · Neutral · arXiv – CS AI · May 4 · 6/10

PORTool: Importance-Aware Policy Optimization with Rewarded Tree for Multi-Tool-Integrated Reasoning

PORTool is a new policy-optimization algorithm that improves how AI agents learn to use external tools by solving the credit-assignment problem in multi-step reasoning tasks. The method uses a rewarded tree structure to assign rewards at individual steps rather than only at outcomes, enabling agents to achieve higher accuracy while reducing unnecessary tool calls.

AI · Neutral · arXiv – CS AI · Apr 10 · 6/10

Reason in Chains, Learn in Trees: Self-Rectification and Grafting for Multi-turn Agent Policy Optimization

Researchers propose T-STAR, a novel reinforcement learning framework that structures multi-step agent trajectories as trees rather than independent chains, enabling better credit assignment for LLM agents. The method uses tree-based reward propagation and surgical policy optimization to improve reasoning performance across embodied, interactive, and planning tasks.
