AI · Neutral · arXiv – CS AI · Jun 1 · 6/10
🧠Researchers introduce the Terminal Representation (TR), a novel approach to representation learning in reinforcement learning that encodes reward-weighted trajectories more efficiently than existing methods. The TR achieves comparable performance to established approaches like the Default Representation while reducing computational overhead and eliminating assumptions about symmetric transition dynamics.
AI · Neutral · arXiv – CS AI · May 29 · 6/10
🧠Researchers introduce Graph-Distance Contribution Reward (GDCR), a novel step-level credit assignment method for agentic search that evaluates individual agent actions by measuring progress toward answer nodes in knowledge graphs. Combined with Step Advantage Policy Optimization (SAPO), this approach improves upon trajectory-level reward systems that cannot assess the quality of intermediate steps, showing strong results across multiple benchmarks.
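The idea of rewarding a step by its progress toward an answer node can be illustrated with a minimal sketch. This is not the paper's GDCR formula, which the summary does not give; it assumes an unweighted graph, breadth-first hop distance, and a reward equal to the drop in distance to the nearest answer node. The graph, entity names, and function names are hypothetical.

```python
from collections import deque


def shortest_distance(graph, start, targets):
    """BFS hop distance from start to the nearest target; None if unreachable."""
    if start in targets:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt in seen:
                continue
            if nxt in targets:
                return dist + 1
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None


def step_rewards(graph, visited_nodes, answer_nodes):
    """Reward each step by how many hops closer it moved to an answer node."""
    targets = set(answer_nodes)
    dists = [shortest_distance(graph, n, targets) for n in visited_nodes]
    rewards = []
    for before, after in zip(dists, dists[1:]):
        if before is None or after is None:
            rewards.append(0.0)  # assumption: no signal when no answer is reachable
        else:
            rewards.append(float(before - after))
    return rewards


# Toy undirected knowledge graph with hypothetical entities.
graph = {
    "q": ["a", "b"],
    "a": ["q", "c"],
    "b": ["q"],
    "c": ["a", "answer"],
    "answer": ["c"],
}
trajectory = ["q", "b", "q", "a", "c", "answer"]
rewards = step_rewards(graph, trajectory, ["answer"])
print(rewards)  # [-1.0, 1.0, 1.0, 1.0, 1.0]
```

The detour to `b` earns a negative reward and every later step earns a positive one, which is the kind of per-step signal a single trajectory-level reward cannot provide.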
AI · Neutral · arXiv – CS AI · May 29 · 6/10
🧠Researchers propose a novel method for optimizing multi-agent LLM systems by decomposing credit assignment into temporal and structural components, enabling more efficient prompt optimization through targeted refinement rather than global updates. The approach uses state-space bottleneck analysis and role-based policy isolation to identify and fix weak components in collaborative AI systems, reducing computational queries while improving reasoning performance across benchmarks.
AI · Bullish · arXiv – CS AI · May 29 · 6/10
🧠Researchers present Graph-Enhanced Policy Optimization (GEPO), a new training framework for multi-step LLM agents that improves credit assignment by analyzing state-transition graphs and task relevance. The method achieves 1.1-3.8% performance gains across multiple benchmarks by differentiating the importance of individual steps and trajectories based on their structural and semantic roles.
AI · Neutral · arXiv – CS AI · May 28 · 6/10
🧠Researchers introduce TRACER, a reinforcement learning framework that enables multiple large language models to collaborate effectively on reasoning tasks by learning when to speak and what to say through turn-level decision-making. The approach addresses key challenges in multi-agent AI systems including sparse rewards, computational inefficiency, and oscillating performance, demonstrating improvements across mathematical reasoning benchmarks.
AI · Neutral · arXiv – CS AI · May 27 · 6/10
🧠StepOPSD introduces a novel reinforcement learning framework that improves credit assignment in multi-turn agent tasks by treating individual steps rather than entire trajectories as the unit of learning. The method achieves state-of-the-art results on benchmark tasks like ALFWorld and Search-QA, demonstrating that step-level preference distillation is particularly effective when trajectory rewards poorly correlate with individual decision quality.
AI · Bullish · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduce MemQ, a novel framework that applies Q-learning eligibility traces to episodic memory in large language model agents, enabling credit assignment across memory dependencies recorded in provenance DAGs. The approach achieves superior performance across six diverse benchmarks, with gains up to 5.7 percentage points on multi-step tasks requiring deep memory chains.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduce PiCA (Pivot-Based Credit Assignment), a novel reinforcement learning mechanism that improves how LLM-based search agents learn from long sequences of actions. By identifying key pivot steps and anchoring rewards to final task outcomes, PiCA addresses critical challenges in credit assignment, delivering 15.2% performance gains on knowledge-intensive QA tasks.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduce Verifiable Process Rewards (VPR), a framework that enhances reinforcement learning for large language models by providing dense, intermediate-level feedback during reasoning tasks rather than relying solely on sparse outcome-level rewards. The approach leverages symbolic, algorithmic, and probabilistic verification methods to improve credit assignment in long-horizon agentic reasoning, with theoretical and empirical validation across multiple benchmarks.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduce EGL-SCA, a framework for graph reasoning agents that jointly optimizes both natural language instructions and computational tools through structural credit assignment. The system achieves 92.0% success rate on graph reasoning benchmarks by precisely routing failures to either prompt optimization or tool synthesis, outperforming isolated improvement approaches.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce Structured Role-Aware Policy Optimization (SRPO), a reinforcement learning method that improves multimodal AI reasoning by assigning credit to different token types based on their functional roles. The approach enhances vision-language models' ability to ground answers in visual evidence without requiring external reward models, advancing more reliable multimodal reasoning systems.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers present a new approach to training CLI agents through reinforcement learning, introducing σ-Reveal for selective observation and A³ for credit assignment. The work addresses fundamental challenges in teaching AI systems to interact with command-line interfaces by leveraging structured action properties and proposing the ShellOps dataset for evaluation.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers propose a new mechanism for fairly distributing compensation among creators whose intellectual property appears in AI model context windows, using cooperative game theory's least core solution. The approach efficiently approximates fair value distribution while requiring significantly fewer computational resources than existing methods.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers present C3, a novel credit assignment method for cooperative multi-agent LLM systems that achieves exact causal measurement without approximation by exploiting deterministic interaction histories. The method outperforms existing baselines across six benchmarks while reducing training costs, and introduces the first method-agnostic auditing tools for evaluating multi-agent credit assignment quality.
AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠Skill1 is a unified reinforcement learning framework in which language model agents co-evolve three coupled capabilities (skill selection, utilization, and distillation) from a single task-outcome reward signal. It improves over existing baselines on complex tasks, suggesting a way for AI agents to build and reuse persistent skill libraries across diverse problem domains.
AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠Researchers introduce Owen-Shapley Policy Optimization (OSPO), a reinforcement learning algorithm that improves how language models learn from feedback by attributing credit to individual tokens rather than treating entire sequences as atomic units. The method addresses a fundamental training gap in generative AI systems used for recommendation tasks, showing measurable improvements on real e-commerce datasets.
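Attributing credit to parts of a sequence rather than to the whole can be illustrated with plain Shapley values, sketched below. This is not OSPO itself: the Owen value is a generalization of the Shapley value to players grouped into coalitions, and the summary does not say how the paper applies it. The reward function and the token names are hypothetical, and the exact computation shown here enumerates all orderings, so it is practical only for a handful of players.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_reward(coalition):
    """Hypothetical reward: 'brand' helps alone, 'size' and 'color' only together."""
    score = 0.0
    if "brand" in coalition:
        score += 1.0
    if {"size", "color"} <= coalition:
        score += 2.0
    return score


credit = shapley_values(["brand", "size", "color"], toy_reward)
print(credit)  # {'brand': 1.0, 'size': 1.0, 'color': 1.0}
```

The credits sum to the reward of the full set (3.0), and the 2.0 earned jointly by `size` and `color` is split evenly between them.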
AI · Neutral · arXiv – CS AI · May 4 · 6/10
🧠PORTool is a new policy-optimization algorithm that improves how AI agents learn to use external tools by solving the credit-assignment problem in multi-step reasoning tasks. The method uses a rewarded tree structure to assign rewards at individual steps rather than only at outcomes, enabling agents to achieve higher accuracy while reducing unnecessary tool calls.
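One way to picture step-level rewards derived from a tree of rollouts is the sketch below. It is an illustration under stated assumptions, not PORTool's algorithm: rollouts that share a prefix of steps are treated as sharing a tree node, a node's value is assumed to be the mean outcome reward of the rollouts passing through it, and a step's advantage is assumed to be the change in value it causes. The step names and rewards are hypothetical.

```python
from collections import defaultdict


def prefix_values(rollouts):
    """Map each step prefix (tree node) to the mean outcome reward beneath it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for steps, reward in rollouts:
        for i in range(len(steps) + 1):
            prefix = tuple(steps[:i])
            sums[prefix] += reward
            counts[prefix] += 1
    return {prefix: sums[prefix] / counts[prefix] for prefix in sums}


def step_advantages(steps, values):
    """Advantage of each step: value of the child node minus value of its parent."""
    return [
        values[tuple(steps[: i + 1])] - values[tuple(steps[:i])]
        for i in range(len(steps))
    ]


# Hypothetical rollouts: (list of steps, outcome reward).
rollouts = [
    (["search", "read", "answer"], 1.0),
    (["search", "answer"], 0.0),
    (["answer"], 0.0),
    (["guess"], 0.0),
]
values = prefix_values(rollouts)
advantages = step_advantages(["search", "read", "answer"], values)
print(advantages)  # [0.25, 0.5, 0.0]
```

In this toy data the `read` step receives the largest credit because rollouts diverge in outcome at that branch, while the final `answer` step adds nothing beyond what its prefix already predicted.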
AI · Neutral · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers propose T-STAR, a novel reinforcement learning framework that structures multi-step agent trajectories as trees rather than independent chains, enabling better credit assignment for LLM agents. The method uses tree-based reward propagation and surgical policy optimization to improve reasoning performance across embodied, interactive, and planning tasks.
AI · Neutral · arXiv – CS AI · Mar 11 · 4/10
🧠Researchers propose CORA, a new cooperative game-theoretic method for credit assignment in multi-agent reinforcement learning that uses coalition-wise advantage allocation. The approach addresses policy optimization challenges by evaluating marginal contributions of different agent coalitions and demonstrates superior performance across various benchmarks.