🧠 AI | ⚪ Neutral | Importance 6/10

ARCO: Adaptive Rubric with Co-Evolution for Multi-Step LLM-Based Agents

arXiv – CS AI | Zihang Tian, Jingsen Zhang, Rui Li, Xiaohe Bo, Yuanzi Li, Xu Chen | June 23, 2026 at 04:00 AM

🤖AI Summary

ARCO introduces an adaptive rubric framework that enables large language model agents to receive step-level interpretable rewards during multi-step reasoning tasks. By jointly evolving the reward rubric and policy through co-training, the method achieves stronger performance on question-answering benchmarks while providing explainable feedback that clarifies why each step in a trajectory succeeds or fails.

Analysis

ARCO addresses a fundamental limitation in reinforcement learning for LLM agents: scalar reward signals are opaque, reporting whether a trajectory succeeded without explaining why. Traditional rubric-based approaches score entire trajectories with closed-source judges, which prevents granular credit assignment and leaves the reward mechanism static. ARCO tackles both constraints by decomposing rewards to the step level and optimizing the rubric and the policy jointly.

The framework's core contribution lies in decomposing trajectory-level outcomes into step-level rewards via a constraint that ties cumulative step scores to terminal success. A shared backbone model branches into generation and scoring heads, enabling the system to dynamically produce per-step evaluation criteria while simultaneously learning to score actions against those criteria. This co-evolution ensures the rubric adapts to the policy's behavior rather than remaining frozen, creating a feedback loop where criterion specificity improves as the agent learns.
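The summary does not give the exact form of the constraint, so the following is only a minimal sketch, assuming the constraint is that per-step scores sum to the terminal outcome. The function names `decomposition_penalty` and `project_to_outcome` and the example numbers are hypothetical and are not taken from the ARCO code release.

```python
from typing import List


def decomposition_penalty(step_scores: List[float], terminal_reward: float) -> float:
    """Squared gap between the summed step scores and the terminal outcome.

    Assumption: a soft version of the constraint could be trained by
    minimising this value; ARCO's actual objective may differ.
    """
    return (sum(step_scores) - terminal_reward) ** 2


def project_to_outcome(step_scores: List[float], terminal_reward: float) -> List[float]:
    """Shift every step score by the same amount so the total matches the outcome."""
    if not step_scores:
        raise ValueError("trajectory must contain at least one step")
    gap = (terminal_reward - sum(step_scores)) / len(step_scores)
    return [score + gap for score in step_scores]


raw_scores = [0.2, 0.5, 0.1]  # hypothetical scores from a scoring head
outcome = 1.0                 # terminal success, e.g. a correct final answer
adjusted = project_to_outcome(raw_scores, outcome)
```

The uniform shift keeps the relative ordering of steps, so the step judged weakest before the adjustment is still the weakest afterwards. No per-step labels are needed: only the terminal outcome supervises the total.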

Benchmark results on HotpotQA, 2WikiMultiHopQA, and MuSiQue show consistent improvements over outcome-based, static-rubric, and process-reward baselines using open-source models. Beyond performance, interpretability matters for agent reliability: step-specific rubrics let practitioners diagnose failure modes and see where agents struggle in multi-hop reasoning chains.
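To illustrate how step-level feedback might support that kind of diagnosis, here is a small sketch. It assumes each step carries a generated criterion and a score. The `StepFeedback` record, the `weakest_step` helper, and the sample criteria and scores are all invented for illustration and do not reflect ARCO's actual data format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepFeedback:
    step: int        # position of the step in the trajectory
    criterion: str   # rubric text generated for this step (hypothetical)
    score: float     # score of the action against that criterion


def weakest_step(feedback: List[StepFeedback]) -> StepFeedback:
    """Return the lowest-scoring step, a likely starting point for diagnosis."""
    if not feedback:
        raise ValueError("trajectory must contain at least one step")
    return min(feedback, key=lambda item: item.score)


# Hypothetical three-hop trajectory with made-up criteria and scores.
trajectory = [
    StepFeedback(1, "Query targets the first-hop entity", 0.9),
    StepFeedback(2, "Retrieved passage supports the bridge fact", 0.2),
    StepFeedback(3, "Final answer follows from the gathered evidence", 0.6),
]

worst = weakest_step(trajectory)
print(f"Step {worst.step} scored lowest: {worst.criterion}")
```

Because the criterion travels with the score, the output names what the weak step was expected to do, not just that it scored poorly.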

The code and data are available, which should ease adoption by the research community. For practitioners building LLM-based systems that need explainability and reliable multi-step reasoning, ARCO offers a method that reconciles performance with interpretability. Its reported generalization across different backbone models suggests it may apply beyond question answering to broader agentic workflows.

Key Takeaways

→ARCO enables step-level credit assignment in LLM agents through trajectory decomposition constraints that sum to terminal outcomes without requiring per-step labels.
→Co-evolving the rubric generation and scoring functions lets reward criteria adapt during training rather than remain static.
→Consistent benchmark improvements across three QA datasets with open-source models indicate the method's robustness and practical applicability.
→Step-specific rubrics provide interpretable explanations for agent decisions, enabling better diagnosis of multi-step reasoning failures.
→The open-source release of code and data facilitates community adoption and further research into adaptive reward mechanisms for agentic systems.