Planner-Centric Reinforcement Learning for Deep Research with Structure-Aware Reward
Researchers introduce DecomposeR, a framework that trains language models to conduct deep research by explicitly representing plans as directed acyclic graphs rather than flat trajectories. The approach separates planning and execution into two distinct reinforcement learning stages, improving long-form answer generation by 5.1-8.0 points over comparable baselines on benchmark datasets.
DecomposeR addresses a fundamental challenge in using large language models for complex research tasks: the difficulty of training models to plan effectively when planning and execution are entangled in monolithic trajectories. Traditional approaches either oversimplify tasks into short-form QA pairs or optimize entire research sequences as single units, making it hard to isolate what the model learns about planning versus execution. By structuring research plans as typed DAGs, the framework creates explicit, inspectable representations that can be independently optimized.
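To make the idea of a typed DAG plan concrete, here is a minimal sketch of one possible representation. The summary above does not describe DecomposeR's actual schema, so the `PlanNode` class, its field names, the node types ("search", "synthesize"), and the `execution_order` helper are all hypothetical. The sketch shows only that such a plan can be inspected and validated independently of execution.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter


@dataclass
class PlanNode:
    """One step of a research plan (hypothetical schema, not from the paper)."""
    node_id: str
    node_type: str  # assumed type labels, e.g. "search" or "synthesize"
    query: str
    depends_on: list[str] = field(default_factory=list)


def execution_order(nodes: list[PlanNode]) -> list[str]:
    """Return node ids so that every node comes after its dependencies.

    Raises ValueError if a dependency is unknown or the plan has a cycle,
    meaning the plan is not a valid DAG.
    """
    known = {n.node_id for n in nodes}
    for n in nodes:
        missing = set(n.depends_on) - known
        if missing:
            raise ValueError(f"{n.node_id} depends on unknown nodes: {sorted(missing)}")
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {n.node_id: set(n.depends_on) for n in nodes}
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError("plan contains a cycle") from exc


# Illustrative plan: two independent searches feeding one synthesis step.
example_plan = [
    PlanNode("q1", "search", "background on topic A"),
    PlanNode("q2", "search", "recent results on topic B"),
    PlanNode("s1", "synthesize", "compare A and B", depends_on=["q1", "q2"]),
]
print(execution_order(example_plan))
```

Because the plan is plain data, a structural check like this can run before any retrieval or generation happens, which is what makes the planning stage separately inspectable.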
This approach reflects a broader trend toward AI systems that decompose complex problems into interpretable stages. Recent work on chain-of-thought reasoning, tool use, and modular architectures suggests that explicit structure improves both performance and debuggability. DecomposeR extends this principle by making planning tokens directly rewardable, which enables finer-grained credit assignment during reinforcement learning. Training proceeds in two stages: the first optimizes graph structure and query decomposition, and the second optimizes branch-level execution. This ordering resembles the way a human researcher outlines an investigation before working through each line of inquiry.
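One way to picture rewarding planner tokens rather than a whole trajectory is a masked policy-gradient loss. The sketch below is an assumption for illustration only: the function name, the boolean mask, and the REINFORCE-style surrogate are not taken from the paper, whose actual objective is not given in this summary.

```python
def planner_only_loss(token_logprobs, is_planner_token, advantage):
    """REINFORCE-style surrogate loss averaged over planner tokens only.

    token_logprobs: log-probability of each generated token under the policy.
    is_planner_token: booleans marking tokens that belong to the plan
        (hypothetical mask; how tokens are tagged is an assumption).
    advantage: scalar reward signal for the plan, e.g. reward minus a baseline.
    """
    if len(token_logprobs) != len(is_planner_token):
        raise ValueError("token_logprobs and is_planner_token must have equal length")
    # Keep only the log-probs of planner tokens; execution tokens get no gradient.
    selected = [lp for lp, keep in zip(token_logprobs, is_planner_token) if keep]
    if not selected:
        return 0.0
    # Minimizing this raises the probability of planner tokens when advantage > 0.
    return -advantage * sum(selected) / len(selected)


# Three tokens, of which the first and third are planner tokens.
loss = planner_only_loss([-1.0, -2.0, -3.0], [True, False, True], advantage=0.5)
print(loss)  # 1.0
```

With the mask in place, the middle token contributes nothing to the loss, so the reward signal reaches only the tokens that express the plan.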
The results on long-form benchmarks suggest meaningful gains in research quality, which is particularly relevant for applications that synthesize multiple information sources. For developers building AI research assistants, retrieval systems, or knowledge synthesis tools, this work provides a trainable blueprint for structured reasoning. The 5.1-8.0 point improvements suggest that the approach improves planning substantively rather than incrementally. As models scale, structured planning mechanisms may become increasingly important for maintaining reasoning quality and interpretability across longer, more complex tasks.
- DecomposeR separates planning from execution using directed acyclic graphs, enabling better credit assignment in reinforcement learning for research tasks.
- Two-stage training optimizes graph structure first, then branch-level execution, improving long-form answer quality by 5.1-8.0 points on benchmarks.
- Structured planning representations make model reasoning more interpretable and debuggable compared to flat trajectory optimization.
- The framework applies reinforcement learning rewards to explicit planner tokens rather than entire trajectories, improving training signal quality.
- Results demonstrate that decomposing complex research into explicit planning stages outperforms end-to-end training on deep reasoning tasks.