Plan Before Search: Search Agents Need Plan
Researchers show that large language models trained as retrieval-augmented agents benefit from explicit planning, that is, decomposing a question into ordered sub-questions before searching, rather than reacting to whatever documents each search returns. They introduce a self-bootstrapping training paradigm in which smaller seed models generate filtered trajectories that activate this planning behavior across model sizes, without distillation from larger external models.
This research addresses a fundamental challenge in agentic AI systems: the gap between how language models naturally process information and how they should approach multi-step reasoning tasks. Traditional approaches combine reinforcement learning with supervised fine-tuning on data from stronger models, but the study argues that this recipe overlooks skill dependencies and model-specific training dynamics.
The core insight centers on planning as a structured behavior that prevents agents from drifting through search space based on superficially relevant documents. By establishing a predetermined sequence of sub-questions, the agent maintains coherence across multi-hop retrieval tasks. This contrasts with reactive approaches where each search step influences subsequent reasoning without explicit strategic direction.
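The plan-first behavior described above can be sketched as a loop in which the full list of sub-questions is fixed before any retrieval happens. This is a minimal illustration, not the authors' implementation: `plan_then_search`, `Trace`, the toy planner, and the toy index are all hypothetical names introduced here, and a real agent would replace the stand-ins with an LLM call and a search backend.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Trace:
    """Record of one plan-first episode (hypothetical structure)."""
    plan: List[str]
    evidence: List[str] = field(default_factory=list)


def plan_then_search(
    question: str,
    planner: Callable[[str], List[str]],
    retriever: Callable[[str], List[str]],
) -> Trace:
    # The whole plan is fixed before any document is retrieved,
    # so retrieved text cannot reorder or replace sub-questions.
    trace = Trace(plan=list(planner(question)))
    for sub_question in trace.plan:
        docs = retriever(sub_question)
        # Keep only the top document; an empty string marks a miss.
        trace.evidence.append(docs[0] if docs else "")
    return trace


# Toy stand-ins (assumptions): a real system would query an LLM and an index.
TOY_INDEX: Dict[str, List[str]] = {
    "Who directed Inception?": ["Inception was directed by Christopher Nolan."],
    "Where was Christopher Nolan born?": ["Christopher Nolan was born in London."],
}


def toy_planner(question: str) -> List[str]:
    # Returns the ordered sub-questions; dicts preserve insertion order.
    return list(TOY_INDEX)


def toy_retriever(query: str) -> List[str]:
    return TOY_INDEX.get(query, [])


trace = plan_then_search(
    "Where was the director of Inception born?", toy_planner, toy_retriever
)
print(trace.plan)
print(trace.evidence)
```

The toy plan hard-codes the entity in its second sub-question for brevity; a real planner would more likely leave a placeholder to be filled from the first answer. A reactive agent, by contrast, would choose each next query from the documents just retrieved, which is where the drift described above can enter.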
The research finds that identical reward signals produce different failure modes across model families, indicating that successful training requires three conditions to hold together: sufficient initial entropy, training stability, and prerequisite sub-skills. This finding challenges the assumption that reward design alone determines training outcomes. In the proposed self-bootstrapping solution, smaller seed models generate trajectories that activate planning in target models. This removes the need for distillation from stronger external models, reducing both computational overhead and reliance on model hierarchies.
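The trajectory-filtering step of self-bootstrapping might look like the sketch below. The summary does not state the actual filter criteria, so the two rules used here are assumptions made for illustration: a trajectory is kept only if a plan step precedes the first search step and the final answer matches the reference. The `Trajectory` type and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    """One seed-model rollout (hypothetical structure)."""
    actions: List[str]  # e.g. ["plan", "search", "search", "answer"]
    answer: str
    gold: str


def plans_before_search(traj: Trajectory) -> bool:
    # Assumed rule: a plan step must exist and come before any search step.
    if "plan" not in traj.actions:
        return False
    if "search" not in traj.actions:
        return True
    return traj.actions.index("plan") < traj.actions.index("search")


def is_correct(traj: Trajectory) -> bool:
    # Assumed rule: case-insensitive exact match against the reference answer.
    return traj.answer.strip().lower() == traj.gold.strip().lower()


def filter_trajectories(trajs: List[Trajectory]) -> List[Trajectory]:
    # Keep only rollouts that both plan first and answer correctly.
    return [t for t in trajs if plans_before_search(t) and is_correct(t)]


samples = [
    Trajectory(["plan", "search", "search", "answer"], "London", "london"),
    Trajectory(["search", "plan", "answer"], "London", "London"),  # plans too late
    Trajectory(["plan", "search", "answer"], "Paris", "London"),   # wrong answer
]
kept = filter_trajectories(samples)
print(len(kept))
```

The kept trajectories would then serve as training data for the target model; how they are weighted or mixed with reinforcement learning is not specified in this summary.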
For AI development, this represents progress toward more efficient training paradigms for agentic systems. The consistency of results across 3B to 14B parameter models suggests the method holds up across that size range. The approach may influence how developers design reinforcement learning pipelines for retrieval-augmented systems, potentially reducing training costs while improving the reliability of multi-step reasoning.
- Explicit planning through pre-decomposed sub-questions outperforms reactive retrieval-driven reasoning in multi-hop QA tasks
- Training efficacy depends on model-specific feasibility conditions beyond reward design, including initial entropy and skill prerequisites
- Self-bootstrapping from smaller seed models activates planning behavior without requiring distillation from larger external models
- Different model sizes exhibit qualitatively different RL failure modes under identical reward signals
- The approach consistently outperforms baselines across multiple model families ranging from 3B to 14B parameters