Divide and Cooperate: Role-Decomposed Multi-Agent LLM Training with Cross-Agent Learning Signals
Researchers propose DAC (Divide and Cooperate), a multi-agent training framework that separates evidence retrieval and answer generation into two specialized agents with cross-agent learning signals. This approach addresses credit assignment problems in language models performing multi-step reasoning and achieves competitive performance using parameter-efficient LoRA modules, outperforming full fine-tuning baselines on QA benchmarks.
DAC changes how large language models can be trained for complex reasoning tasks. Its core observation is that coupling evidence acquisition with answer generation forces a single model to pursue conflicting objectives, which creates inefficiencies in both training and inference. The framework splits the problem between two specialized agents: a searcher focused on evidence retrieval, and a generator that produces answers and verifies whether the evidence is sufficient. This split enables cleaner credit assignment and more efficient exploration of the policy space.
This approach addresses a fundamental challenge in reinforcement learning for language models: determining which component of a multi-step process deserves credit or blame when final performance varies. Traditional monolithic models struggle to distinguish whether poor outputs stem from inadequate search or weak generation. DAC's generator provides explicit abstention signals when evidence proves insufficient, directly informing the searcher's reward function. Simultaneously, the searcher's hard-positive evidence augmentation exposes the generator to challenging scenarios, creating bidirectional improvement mechanisms.
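To make the abstention signal concrete, here is a minimal sketch of how a generator's decision could shape the searcher's reward. It is an illustration under stated assumptions, not the reward defined by the DAC authors: the names `GeneratorOutput` and `searcher_reward`, the exact-match scoring, and the penalty value are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GeneratorOutput:
    """Hypothetical generator result; answer is None when the generator abstains."""
    answer: Optional[str]


def searcher_reward(output: GeneratorOutput, gold: str,
                    abstain_penalty: float = -0.5) -> float:
    """Assumed reward shaping for the searcher (not taken from the paper).

    - Abstention means the generator judged the evidence insufficient,
      so the blame is assigned to the searcher.
    - A correct answer means the retrieved evidence was good enough.
    - A wrong answer given without abstaining is treated as neutral for
      the searcher, since the generator accepted the evidence.
    """
    if output.answer is None:
        return abstain_penalty
    is_correct = output.answer.strip().lower() == gold.strip().lower()
    return 1.0 if is_correct else 0.0


# Usage: three outcomes for the same question.
rewards = [
    searcher_reward(GeneratorOutput(None), "Paris"),       # abstained
    searcher_reward(GeneratorOutput(" paris "), "Paris"),  # correct
    searcher_reward(GeneratorOutput("Lyon"), "Paris"),     # wrong, no abstention
]
```

The point of the sketch is the separation of blame: only the abstention case penalizes the searcher, so a retrieval failure is no longer indistinguishable from a generation failure.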
The efficiency gains are particularly notable for practical deployment. By implementing the system through parameter-efficient LoRA modules over a shared backbone rather than full fine-tuning, DAC reduces computational overhead while maintaining performance advantages. This matters for organizations seeking to deploy specialized reasoning agents at scale without proportional increases in model parameters. The framework's effectiveness across general and multi-hop QA benchmarks suggests broader applicability beyond single-domain questions.
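The parameter savings can be illustrated with simple arithmetic. The sketch below compares two LoRA adapters on one shared weight matrix against two fully fine-tuned copies of that matrix. The layer width and rank are assumed values chosen for illustration, not figures reported for DAC, and the helper names are hypothetical.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters in one dense weight matrix of shape (d_out, d_in)."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA adapter: A is (rank, d_in), B is (d_out, rank)."""
    return rank * (d_in + d_out)


# Assumed sizes for illustration only.
d_model = 4096
rank = 16

# Two agents (searcher and generator), each with its own adapter
# on top of a single frozen, shared weight matrix.
trainable_lora = 2 * lora_params(d_model, d_model, rank)

# Baseline: each agent gets its own fully fine-tuned copy of the matrix.
trainable_full = 2 * full_params(d_model, d_model)

ratio = trainable_lora / trainable_full
print(f"LoRA trainable params:  {trainable_lora:,}")
print(f"Full fine-tune params:  {trainable_full:,}")
print(f"Fraction trained:       {ratio:.4%}")
```

Under these assumed sizes, the two adapters together train under 1% of the parameters that two full copies would require for this one matrix. The actual ratio for any real model depends on which layers receive adapters and on the chosen rank.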
Future research may explore whether this role-decomposition pattern extends to other multi-step reasoning domains, including planning, code generation, and scientific discovery tasks where credit assignment similarly complicates training.
- DAC decomposes language agent training into specialized searcher and generator roles with cross-agent learning signals for improved credit assignment.
- The generator's abstention mechanism provides explicit feedback about evidence sufficiency, enabling more precise reward signals for the search agent.
- Parameter-efficient LoRA implementation reduces computational costs while achieving performance gains over full fine-tuning approaches.
- Hard-positive evidence augmentation from the searcher improves generator robustness across diverse retrieval scenarios.
- Framework demonstrates effectiveness on multi-hop QA tasks, suggesting potential applications beyond single-domain question answering.