Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs
Researchers propose a critique-and-routing controller for multi-agent LLM systems that iteratively refines outputs through sequential decision-making rather than one-shot routing. The method uses reinforcement learning with agent-utilization constraints to approach the performance of the strongest agent while relying on any single agent for fewer than 25% of inference calls, advancing coordination efficiency in heterogeneous AI systems.
This research addresses a fundamental limitation in current multi-agent LLM architectures: static routing decisions that prevent iterative improvement. Traditional controllers select a single model and return its output immediately, missing opportunities for refinement through critique and rerouting. The proposed critique-and-routing controller reformulates coordination as a sequential decision problem, allowing the system to evaluate intermediate drafts, determine whether to continue refining, and select the optimal next agent for further work.
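The refinement loop described above can be sketched as a simple control flow: the controller inspects the current draft, decides whether to stop or continue, and if continuing, picks the next agent. The function and interface names below are illustrative assumptions, since the summary does not specify the paper's actual APIs.

```python
def critique_and_route(task, agents, controller, max_rounds=4):
    """Iterative critique-and-routing loop (illustrative sketch).

    `agents` maps agent names to callables that take (task, draft) and
    return a refined draft; `controller` inspects the current draft and
    returns either ("stop",) or ("route", agent_name). All interfaces
    here are hypothetical, not the paper's actual method.
    """
    draft = None
    for round_idx in range(max_rounds):
        # The controller evaluates the intermediate draft and decides
        # whether further refinement is worth another agent call.
        action = controller(task, draft, round_idx)
        if action[0] == "stop":
            break
        # Route to the selected agent, which refines (or, on the first
        # round, produces) the draft.
        draft = agents[action[1]](task, draft)
    return draft
```

In a trained system, `controller` would be the learned policy; here it is any callable with that signature, which also makes the loop easy to unit-test with toy agents.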
The technical approach formulates this challenge as a finite-horizon Markov Decision Process with explicit constraints on agent utilization—a practical consideration that prevents overuse of expensive or specialized models. By optimizing controller decisions via policy gradients under a Lagrangian-relaxed objective, the authors enable adaptive routing that balances quality with computational efficiency. This builds on recent trends in agentic AI systems that move beyond simple routing toward dynamic orchestration.
The empirical results demonstrate substantial practical value. Across seven reasoning benchmarks and multiple heterogeneous model configurations, the method consistently outperforms existing baselines while using any single agent for fewer than 25% of total inference calls. This efficiency gain matters significantly for production deployments where inference costs directly impact operational expenses and latency constraints.
The work reflects growing recognition that LLM system design requires intelligent coordination mechanisms beyond prompt engineering. As organizations deploy increasingly complex multi-model architectures—combining specialized models for different reasoning types, modalities, or domains—controllers that enable iterative refinement become critical infrastructure. Future research likely extends these techniques to handle dynamic agent pools, heterogeneous cost structures, and real-time quality metrics.
- Critique-and-routing controllers enable iterative refinement in multi-agent LLM systems instead of static one-shot routing decisions.
- Formulating agent coordination as an MDP with utilization constraints optimizes both output quality and computational efficiency.
- The method achieves near-strongest-agent performance while reducing per-agent call frequency to under 25% of total inference.
- Policy gradient optimization under Lagrangian relaxation provides a scalable framework for balancing multiple competing objectives.
- Results across seven benchmarks demonstrate consistent improvements over existing multi-agent coordination baselines.