The Confident Liar: Diagnosing Multi-Agent Debate with Log-Probabilities and LLM-as-Judge
Researchers analyze multi-agent debate systems by examining whether internal confidence signals (log-probabilities) correlate with external assessments of reasoning quality and with task accuracy. The study reveals a significant asymmetry between debate roles: confidence metrics predict reasoning quality about twice as strongly for the Constructor agent as for the Auditor agent, suggesting that debate systems may have inherent structural biases.
This research addresses a critical gap in evaluating large language model debate systems, which typically focus only on final answer correctness while ignoring the quality of intermediate reasoning—the purported value proposition of multi-agent debate. By correlating three distinct signals—token-level log-probabilities, LLM-as-judge rubric scores, and task accuracy—the authors create a framework for diagnosing when and why debate systems succeed or fail at producing reliable reasoning.
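As a rough illustration of what correlating a confidence signal with judge scores can look like, the sketch below reduces each debate turn to its mean token log-probability and computes a rank correlation against rubric scores. The choice of mean log-probability, the use of Spearman correlation, and all function names and sample values are assumptions made for this example; the summary does not state which aggregation or correlation statistic the authors use.

```python
from statistics import mean


def mean_logprob(token_logprobs):
    """Average token log-probability of one turn (closer to 0 = more confident)."""
    return mean(token_logprobs)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: token log-probs for four turns and their judge scores.
turn_logprobs = [
    [-0.10, -0.20, -0.05],
    [-1.20, -0.90, -1.50],
    [-0.40, -0.30, -0.60],
    [-2.00, -1.80, -2.20],
]
judge_scores = [5, 2, 4, 1]  # illustrative rubric scores, not from the paper

confidences = [mean_logprob(t) for t in turn_logprobs]
rho = spearman(confidences, judge_scores)
```

Running the same computation separately for each role would give one correlation per role, which is the kind of comparison behind the reported asymmetry.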
The findings reveal a fundamental asymmetry in how the debate roles function. The Constructor, responsible for generating the initial reasoning, shows confidence signals that align closely with judged reasoning quality, roughly twice as strongly as the Auditor's, and it identifies critical reasoning failures well (AUROC 0.804). The Auditor, designed to scrutinize and improve proposals, shows a weaker confidence-to-quality correlation and poor failure detection (AUROC 0.634). This suggests that current multi-agent architectures may not leverage the auditing function as intended.
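For readers unfamiliar with the metric, AUROC here can be read as the probability that a randomly chosen turn with a critical reasoning failure receives a higher "failure score" than a randomly chosen turn without one; 0.5 is chance and 1.0 is perfect separation. The sketch below computes it by direct pairwise comparison. Using negated confidence as the failure score, along with the function name and sample values, is an assumption for illustration and not necessarily how the paper defines its detector.

```python
def auroc(scores, labels):
    """Pairwise AUROC: P(score of a positive > score of a negative), ties count 0.5.

    `labels` are truthy for positives (here: turns with a critical failure).
    """
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical per-turn mean log-probabilities and failure labels.
confidences = [-0.2, -1.5, -0.4, -2.1, -0.9]
has_critical_failure = [0, 1, 0, 1, 0]

# Assumption: low confidence should flag failures, so negate it to get a score.
failure_scores = [-c for c in confidences]
detector_auroc = auroc(failure_scores, has_critical_failure)
```

The quadratic loop is fine for small evaluation sets; a rank-based formulation gives the same value more efficiently on large ones.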
These insights carry implications for deploying debate systems in high-stakes domains. If one agent's confidence signals are unreliable predictors of reasoning quality, practitioners cannot safely use internal model confidence as a proxy for output reliability. The consistent four-phase confidence trajectory across rubric-scoring tasks suggests debate systems may exhibit predictable failure modes that could be proactively monitored. However, the paper's findings are currently limited to initial domain testing. Validation on mathematical reasoning and factual QA is still needed to establish generalizability and to determine whether these asymmetries persist across problem types.
- Multi-agent debate systems show role-dependent confidence-to-quality alignment, with Constructor agents' internal signals predicting reasoning quality twice as reliably as Auditor agents'
- Token-level log-probabilities alone cannot serve as reliable indicators of reasoning quality, particularly for auditing roles
- Critical reasoning failure detection varies dramatically by role (AUROC 0.804 for Constructor vs 0.634 for Auditor), indicating structural vulnerabilities in current debate architectures
- Four-phase confidence trajectories observed in debate suggest systematic patterns that could enable proactive error detection if validated across domains
- Intermediate reasoning quality assessment requires multi-signal evaluation combining confidence metrics, LLM-as-judge rubrics, and task outcomes for robust system diagnosis