Who can we trust? LLM-as-a-jury for Comparative Assessment

arXiv – CS AI | Mengjie Qian, Guangzhi Sun, Mark J. F. Gales, Kate M. Knill | May 29, 2026 at 04:00 AM

🤖 AI Summary

Researchers propose BT-sigma, a method for aggregating Large Language Model (LLM) judgments in comparative evaluations that accounts for varying judge reliability without requiring human supervision. By modeling each judge's discriminative capability, the approach acts as an unsupervised calibration mechanism and, according to the authors, ranks items more accurately than simple averaging of judgments.

Analysis

The research addresses a fundamental challenge in AI evaluation: LLM judges behave inconsistently and show biases across assessment tasks, yet most evaluation frameworks treat all judges as equally reliable. This inconsistency undermines the validity of natural language generation (NLG) assessments, which increasingly rely on LLM-based comparative judgments as a cost-effective alternative to human evaluation. The BT-sigma model extends the classical Bradley-Terry ranking framework with a discriminator parameter for each judge, so that item quality and judge reliability are inferred jointly from pairwise comparisons alone.
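To make the extension concrete, the classical Bradley-Terry model and one plausible judge-aware form can be written as follows. This is a sketch under assumptions: the summary does not give the paper's exact formula, so the per-judge scale \sigma_k and the way it enters the model are illustrative choices, not the authors' stated parameterization.

```latex
% Classical Bradley-Terry: probability that item i is preferred to item j,
% where s_i and s_j are latent quality scores
P(i \succ j) = \frac{1}{1 + \exp\bigl(-(s_i - s_j)\bigr)}

% Assumed judge-aware form: judge k has its own scale \sigma_k > 0
P(i \succ j \mid k) = \frac{1}{1 + \exp\bigl(-(s_i - s_j)/\sigma_k\bigr)}
```

Under this assumed form, a small \sigma_k describes a sharp judge whose preferences track the quality gap closely, while a large \sigma_k pushes that judge's probabilities toward 0.5, so its comparisons carry little weight in the fit. Setting every \sigma_k to the same value recovers the classical model up to a rescaling of the scores.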

The significance extends beyond academic evaluation. As LLM-based systems proliferate across production environments—from content moderation to autonomous decision-making—the reliability of their self-evaluation mechanisms becomes critical infrastructure. Current industry practice often aggregates multiple LLM judgments with equal weighting, potentially amplifying systematic biases rather than mitigating them. The research demonstrates that BT-sigma's learned discriminators correlate strongly with independent consistency measures, suggesting the model captures meaningful patterns in judge behavior.
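For contrast, the equal-weighting baseline mentioned above can be sketched in a few lines. This is a minimal illustration, assuming each judge returns a probability that item i beats item j; the function name `equal_weight_ranking` and the array layout are hypothetical, not taken from the paper.

```python
import numpy as np

def equal_weight_ranking(probs):
    """Rank items by average win probability, treating all judges alike.

    probs has shape (K, N, N); probs[k, i, j] is judge k's probability
    that item i beats item j. Diagonal entries are ignored.
    (Hypothetical layout, chosen for illustration.)
    """
    probs = np.asarray(probs, dtype=float)
    mean_probs = probs.mean(axis=0)            # every judge gets the same weight
    n = mean_probs.shape[0]
    off_diag = ~np.eye(n, dtype=bool)          # mask out self-comparisons
    scores = np.where(off_diag, mean_probs, 0.0).sum(axis=1) / (n - 1)
    order = np.argsort(-scores)                # best item first
    return order, scores
```

Because the mean is taken before ranking, a judge that is systematically wrong or close to random shifts the scores exactly as much as a reliable one, which is the weakness a judge-aware model aims to address.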

For practitioners and researchers building evaluation pipelines, this work provides a practical unsupervised calibration approach that requires no labeled human judgments—a significant advantage in resource-constrained settings. The methodology applies directly to any scenario requiring pairwise comparative assessments aggregated across multiple LLM judges. The correlation between learned discriminators and cycle consistency metrics indicates the approach could serve as a diagnostic tool for identifying problematic judges in larger evaluation systems, enabling more robust AI assessment frameworks.
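One simple way to measure cycle consistency for a single judge is to count how many item triples are free of preference cycles (A over B, B over C, C over A). The sketch below assumes hard pairwise preferences and uses a hypothetical function name; the paper's own consistency metric may be defined differently.

```python
from itertools import combinations

def cycle_consistency(prefers):
    """Fraction of item triples without a preference cycle.

    prefers[i][j] is True when the judge prefers item i over item j.
    Assumed definition for illustration; not necessarily the paper's metric.
    """
    n = len(prefers)
    triples = list(combinations(range(n), 3))
    if not triples:
        return 1.0  # fewer than three items: no cycle is possible
    cyclic = 0
    for a, b, c in triples:
        # A triple is cyclic if preferences loop in either direction.
        forward = prefers[a][b] and prefers[b][c] and prefers[c][a]
        backward = prefers[b][a] and prefers[c][b] and prefers[a][c]
        if forward or backward:
            cyclic += 1
    return 1.0 - cyclic / len(triples)

# A transitive judge (0 > 1 > 2) versus a fully cyclic one (0 > 1 > 2 > 0).
transitive = [[False, True, True], [False, False, True], [False, False, False]]
cyclic = [[False, True, False], [False, False, True], [True, False, False]]
print(cycle_consistency(transitive))  # 1.0
print(cycle_consistency(cyclic))      # 0.0
```

A judge scoring well below 1.0 on a measure like this would be a natural candidate for closer inspection, which matches the diagnostic use suggested above.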

Key Takeaways

→ BT-sigma outperforms equal-weighting aggregation by modeling individual LLM judge reliability without human supervision
→ LLM judges exhibit substantial inconsistency in comparative judgments, limiting the effectiveness of direct probability-based ranking methods
→ The method enables simultaneous inference of item rankings and judge discriminative capability from pairwise comparisons alone
→ Learned discriminator parameters strongly correlate with independent measures of LLM judgment cycle consistency
→ The framework provides practical unsupervised calibration for production evaluation pipelines without requiring labeled training data
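The joint inference of item scores and judge reliability described in these takeaways can be sketched with plain gradient ascent. This is not the authors' implementation: it assumes the model P(i beats j | judge k) = sigmoid(a_k (s_i - s_j)), where a_k is a hypothetical per-judge discriminator (an inverse scale), and the function name, learning rate, and regularization are illustrative choices.

```python
import numpy as np

def fit_judge_aware_bt(judge, win, lose, n_items, n_judges,
                       steps=2000, lr=0.5, l2=1e-3):
    """Fit item scores s and judge discriminators a from (judge, winner, loser) triples.

    Assumed model: P(winner beats loser | judge k) = sigmoid(a_k * (s_w - s_l)).
    Returns (s, a). Illustrative sketch only.
    """
    judge, win, lose = (np.asarray(x) for x in (judge, win, lose))
    s = np.zeros(n_items)      # latent item scores
    b = np.zeros(n_judges)     # log-discriminators, so a = exp(b) stays positive
    m = len(judge)
    for _ in range(steps):
        a = np.exp(b)
        z = a[judge] * (s[win] - s[lose])
        p = 1.0 / (1.0 + np.exp(-z))   # predicted probability of the observed outcome
        resid = 1.0 - p                # d/dz of log sigmoid(z)
        g_s = np.zeros(n_items)
        np.add.at(g_s, win, resid * a[judge])
        np.add.at(g_s, lose, -resid * a[judge])
        g_b = np.zeros(n_judges)
        np.add.at(g_b, judge, resid * z)
        s += lr * (g_s / m - l2 * s)   # small L2 penalty keeps scores bounded
        b += lr * g_b / m
        s -= s.mean()                  # scores are only defined up to a shift
        b -= b.mean()                  # fix the overall scale: geometric mean of a is 1
    return s, np.exp(b)
```

No human labels enter the fit: a judge whose outcomes agree with the consensus ordering more often than predicted has its a_k pushed up, while a near-random judge has its a_k pushed down. The centering steps are one way to resolve the shift and scale ambiguity between s and a; other constraints would work as well.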