When Models Disagree: Rethinking LLM Evaluation for Public Comment Analysis
Researchers propose an Interpretive Audit Pipeline that uses multi-model disagreement to improve how federal agencies evaluate LLM categorization of public comments. Analysis of 1,260 USDA comments across four LLMs reveals significant interpretive divergence between models, suggesting that standard accuracy metrics alone miss critical differences in how AI systems organize policy input.
This research addresses a fundamental gap in how government agencies deploy language models for public policy analysis. When federal agencies use LLMs to categorize thousands of public comments, the model's organizational choices directly influence which arguments policymakers encounter and prioritize. Standard evaluation methods focus on accuracy against validated test sets, but this approach obscures a critical problem: different models can categorize identical inputs in materially different ways, each potentially valid yet leading to different conclusions.
The study examines this phenomenon by running 1,260 USDA docket comments through four LLMs. The findings reveal that disagreement between models exceeds the variation caused by giving different prompts to the same model, a striking result suggesting that the choice of model, more than prompt wording, drives interpretive choices. Crucially, when human experts applied standardized rubrics to resolve disagreement, the rubrics suppressed the underlying ambiguity rather than resolving it. The two-stage labeling experiment confirmed this dynamic: human annotators frequently introduced framings absent from the full AI ensemble, indicating that human judgment adds qualitatively different interpretive dimensions.
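The comparison between cross-model and within-model disagreement can be made concrete with a mean pairwise disagreement rate. The sketch below is illustrative only and is not the paper's method: the function names, category labels, and toy data are all hypothetical, and a real audit would likely use a chance-corrected agreement statistic.

```python
from itertools import combinations

def disagreement_rate(a, b):
    """Fraction of items on which two label sequences differ."""
    if len(a) != len(b):
        raise ValueError("label sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def mean_pairwise_disagreement(runs):
    """Average disagreement rate over all unordered pairs of runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to compare")
    return sum(disagreement_rate(a, b) for a, b in pairs) / len(pairs)

# Hypothetical toy data: category labels for the same four comments.
between_models = [
    ["cost", "safety", "cost", "process"],    # model A
    ["cost", "equity", "safety", "process"],  # model B
    ["equity", "safety", "cost", "cost"],     # model C
]
within_model = [
    ["cost", "safety", "cost", "process"],    # model A, prompt 1
    ["cost", "safety", "cost", "cost"],       # model A, prompt 2
    ["cost", "safety", "safety", "process"],  # model A, prompt 3
]

between = mean_pairwise_disagreement(between_models)
within = mean_pairwise_disagreement(within_model)
print(f"between-model: {between:.2f}, within-model: {within:.2f}")
```

In this toy data the between-model rate is higher than the within-model rate, which is the pattern the study reports; the numbers themselves carry no empirical weight.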
This research carries significant implications for AI governance and policy implementation. As federal agencies increasingly rely on LLMs for administrative tasks affecting millions of citizens, the ability to detect and audit interpretive disagreement becomes essential infrastructure for legitimate policymaking. The paper recommends treating disagreement as diagnostic information rather than as failure, an approach that points toward evaluation methods that could improve transparency and accountability. Future government AI deployments should incorporate disagreement-based audits alongside accuracy metrics to ensure that algorithmic categorization genuinely serves democratic policy processes.
- Multi-model disagreement on public comment categorization exceeds prompt-driven variation within a single model, indicating fundamental interpretive differences between LLMs.
- Standard accuracy-based evaluation metrics fail to detect when different models produce materially different policy-relevant categorizations of identical inputs.
- Expert rubrics intended to resolve disagreement often suppress the underlying interpretive complexity rather than resolve it, and do so without improving consistency.
- Human annotators introduce framings and interpretations absent from AI model ensembles, highlighting qualitative differences in how humans and LLMs approach interpretive tasks.
- Disagreement-based evaluation should complement accuracy metrics in government AI deployments to maintain transparency and democratic legitimacy in policy analysis.
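As a rough illustration of what a disagreement-based audit might look like alongside accuracy metrics, the sketch below flags comments on which an ensemble of models fails to reach a majority label share. It is an assumption-laden example, not the Interpretive Audit Pipeline itself: `flag_for_review`, the `min_agreement` threshold, the model names, and the labels are all hypothetical.

```python
from collections import Counter

def flag_for_review(labels_by_model, min_agreement=0.75):
    """Return indices of comments whose most common label is shared by
    less than `min_agreement` of the models (hypothetical threshold)."""
    label_lists = list(labels_by_model.values())
    n_items = len(label_lists[0])
    if any(len(labels) != n_items for labels in label_lists):
        raise ValueError("every model must label the same comments")
    flagged = []
    for i in range(n_items):
        votes = Counter(labels[i] for labels in label_lists)
        top_share = votes.most_common(1)[0][1] / len(label_lists)
        if top_share < min_agreement:
            flagged.append(i)
    return flagged

# Hypothetical labels from four models for the same three comments.
labels_by_model = {
    "model_a": ["cost", "safety", "cost"],
    "model_b": ["cost", "equity", "cost"],
    "model_c": ["cost", "safety", "process"],
    "model_d": ["cost", "process", "cost"],
}

flagged = flag_for_review(labels_by_model)
print(flagged)  # indices of comments to route to human reviewers
```

Flagged comments are treated here as candidates for human review, not as errors, in keeping with the idea that disagreement is diagnostic information.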