AI · Neutral · arXiv – CS AI · 14h ago · 6/10
When Models Disagree: Rethinking LLM Evaluation for Public Comment Analysis
Researchers propose an Interpretive Audit Pipeline that uses multi-model disagreement to improve how federal agencies evaluate LLM categorization of public comments. Analysis of 1,260 USDA comments across four LLMs reveals significant interpretive divergence between models, suggesting that standard accuracy metrics alone miss critical differences in how AI systems organize policy input.
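To make "multi-model disagreement" concrete, here is a minimal sketch of one way such divergence could be measured: the share of comments on which each pair of models agrees, plus the comments where the models do not all agree. This is not the paper's Interpretive Audit Pipeline; the function names, model names, and category labels are all hypothetical placeholders.

```python
from itertools import combinations


def pairwise_agreement(labels_by_model):
    """Fraction of items on which each pair of models assigns the same label."""
    rates = {}
    for a, b in combinations(sorted(labels_by_model), 2):
        la, lb = labels_by_model[a], labels_by_model[b]
        if not la or len(la) != len(lb):
            raise ValueError("label lists must be non-empty and of equal length")
        matches = sum(x == y for x, y in zip(la, lb))
        rates[(a, b)] = matches / len(la)
    return rates


def contested_items(labels_by_model):
    """Indices of items where the models do not all assign the same label."""
    columns = zip(*labels_by_model.values())
    return [i for i, labels in enumerate(columns) if len(set(labels)) > 1]


# Hypothetical data: three placeholder models labelling four comments.
labels = {
    "model_a": ["support", "oppose", "cost", "support"],
    "model_b": ["support", "oppose", "access", "support"],
    "model_c": ["support", "cost", "access", "oppose"],
}

agreement = pairwise_agreement(labels)
contested = contested_items(labels)
```

A model can look accurate against a single reference labelling and still split from its peers on many comments, and those contested items are the ones a disagreement-based audit would send to human review.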