arXiv CS AI
CARE: Confounder-Aware Aggregation for Reliable LLM Evaluation
Researchers introduce CARE, a framework that improves LLM evaluation by addressing correlated errors across ensembles of AI judges. The method separates true quality signals from confounding factors such as verbosity and style preferences, and reports up to a 26.8% error reduction across 12 benchmarks.