
Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge

arXiv – CS AI | Yoshinari Fujinuma
🤖 AI Summary

Researchers demonstrate that Large Language Models used as judges suffer from score range bias: their evaluation outputs are highly sensitive to the predefined scoring scale. Using contrastive decoding, they achieve up to an 11.7% relative improvement in alignment with human judgments across different score ranges.

Analysis

Large Language Models have become ubiquitous in automated evaluation tasks, yet their reliability as judges remains problematic. This research identifies and addresses a fundamental limitation: LLM evaluators show inconsistent behavior when scoring ranges change, undermining the validity of comparative assessments. The bias affects not only different models but also variants within the same model family, suggesting the issue is systemic rather than isolated.

The challenge emerges because LLM outputs are subject to contextual anchoring: the predefined score range acts as an implicit constraint that skews the model's probability distribution over scores. The bias becomes particularly pronounced when evaluating summarization quality without reference materials, since the model lacks grounding for consistent calibration. This phenomenon directly affects any application that relies on LLM-based evaluation, from research benchmarking to content moderation systems.
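
To make the anchoring effect concrete, here is a minimal sketch, not the paper's code, that probes score range bias directly: it asks the same small open-weights judge model to grade one summary on a 1-5 scale and on a 1-10 scale, reads the probabilities the model assigns to each candidate score token, and compares the range-normalized expected scores. The model name, the prompt wording, and the assumption that each score renders as a single token are all illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B-Instruct"  # assumption: any small open judge model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

def expected_score(summary: str, lo: int, hi: int) -> float:
    """Expected score under the judge's next-token distribution,
    normalized to [0, 1] so different ranges are comparable."""
    prompt = (f"Rate the following summary on a scale from {lo} to {hi}.\n"
              f"Summary: {summary}\nScore:")
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]          # next-token logits
    scores = list(range(lo, hi + 1))
    # assumption: each score renders as a single token like " 3"
    score_ids = [tok.encode(f" {s}", add_special_tokens=False)[0]
                 for s in scores]
    probs = torch.softmax(logits[score_ids], dim=-1)
    mean = sum(p.item() * s for p, s in zip(probs, scores))
    return (mean - lo) / (hi - lo)                 # range-normalized

text = "The cat sat on the mat. It was a sunny day."
print(expected_score(text, 1, 5), expected_score(text, 1, 10))
# An unbiased judge would give similar normalized scores on both scales;
# under score range bias the two values diverge.
```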

The proposed contrastive decoding solution improves correlation with human judgments by playing two model output distributions against each other to identify and suppress range-biased scores. The up to 11.7% relative improvement represents meaningful progress toward more robust evaluation systems. This advance matters to AI researchers, companies building evaluation pipelines, and anyone using LLMs to assess quality at scale.
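
The sketch below shows how a generic contrastive decoding scheme, in the spirit of Li et al. (2023), could be applied to the judge's score distribution; it is an assumption-laden illustration, not the paper's exact recipe. It treats the distribution the judge produces from a content-free prompt, which exposes only the score range, as a proxy for the anchoring prior, and subtracts its log-probabilities from those of the full prompt before picking a score.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B-Instruct"  # assumption: any small open judge model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

def score_logprobs(prompt: str, lo: int, hi: int) -> torch.Tensor:
    """Log-probabilities over the candidate score tokens lo..hi."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]          # next-token logits
    # assumption: each score renders as a single token like " 3"
    score_ids = [tok.encode(f" {s}", add_special_tokens=False)[0]
                 for s in range(lo, hi + 1)]
    return torch.log_softmax(logits[score_ids], dim=-1)

def contrastive_score(summary: str, lo: int, hi: int,
                      alpha: float = 1.0) -> int:
    """Pick the score whose log-prob rises most once the summary is shown,
    relative to a content-free prompt that only exposes the score range."""
    full = (f"Rate the following summary on a scale from {lo} to {hi}.\n"
            f"Summary: {summary}\nScore:")
    prior = (f"Rate the following summary on a scale from {lo} to {hi}.\n"
             f"Summary:\nScore:")                  # no content: range prior
    contrast = (score_logprobs(full, lo, hi)
                - alpha * score_logprobs(prior, lo, hi))
    return lo + int(contrast.argmax())             # debiased score

print(contrastive_score("The cat sat on the mat. It was sunny.", 1, 10))
```

Setting alpha to 0 recovers ordinary greedy scoring, while larger values suppress the range prior more aggressively; in practice alpha would need tuning against human-labeled judgments.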

Future work should examine whether contrastive decoding generalizes to domains beyond summarization and how it performs with proprietary models, whose token-level distributions are often inaccessible. The stability improvements could reshape how AI systems conduct self-evaluation and peer assessment, particularly in autonomous quality-control pipelines, where bias introduces compounding errors.

Key Takeaways
  • LLM judges exhibit score range bias, producing outputs highly sensitive to predefined scoring scales
  • Contrastive decoding achieves up to an 11.7% relative improvement in alignment with human judgments
  • Score range bias exists across model families, indicating a systemic rather than isolated problem
  • The bias affects direct assessment tasks without reference materials, particularly in summarization
  • Improved judge reliability directly benefits research benchmarking and automated evaluation systems