AI · Bearish · arXiv – CS AI · 15h ago · 7/10
When LLMs Benchmark Themselves: Deconstructing Self-Bias in Automated Evaluation
A research paper finds that large language models used to create and evaluate benchmarks systematically favor themselves, introducing significant bias into automated evaluation systems. The self-bias arises at both the test-generation and evaluation stages: each model's stylistic tendencies produce model-specific outputs that inflate scores, even when diversity controls are explicitly applied.