AI · Neutral · arXiv – CS AI · Apr 15 · 6/10
League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models
Researchers propose League of LLMs (LOL), a benchmark-free evaluation framework in which multiple LLMs assess one another, aiming to avoid the data contamination and evaluation bias that affect benchmark-based evaluation. In tests on eight mainstream models, the framework shows 70.7% ranking consistency and surfaces model-specific behaviors such as memorization patterns and family-based scoring bias in OpenAI models.
🏢 OpenAI