arXiv – CS AI · 4h ago
League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models
Researchers propose League of LLMs (LOL), a benchmark-free evaluation framework that uses mutual peer assessment among multiple LLMs to sidestep data contamination and evaluation bias. Testing on eight mainstream models shows 70.7% ranking consistency while uncovering model-specific behaviors such as memorization patterns and family-based scoring bias in OpenAI models.
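The paper's exact scoring protocol isn't detailed in this summary, but the core idea of mutual peer assessment can be sketched as follows. This is a minimal illustration under assumed mechanics: each model grades every other model's answers, and a model's overall score is the mean of the grades it receives from its peers, with self-grades excluded (one simple guard against the kind of family-based scoring bias the summary mentions). All model names and grade values below are fabricated for illustration.

```python
def peer_rank(grades):
    """Rank candidates by mean peer grade, excluding self-grades.

    grades[judge][candidate] -> numeric grade the judge assigns.
    Returns candidate names sorted best-first.
    """
    models = list(grades)
    scores = {}
    for cand in models:
        # Average only the grades received from *other* models.
        received = [grades[judge][cand] for judge in models if judge != cand]
        scores[cand] = sum(received) / len(received)
    return sorted(models, key=scores.get, reverse=True)

# Toy grade matrix on a 0-10 scale (hypothetical values).
grades = {
    "model_a": {"model_a": 9, "model_b": 7, "model_c": 5},
    "model_b": {"model_a": 8, "model_b": 9, "model_c": 6},
    "model_c": {"model_a": 8, "model_b": 6, "model_c": 9},
}
print(peer_rank(grades))  # ['model_a', 'model_b', 'model_c']
```

Note how each model's inflated self-grade (9) has no effect on the ranking, since only peer-assigned grades are averaged.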
OpenAI