AI · Neutral · arXiv CS AI · 8h ago · 6/10
Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation
Researchers adapted clinical psychology's Reliable Change Index to evaluate LLM performance across model versions, revealing that aggregate accuracy gains mask substantial item-level volatility. Testing Llama 3→3.1 and Qwen 2.5→3 showed bidirectional changes with large effect sizes, where improvements in low-accuracy domains offset deteriorations in high-accuracy ones, suggesting current evaluation methods underestimate model instability.
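For readers unfamiliar with the Reliable Change Index, a minimal Python sketch of its classical clinical form (Jacobson and Truax) may help. The summary does not say how the paper adapts the index to LLM items, so everything here is an assumption for illustration only: the function names, the per-item pass rates, the baseline standard deviation, and the reliability value are hypothetical and are not taken from the paper.

```python
import math


def reliable_change_index(score_before, score_after, sd_baseline, reliability):
    """Classical RCI: observed change divided by the standard error of the difference.

    sd_baseline and reliability are properties of the measurement itself;
    how the paper estimates them for LLM benchmarks is not stated in the summary.
    """
    if sd_baseline <= 0:
        raise ValueError("sd_baseline must be positive")
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in [0, 1)")
    sem = sd_baseline * math.sqrt(1.0 - reliability)  # standard error of measurement
    s_diff = math.sqrt(2.0) * sem                     # standard error of a difference score
    return (score_after - score_before) / s_diff


def classify_change(rci, threshold=1.96):
    """Label a change as reliable when |RCI| exceeds the conventional 1.96 cutoff."""
    if rci > threshold:
        return "improved"
    if rci < -threshold:
        return "deteriorated"
    return "unchanged"


# Hypothetical per-item pass rates (old version, new version); not real results.
items = {
    "item_a": (0.20, 0.80),
    "item_b": (0.90, 0.40),
    "item_c": (0.50, 0.55),
}
SD_BASELINE = 0.25   # assumed spread of item scores under the old version
RELIABILITY = 0.80   # assumed test-retest reliability of an item score

labels = {
    name: classify_change(reliable_change_index(old, new, SD_BASELINE, RELIABILITY))
    for name, (old, new) in items.items()
}
mean_before = sum(old for old, _ in items.values()) / len(items)
mean_after = sum(new for _, new in items.values()) / len(items)

if __name__ == "__main__":
    print(labels)
    print(f"mean change: {mean_after - mean_before:+.3f}")
```

In this made-up data the mean pass rate rises by only 0.05, yet one item reliably improves and another reliably deteriorates, which is the kind of bidirectional item-level movement the summary says aggregate accuracy can hide.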