AI · Neutral · arXiv – CS AI · May 16/10
Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation
Researchers adapted clinical psychology's Reliable Change Index to evaluate LLM performance across model versions, revealing that aggregate accuracy gains mask substantial item-level volatility. Comparisons of Llama 3→3.1 and Qwen 2.5→3 showed bidirectional changes with large effect sizes: improvements in low-accuracy domains offset deteriorations in high-accuracy ones, suggesting that current evaluation methods underestimate model instability.
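For readers unfamiliar with the Reliable Change Index, here is a minimal sketch of the classic clinical formulation (Jacobson and Truax): the change in a score divided by the standard error of the difference. The summary does not say how the researchers adapted it to LLM items, so the function names, the per-item accuracy inputs, and the 1.96 threshold below are illustrative assumptions, not the paper's method.

```python
import math


def reliable_change_index(score_before, score_after, sd_baseline, reliability):
    """Classic RCI: (after - before) / S_diff.

    S_diff = sqrt(2) * SEM, where SEM = sd_baseline * sqrt(1 - reliability).
    All argument names are hypothetical; in an LLM setting the scores could be,
    for example, per-item accuracy over repeated samples (an assumption).
    """
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in [0, 1)")
    if sd_baseline <= 0.0:
        raise ValueError("sd_baseline must be positive")
    sem = sd_baseline * math.sqrt(1.0 - reliability)  # standard error of measurement
    s_diff = math.sqrt(2.0) * sem                      # standard error of the difference
    return (score_after - score_before) / s_diff


def classify_change(rci, threshold=1.96):
    """Label a change using the conventional two-sided 95% cutoff."""
    if rci > threshold:
        return "reliable improvement"
    if rci < -threshold:
        return "reliable deterioration"
    return "no reliable change"


# Illustrative (made-up) per-item accuracies for an old and a new model version.
items = {"item_a": (0.40, 0.70), "item_b": (0.90, 0.60), "item_c": (0.50, 0.55)}
labels = {
    name: classify_change(reliable_change_index(old, new, sd_baseline=0.1, reliability=0.5))
    for name, (old, new) in items.items()
}
```

In this toy input the mean accuracy rises only slightly, yet one item improves reliably and another deteriorates reliably, which is the kind of bidirectional item-level change an aggregate score would hide.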
🧠 Llama