AI · Neutral · arXiv – CS AI · 4h ago · 7/10
Beyond Scores: Diagnostic LLM Evaluation via Fine-Grained Abilities
Researchers propose a cognitive diagnostic framework that evaluates large language models across fine-grained ability dimensions rather than aggregate scores, enabling targeted model improvement and task-specific selection. The approach uses multidimensional Item Response Theory to estimate abilities across 35 dimensions for mathematics and generalizes to physics, chemistry, and computer science with strong predictive accuracy.
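
For readers unfamiliar with multidimensional Item Response Theory, here is a minimal sketch of one common formulation, the compensatory multidimensional two-parameter logistic model. The summary does not say which variant the paper uses, so treat the model form as an assumption. The function name `mirt_probability` and all parameter values are hypothetical and chosen only for illustration. Three ability dimensions are shown for brevity, where the paper estimates 35 for mathematics.

```python
import math


def mirt_probability(theta, a, d):
    """Probability of a correct response under a compensatory
    multidimensional 2PL model: P = sigmoid(a . theta + d).

    theta: ability estimates for one model, one value per dimension
    a:     item discrimination (loading) on each ability dimension
    d:     item intercept (higher means an easier item)
    """
    if len(theta) != len(a):
        raise ValueError("theta and a must have the same length")
    logit = sum(ai * ti for ai, ti in zip(a, theta)) + d
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical values, not taken from the paper.
theta_model = [1.0, -0.5, 0.0]  # abilities of one LLM on 3 dimensions
a_item = [1.2, 0.0, 0.8]        # this item draws on dimensions 1 and 3
d_item = -0.2                   # item intercept

p = mirt_probability(theta_model, a_item, d_item)
print(f"P(correct) = {p:.3f}")
```

Because the item has zero loading on the second dimension, the model's weakness there does not affect the prediction. This is the kind of per-dimension attribution that a single aggregate score cannot give.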