
Beyond Scores: Diagnostic LLM Evaluation via Fine-Grained Abilities

arXiv – CS AI | Xu Zhang, Xudong Gong, Jiacheng Qin, Qiang Wang, JiaQi Liao, Zhe Wang, Dawei Feng, Bo Ding

🤖 AI Summary

Researchers propose a cognitive diagnostic framework that evaluates large language models across fine-grained ability dimensions rather than aggregate scores, enabling targeted model improvement and task-specific selection. The approach uses multidimensional Item Response Theory to estimate abilities across 35 dimensions for mathematics and generalizes to physics, chemistry, and computer science with strong predictive accuracy.

Analysis

Current LLM evaluation methodologies collapse diverse capabilities into single benchmark scores, masking critical performance variations across different problem types and cognitive requirements. This research addresses a fundamental gap in model assessment by introducing a principled diagnostic approach grounded in cognitive theory. The framework constructs detailed ability taxonomies—35 dimensions for mathematics, 27 for physics, 58 for chemistry, and 12 for computer science—then applies multidimensional Item Response Theory to estimate fine-grained ability levels that correlate with actual model performance.
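The scoring model at the heart of multidimensional Item Response Theory can be illustrated with a minimal sketch: a multidimensional two-parameter logistic (2PL) model, where an item's discrimination vector determines which ability dimensions it loads on. The dimension names and parameter values below are hypothetical, not taken from the paper:

```python
import math

def mirt_prob(theta, a, b):
    """Multidimensional 2PL IRT: probability that a model with ability
    vector theta answers an item with discrimination vector a and
    difficulty b correctly."""
    logit = sum(t * w for t, w in zip(theta, a)) - b
    return 1.0 / (1.0 + math.exp(-logit))

# A model strong on dimension 0 ("algebra") but weak on dimension 1
# ("geometry") -- hypothetical ability dimensions for illustration.
theta = [1.2, -0.4]
algebra_item = mirt_prob(theta, a=[1.5, 0.1], b=0.5)   # loads on dim 0
geometry_item = mirt_prob(theta, a=[0.1, 1.5], b=0.5)  # loads on dim 1
```

Because the two items load on different dimensions, the same model gets a high predicted success probability on the algebra item and a low one on the geometry item, which is exactly the variation a single aggregate score would hide.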

The approach demonstrates substantial practical utility through its predictive power, achieving AUC scores of 0.80–0.89 within benchmarks and 0.77–0.86 across different benchmarks when predicting performance on unseen questions. These results substantially exceed baseline models, suggesting the framework captures meaningful ability structure rather than noise. The consistency of ability estimates across multiple benchmarks strengthens confidence in the methodology's validity.
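That predictive check can be reproduced in spirit: use the estimated abilities to predict correctness probabilities for held-out questions, then score the predictions with AUC. A self-contained sketch on synthetic data, using the rank-based form of AUC (equivalent to scikit-learn's `roc_auc_score` when there are no tied scores; ties are not handled here):

```python
def auc(labels, scores):
    """Rank-based AUC: the probability that a randomly chosen positive
    example is scored higher than a randomly chosen negative one."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sum of 1-based ranks (by score) of the positive examples.
    rank_sum = sum(i + 1 for i, (_, y) in enumerate(pairs) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# labels: did the model actually answer each held-out question correctly?
# scores: IRT-predicted success probabilities (synthetic values here).
labels = [1, 0, 1, 1, 0]
scores = [0.9, 0.2, 0.8, 0.25, 0.3]
result = auc(labels, scores)  # one misranked positive -> AUC of 5/6
```

An AUC in the reported 0.77–0.89 range means the ability estimates rank unseen questions a given model will solve well above those it will miss, far better than chance (0.5).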

For the AI development ecosystem, this framework enables more sophisticated model selection and development strategies. Rather than selecting models based on single-score leaderboards, practitioners can match specific models to tasks based on their fine-grained ability profiles. This diagnostic perspective supports targeted training interventions, allowing researchers to identify and address specific capability gaps. Organizations can design benchmarks strategically around ability dimensions rather than arbitrary task collections.
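Ability-guided selection can be sketched as matching a task's dimension requirements against each candidate model's estimated ability profile. All model names, dimension labels, and weights below are hypothetical:

```python
def task_fit(ability, requirement):
    """Weighted fit: sum of per-dimension requirement weight times
    the model's estimated ability on that dimension."""
    return sum(w * a for w, a in zip(requirement, ability))

# Estimated ability vectors, e.g. over [algebra, geometry, calculus].
models = {
    "model_a": [1.4, 0.2, 0.9],
    "model_b": [0.6, 1.3, 0.5],
}
task = [0.1, 0.8, 0.1]  # this task leans heavily on geometry

best = max(models, key=lambda name: task_fit(models[name], task))
```

Here model_a would top a single-score leaderboard on overall ability, but the geometry-heavy task profile selects model_b, illustrating how fine-grained profiles change the selection outcome.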

Future applications may include ability-aware benchmark design that systematically covers cognitive skill gaps, personalized fine-tuning strategies targeting weak dimensions, and improved transfer learning predictions based on ability overlap between domains. The framework's successful generalization across scientific domains suggests broader applicability to other specialized knowledge areas.

Key Takeaways
  • Fine-grained diagnostic framework reveals LLM ability variations obscured by aggregate benchmark scores
  • Multidimensional Item Response Theory enables prediction of model performance on unseen questions with AUC 0.77-0.89
  • Framework successfully generalizes across mathematics, physics, chemistry, and computer science with domain-specific ability taxonomies
  • Enables ability-guided model selection and targeted training strategies beyond conventional leaderboard rankings
  • Opens pathways for more sophisticated benchmark design based on cognitive ability structure