#item-response-theory News & Analysis

6 articles tagged with #item-response-theory. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

6 articles

AIBullisharXiv – CS AI · Jun 97/10

🧠

Item Response Scaling Laws: A Measurement Theory Approach for Efficient and Generalizable Neural Scaling Estimation

Researchers introduce Item Response Scaling Laws (IRSL), a framework that dramatically reduces computational costs for estimating language model performance by decomposing the problem into model ability and question difficulty components. The approach achieves 99.9% reduction in required evaluation samples while maintaining or exceeding accuracy of traditional scaling law methods.

AINeutralarXiv – CS AI · May 97/10

🧠

Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models

Researchers propose Dynamic Boundary Evaluation (DBE), a new methodology for assessing large language models that adapts to each model's capability level rather than applying fixed benchmarks. The approach identifies performance boundaries where models achieve ~50% accuracy and calibrates them on a unified difficulty scale, addressing limitations in traditional evaluation that produce ceiling and floor effects masking true capability gaps.

AINeutralarXiv – CS AI · Apr 157/10

🧠

Beyond Scores: Diagnostic LLM Evaluation via Fine-Grained Abilities

Researchers propose a cognitive diagnostic framework that evaluates large language models across fine-grained ability dimensions rather than aggregate scores, enabling targeted model improvement and task-specific selection. The approach uses multidimensional Item Response Theory to estimate abilities across 35 dimensions for mathematics and generalizes to physics, chemistry, and computer science with strong predictive accuracy.

AINeutralarXiv – CS AI · Jun 236/10

🧠

Efficient Safety Benchmarking via Item Response Theory

Researchers propose using Item Response Theory (IRT) to dramatically reduce the computational cost of safety benchmarking for language models, achieving 80-99.8% cost reductions while maintaining ranking accuracy. The approach addresses the inefficiency of current static evaluation paradigms that treat all test items equally, enabling more scalable safety assessment as AI systems become increasingly complex.

AINeutralarXiv – CS AI · Jun 16/10

🧠

Diagnosing the Reliability of LLM-as-a-Judge via Item Response Theory

Researchers introduce a diagnostic framework using Item Response Theory (IRT) to assess the reliability of Large Language Models used as automated judges. The framework evaluates LLM judges on two dimensions: intrinsic consistency (stability under prompt variations) and human alignment (correspondence with human assessments), providing practical guidance for identifying unreliability sources.

AINeutralarXiv – CS AI · May 116/10

🧠

An Interpretable and Scalable Framework for Evaluating Large Language Models

Researchers introduce a scalable framework for evaluating large language models using Item Response Theory and majorization-minimization algorithms, achieving orders-of-magnitude speedups while improving interpretability. The method addresses computational limitations of traditional benchmarking approaches and provides insights into model abilities and benchmark item characteristics.