Beyond Confidence: Rethinking Self-Assessments for Performance Prediction in LLMs
Researchers propose using multidimensional self-assessment based on cognitive appraisal theory to predict LLM failures more reliably than confidence alone. Testing across 12 models and 38 tasks, they find effort and ability dimensions consistently outperform confidence, with task type determining which dimension proves most predictive.
This research addresses a fundamental challenge in deploying large language models: determining when they're likely to fail. Traditional approaches rely on confidence scores, but LLMs consistently express overconfidence in incorrect answers, creating dangerous blind spots in high-stakes applications like healthcare, legal analysis, and financial advising. The study's innovation lies in decomposing self-assessment into distinct psychological dimensions—effort, ability, and affective factors—rather than treating confidence as monolithic.
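A minimal sketch of what this decomposition could look like at inference time, assuming a simple prompt-and-parse approach; the dimension names, prompt wording, and 0-to-1 scale here are illustrative stand-ins, not the paper's actual instrument:

```python
from dataclasses import dataclass

# Illustrative dimension names; the paper's actual instrument may differ.
@dataclass
class SelfAssessment:
    confidence: float  # classic verbalized confidence, kept for comparison
    effort: float      # perceived difficulty of this particular task
    ability: float     # perceived competence on this kind of task
    affect: float      # e.g. ease/comfort while producing the answer

# A hypothetical elicitation prompt appended after the model's answer.
ASSESSMENT_PROMPT = """\
Rate each item from 0.0 to 1.0, one per line, as `name: value`.
confidence: how likely is your answer to be correct?
effort: how hard was this task for you?
ability: how capable are you at tasks of this kind?
affect: how comfortable did you feel answering?
"""

def parse_assessment(raw: str) -> SelfAssessment:
    """Parse `name: value` lines into a SelfAssessment, defaulting to 0.5."""
    scores = {"confidence": 0.5, "effort": 0.5, "ability": 0.5, "affect": 0.5}
    for line in raw.splitlines():
        name, _, value = line.partition(":")
        name = name.strip().lower()
        if name in scores:
            try:
                scores[name] = min(1.0, max(0.0, float(value)))
            except ValueError:
                pass  # keep the default if the model answers off-format
    return SelfAssessment(**scores)
```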
The framework draws on cognitive psychology research showing that humans evaluate themselves through multiple, independent mechanisms. Applying it to LLMs, the researchers found that effort-related assessments (how hard the model perceives a task to be) yield more honest, stable predictions of correctness than verbalized confidence. That this stability holds across model sizes suggests effort captures something fundamental about task difficulty rather than an artifact of scale or training data.
For AI safety and deployment, these results carry substantial implications. Organizations implementing LLMs in decision-critical contexts could use multidimensional assessments to route uncertain cases to human review more effectively. The finding that task characteristics determine which dimension matters most—effort for reasoning, ability for retrieval—enables more granular calibration strategies. This moves beyond one-size-fits-all confidence thresholds toward context-aware reliability metrics.
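A routing policy along these lines might look like the following sketch, reusing the SelfAssessment class from the earlier example; the task taxonomy, thresholds, and fallback are assumptions chosen to illustrate the effort-for-reasoning, ability-for-retrieval pattern, not values reported in the study:

```python
# Illustrative policy: which dimension to inspect per task type, the
# threshold to apply, and whether exceeding it (vs. falling below it)
# should trigger human review. All numbers here are placeholders.
ROUTING_POLICY = {
    "reasoning": ("effort", 0.7, True),    # high perceived effort -> review
    "retrieval": ("ability", 0.4, False),  # low perceived ability -> review
}

def needs_human_review(task_type: str, assessment: SelfAssessment) -> bool:
    """Route to human review using the dimension assumed most predictive
    for this task type, falling back to a plain confidence gate."""
    if task_type not in ROUTING_POLICY:
        return assessment.confidence < 0.6  # one-size-fits-all fallback
    dimension, threshold, flag_above = ROUTING_POLICY[task_type]
    score = getattr(assessment, dimension)
    return score > threshold if flag_above else score < threshold
```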
The research also opens practical pathways for improvement. Rather than trying to calibrate a single confidence score, developers can prompt or fine-tune systems to produce structured self-assessments along these psychologically grounded, empirically testable dimensions. As LLM deployment accelerates across industries, robust failure prediction becomes increasingly critical for liability, user trust, and safety.
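To make "empirically testable" concrete, a small harness could measure how well each elicited dimension separates correct from incorrect answers on a labeled benchmark, for instance via rank-based AUROC. The sketch below assumes the SelfAssessment class from earlier; the records list is a placeholder to be filled from real evaluation runs:

```python
def auroc(scores: list[float], labels: list[bool]) -> float:
    """Probability that a randomly chosen correct answer outscores a
    randomly chosen incorrect one (ties count half): rank-based AUROC."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Placeholder records of (assessment, answer_was_correct); in practice
# these come from running the elicitation prompt over a labeled benchmark.
records: list[tuple[SelfAssessment, bool]] = []

for dim in ("confidence", "effort", "ability", "affect"):
    labels = [ok for _, ok in records]
    scores = [getattr(a, dim) for a, _ in records]
    if dim == "effort":
        scores = [1.0 - s for s in scores]  # high effort should predict failure
    print(f"{dim}: AUROC={auroc(scores, labels):.3f}")
```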
- Effort and ability dimensions predict LLM failures more reliably than confidence across most tasks and models
- Effort-based assessments remain stable and less overoptimistic regardless of model size, unlike confidence scores
- Different task types benefit from different self-assessment dimensions, enabling more targeted reliability strategies
- Multidimensional self-assessment grounded in cognitive psychology offers a practical framework for improving LLM safety in deployment
- Structured self-evaluation could reduce overconfident predictions without requiring architectural model changes