A Framework for Measuring Appropriate Reliance on Set-Valued AI Advice
Researchers introduce the first formal framework for evaluating whether humans rely appropriately on set-valued AI advice (discrete sets or continuous intervals) rather than point predictions. The framework defines metrics for both classification and regression tasks, addressing a gap in human-AI collaboration research by measuring not just whether advice is followed, but whether that reliance actually improves decision-making outcomes.
This research addresses a fundamental challenge in human-AI collaboration that has grown more urgent as AI systems become integrated into consequential decision-making processes. While prior work focused on point predictions, real-world AI systems often communicate uncertainty through ranges, confidence intervals, or discrete option sets. This paper fills that methodological gap by formalizing how to measure whether humans rely on AI advice appropriately, that is, whether they use the advice productively rather than merely use it.
The framework distinguishes between classification and regression contexts, recognizing that different decision environments require different evaluation approaches. For classification, the researchers introduce correct reliance rates that capture whether humans defer to the AI when its advice should be trusted and rely on their own judgment when that serves them better. For regression tasks, they separate quantity of reliance (whether advice was consulted) from quality of reliance (whether following it improved outcomes). This nuance matters because a human might use AI advice but still make suboptimal decisions, or conversely, ignore advice that would have helped.
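To make the distinction concrete, the following is a minimal Python sketch of how such metrics could be computed. It is not the paper's definition: the summary above does not give formulas, so the trial records, the function names, and the specific choices here are assumptions made for illustration. For classification, it assumes each trial records the human's decision before and after seeing the advice set, and counts reliance as "correct" when the human ends up right in cases where the advice set contains the truth and they started out wrong (AI reliance), or where the advice set excludes the truth and they started out right (self-reliance). For regression, it assumes quantity is the fraction of the gap to the advised interval that the human closed, and quality is the reduction in absolute error.

```python
from dataclasses import dataclass
from math import nan
from typing import List, Set, Tuple


@dataclass
class ClassificationTrial:  # hypothetical record layout
    truth: str        # ground-truth label
    initial: str      # human decision before seeing the advice
    advice: Set[str]  # set-valued AI advice
    final: str        # human decision after seeing the advice


@dataclass
class RegressionTrial:  # hypothetical record layout
    truth: float
    initial: float  # human estimate before seeing the advice
    lower: float    # lower bound of the advised interval
    upper: float    # upper bound of the advised interval
    final: float    # human estimate after seeing the advice


def correct_reliance_rates(trials: List[ClassificationTrial]) -> Tuple[float, float]:
    """Return (correct AI reliance rate, correct self-reliance rate)."""
    # Assumed "AI should be trusted" cases: human started wrong, set contains truth.
    ai_cases = [t for t in trials if t.initial != t.truth and t.truth in t.advice]
    # Assumed "self should be trusted" cases: human started right, set excludes truth.
    self_cases = [t for t in trials if t.initial == t.truth and t.truth not in t.advice]

    def rate(cases: List[ClassificationTrial]) -> float:
        if not cases:
            return nan  # undefined when no such cases occurred
        return sum(t.final == t.truth for t in cases) / len(cases)

    return rate(ai_cases), rate(self_cases)


def distance_to_interval(x: float, lower: float, upper: float) -> float:
    """Distance from x to the interval [lower, upper]; 0.0 if x lies inside."""
    return max(lower - x, 0.0, x - upper)


def reliance_quantity(t: RegressionTrial) -> float:
    """Fraction of the initial gap to the advised interval that the human closed."""
    gap = distance_to_interval(t.initial, t.lower, t.upper)
    if gap == 0.0:
        return nan  # initial estimate already inside the interval
    return (gap - distance_to_interval(t.final, t.lower, t.upper)) / gap


def reliance_quality(t: RegressionTrial) -> float:
    """Reduction in absolute error; positive means the final estimate improved."""
    return abs(t.initial - t.truth) - abs(t.final - t.truth)
```

Under these assumptions, a human can score high on quantity (moving most of the way toward the interval) and still score negative on quality if the interval missed the true value, which is the kind of mismatch the framework is meant to expose.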
For AI developers and organizations deploying human-in-the-loop systems, this framework provides concrete metrics for understanding collaborative effectiveness beyond raw accuracy. Industries relying on human-AI teams—from healthcare diagnostics to financial forecasting—need principled ways to measure collaboration quality. The framework enables more rigorous evaluation of whether uncertainty communication strategies actually improve human decision-making or simply shift blame. This supports better system design that genuinely enhances human capabilities rather than merely automating judgment.
- First formal framework for evaluating set-valued AI advice rather than single-point predictions
- Introduces distinct metrics for classification (correct reliance rates) and regression (quantity vs. quality of reliance)
- Measures whether humans are appropriately trusting or distrusting AI advice, not just whether they use it
- Addresses gap in human-AI collaboration research by capturing nuances existing metrics overlook
- Applicable to industries requiring human-in-the-loop decision systems across multiple domains