Toward Human-AI Complementarity Across Diverse Tasks
A research study evaluates whether combining human and AI judgments can improve decision-making across diverse tasks, finding only modest complementarity gains of 0.4 percentage points. The primary bottleneck identified is not human accuracy, but the difficulty of routing decisions to humans at the right moments and of designing assistance methods that help humans catch AI mistakes.
This research addresses a critical challenge in AI deployment: human-AI complementarity as a mechanism for AI oversight. The study tested three approaches—baseline hybridization, top-2 assistance, and subtask delegation—on 1,886 samples spanning knowledge, factuality, reasoning, and deception detection tasks.

Results reveal a sobering reality for those betting on simple human-AI collaboration: the 0.4 percentage point improvement over AI-alone performance falls far short of complementarity's promise. The core limitation stems from two factors: only 8.9% of test items exhibited the ideal scenario where AI failed but humans succeeded, and confidence-based routing proved ineffective since model confidence distributions were similar for both correct and incorrect predictions. The top-2 assistance method showed more promise, improving human accuracy from 28.4% to 38.3%, but this gain primarily reflected humans accepting correct AI suggestions rather than successfully identifying and overriding AI errors.

This finding carries significant implications for AI safety and oversight frameworks. Many proposed governance models assume humans can reliably catch AI mistakes when appropriately informed, but the research suggests this assumption may be overstated. The bottleneck isn't task difficulty per se but rather identifying the precise moments when human judgment is genuinely needed and designing interfaces that empower rather than anchor human decision-makers.

These results suggest future work should focus on improved decision-routing mechanisms and assistance designs that highlight AI uncertainty and enable genuine human override capabilities, rather than assuming humans will naturally catch errors in high-confidence AI outputs.
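To see why confidence-based routing breaks down when confidence distributions overlap, consider the following illustrative simulation. This is not the study's code: all numbers (accuracy, confidence means, threshold) are synthetic assumptions chosen only to demonstrate the mechanism.

```python
# Illustrative sketch: confidence-based routing sends low-confidence items to
# a human reviewer. When correct and incorrect predictions have nearly the
# same confidence distribution, routing catches few errors and the routed
# pool is barely more error-rich than the overall pool.
import random

random.seed(0)

def simulate(n=10_000, threshold=0.7):
    """Route items with model confidence below `threshold` to a human."""
    routed = routed_wrong = wrong = 0
    for _ in range(n):
        correct = random.random() < 0.75          # assume model is right 75% of the time
        # Overlapping confidence distributions (synthetic): correct and
        # incorrect predictions look almost identical to the router.
        conf = random.gauss(0.80 if correct else 0.78, 0.10)
        if not correct:
            wrong += 1
        if conf < threshold:                      # route this item to a human
            routed += 1
            if not correct:
                routed_wrong += 1
    recall = routed_wrong / wrong                 # share of AI errors a human ever sees
    precision = routed_wrong / routed if routed else 0.0
    return recall, precision

recall, precision = simulate()
print(f"error recall: {recall:.2f}, routing precision: {precision:.2f}")
```

Under these assumptions, only roughly a fifth of AI errors reach a human, and the routed items are only marginally more likely to be wrong than the 25% base error rate, so routing adds little value over reviewing items at random.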
- Human-AI complementarity on realistic tasks yields only modest performance gains of 0.4 percentage points, challenging assumptions about oversight effectiveness.
- The primary bottleneck is routing decisions to humans at the right moments, not improving human task accuracy itself.
- Confidence-based routing fails because AI model confidence is similarly distributed across correct and incorrect predictions.
- Top-2 assistance improves human performance primarily through AI suggestion adoption, not human error detection.
- Future AI oversight design must focus on identifying decision points where humans add value and creating interfaces that enable genuine human override capability.