When No Answer Is Correct: Diagnosing Absent Answer Detection for MLLMs in Video Understanding
Researchers have identified a critical reliability flaw in multimodal large language models (MLLMs) used for video understanding: when the correct answer is absent from the available options, these models fail to recognize its absence and instead select plausible but incorrect alternatives. Testing across multiple models and benchmarks shows that this limitation is especially severe in temporal reasoning tasks and worsens as more video frames are sampled, with chain-of-thought prompting offering only partial mitigation.
This diagnostic study exposes a fundamental weakness in how MLLMs approach video understanding tasks when faced with deliberately incomplete answer sets. The research evaluates three distinct scenarios (multiple-choice with a "None of the Above" option, open-ended generation with explicit detection instructions, and standard evaluation) and consistently finds that models gravitate toward plausible distractors rather than recognizing that no correct answer is present. This behavior undermines the reliability claims often made about advanced language models in real-world applications.
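The multiple-choice scenario described above can be sketched in a few lines. This is a minimal illustration, not the study's actual protocol: it assumes an absent-answer item is built by dropping the correct option and appending "None of the Above", and the names `MCQItem`, `make_absent_answer_variant`, and `absent_detection_rate` are hypothetical.

```python
from dataclasses import dataclass
from typing import List

NOTA = "None of the Above"  # assumed wording of the fallback option


@dataclass
class MCQItem:
    question: str
    options: List[str]
    answer_index: int  # index of the correct option within `options`


def make_absent_answer_variant(item: MCQItem) -> MCQItem:
    """Drop the correct option and append NOTA, which becomes the correct choice."""
    distractors = [
        opt for i, opt in enumerate(item.options) if i != item.answer_index
    ]
    options = distractors + [NOTA]
    return MCQItem(item.question, options, len(options) - 1)


def absent_detection_rate(items: List[MCQItem], predictions: List[int]) -> float:
    """Fraction of absent-answer items on which the predicted index is NOTA."""
    if not items:
        return 0.0
    hits = sum(
        1 for item, pred in zip(items, predictions) if pred == item.answer_index
    )
    return hits / len(items)


# Hypothetical example item; the predictions stand in for model outputs.
original = MCQItem(
    question="What happens after the door opens?",
    options=["A dog runs in", "A man sits down", "The light turns off"],
    answer_index=1,
)
variant = make_absent_answer_variant(original)
rate = absent_detection_rate([variant, variant], predictions=[2, 0])
```

In this sketch a model that picks a remaining distractor (index 0) counts as a miss, which is the failure mode the study reports.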
The finding reflects a broader challenge in AI development: current scaling approaches optimize for pattern matching and prediction rather than explicit reasoning about absence or uncertainty. That the tendency worsens with denser frame sampling suggests that information overload exacerbates the problem, pushing models toward confident but incorrect selections. This pattern mirrors known issues in other domains where MLLMs struggle with adversarial or edge-case scenarios.
For practitioners deploying these systems in critical applications such as medical diagnosis, legal analysis, or safety-critical video monitoring, this limitation poses significant risks. Users may receive confident, plausible-sounding but fundamentally incorrect responses when no valid answer exists. The research shows that prompting-based mitigations yield only partial improvements, suggesting the issue requires architectural changes rather than prompt engineering fixes.
Future work likely requires integrating explicit confidence thresholding, uncertainty quantification mechanisms, or hybrid systems combining MLLMs with traditional symbolic reasoning. This study signals that current-generation models need fundamental redesigns to handle cases where "no answer" is the correct response, a capability essential for deployment in high-stakes domains.
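One of the directions named above, confidence thresholding, can be illustrated with a short sketch. It assumes per-option logits are available from the model, and the function name `select_or_abstain` and the 0.6 threshold are hypothetical choices made for illustration only, not anything proposed in the study.

```python
import math
from typing import List, Optional


def softmax(logits: List[float]) -> List[float]:
    """Convert raw option scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_or_abstain(
    option_logits: List[float], threshold: float = 0.6
) -> Optional[int]:
    """Return the top option's index, or None (abstain) if its probability
    falls below `threshold`. The threshold value is an assumption."""
    if not option_logits:
        raise ValueError("option_logits must not be empty")
    probs = softmax(option_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else None


confident = select_or_abstain([5.0, 0.0, 0.0])  # one option clearly dominates
uncertain = select_or_abstain([1.0, 1.0, 1.0])  # scores tied, so abstain
```

A threshold of this kind only helps when the model's scores are spread out. Since the study reports confident but incorrect selections, thresholding alone would likely miss many absent-answer cases, which is consistent with the call for more fundamental changes.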
- MLLMs systematically fail to detect absent correct answers, instead selecting plausible distractors with high confidence
- The problem is most severe in temporal reasoning tasks and worsens as video frame sampling becomes denser
- Chain-of-thought prompting improves detection only partially and does not solve the problem
- Prompt-based strategies alone appear insufficient; the study suggests explicit detection mechanisms may need to be built into model architecture
- This limitation poses significant risks for deploying MLLMs in critical applications requiring high reliability