MACD: Model-Aware Contrastive Decoding via Counterfactual Data
Researchers introduce MACD, a new inference strategy that reduces hallucinations in video language models by using the model's own feedback to identify problematic visual regions and generate targeted counterfactual data. The method combines model-aware object-level modifications with contrastive decoding, showing consistent improvements across multiple benchmarks and video-LLM architectures.
Video language models, despite their capabilities, frequently generate plausible-sounding but factually incorrect descriptions when visual information is weak or ambiguous, a problem known as hallucination. Contrastive decoding has shown promise in reducing hallucinations, but existing methods rely on random perturbations that do not specifically target the visual cues causing the errors. MACD instead makes the mitigation process model-aware, using the video-LLM's own internal signals to pinpoint which object regions drive hallucinations.
The innovation lies in the counterfactual construction strategy. Rather than applying arbitrary frame-level or temporal modifications, MACD generates targeted object-level counterfactual inputs based on model feedback. Because the counterfactual data is aligned directly with the model's weaknesses, the contrastive decoding phase becomes more effective at enforcing evidence-grounded token selection. The method demonstrates consistent hallucination reduction on the EventHallusion, MVBench, Perception-test, and Video-MME benchmarks while maintaining or improving task accuracy across diverse architectures, including Qwen and InternVL.
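To make the contrastive decoding step concrete, the sketch below shows the generic formulation used in earlier visual contrastive decoding work, not MACD's exact scoring rule, which this summary does not specify. It assumes two sets of next-token logits: one from the original video and one from a counterfactual video in which the relevant object region has been modified. The function names, the `alpha` and `beta` parameters, and the toy vocabulary and numbers are all hypothetical and chosen only for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_scores(logits_orig, logits_cf, alpha=1.0, beta=0.1):
    """Generic contrastive decoding scores (an assumption, not MACD's exact rule).

    Tokens whose probability stays high even after the visual evidence is
    altered are penalized; tokens that depend on the evidence are boosted.
    `beta` is a plausibility cutoff: tokens whose original probability is
    below beta times the top probability are excluded.
    """
    lp_orig = log_softmax(logits_orig)
    lp_cf = log_softmax(logits_cf)
    cutoff = max(lp_orig) + math.log(beta)
    scores = []
    for lo, lc in zip(lp_orig, lp_cf):
        if lo < cutoff:
            scores.append(float("-inf"))  # implausible under the original input
        else:
            scores.append((1 + alpha) * lo - alpha * lc)
    return scores


def pick_token(scores):
    """Index of the highest-scoring token (greedy selection)."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy example with made-up numbers: the video shows a dog, but a language
# prior makes the model slightly prefer "cat".
vocab = ["dog", "cat", "frisbee"]
logits_orig = [2.0, 2.2, 0.5]  # original video
logits_cf = [0.2, 2.3, 0.6]    # counterfactual: dog region modified

greedy_choice = vocab[pick_token(logits_orig)]
contrastive_choice = vocab[pick_token(contrastive_scores(logits_orig, logits_cf))]
print(greedy_choice, contrastive_choice)
```

In this toy case, "cat" keeps its high score when the dog region is altered, which suggests it is driven by a prior rather than by visual evidence, so the contrastive score demotes it and "dog" is selected instead. The quality of the counterfactual input determines how informative this contrast is, which is where a model-aware construction would matter.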
For the AI and ML community, this research improves the practical reliability of video understanding systems, particularly in challenging scenarios involving small, occluded, or overlapping objects. Better hallucination mitigation expands deployment possibilities for video-LLMs in safety-critical applications. The approach also provides a scalable inference-time solution that doesn't require retraining, making adoption straightforward. As video understanding becomes increasingly important for autonomous systems and content analysis, reducing hallucination through interpretable, model-aware methods strengthens the foundation for trustworthy AI applications.
- MACD uses model feedback to identify object regions causing hallucinations rather than applying random perturbations.
- The method combines model-aware counterfactual construction with contrastive decoding to enforce evidence-grounded token selection.
- Testing across four benchmarks shows consistent hallucination reduction without sacrificing task accuracy.
- Strongest improvements appear in challenging scenarios with small, occluded, or co-occurring objects.
- The inference-time approach works across multiple video-LLM architectures without requiring model retraining.