CAPT: Confusion-Aware Prompt Tuning for Reducing Vision-Language Misalignment
arXiv (cs.AI) | Maoyuan Shao, Yutong Gao, Xinyang Huang, Chuang Zhu, Lijuan Sun, Guoshun Nan
AI Summary
Researchers propose CAPT, a Confusion-Aware Prompt Tuning framework that addresses systematic misclassifications in vision-language models like CLIP by learning from the model's own confusion patterns. The method uses a Confusion Bank to model persistent category misalignments and introduces specialized modules to capture both semantic and sample-level confusion cues.
Key Takeaways
- The CAPT framework successfully resolves 50.72% of confusable sample pairs in vision-language models across 11 benchmark datasets.
- The approach finds that model confusion is not random but occurs consistently between specific category pairs, revealing intrinsic biases.
- The framework introduces three key components: a Semantic Confusion Miner, a Sample Confusion Miner, and a Multi-Granularity Difference Expert module.
- The method enhances discriminability and generalization for both base and novel classes in cross-modal representation learning.
- CAPT significantly reduces confusion-induced errors while maintaining model performance on standard benchmarks.
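The core idea behind the Confusion Bank, as described in the summary, is to log which category pairs the model persistently confuses so that those pairs can be targeted during prompt tuning. A minimal sketch of that bookkeeping step (hypothetical function and threshold names; not the authors' implementation) might look like:

```python
from collections import Counter

def build_confusion_bank(true_labels, pred_labels, min_count=2):
    """Accumulate misclassification pairs and keep the persistent ones.

    A hypothetical sketch of the 'Confusion Bank' idea from CAPT:
    pairs that recur at least `min_count` times are treated as
    systematic (non-random) confusions worth targeting.
    """
    bank = Counter()
    for true, pred in zip(true_labels, pred_labels):
        if true != pred:
            bank[(true, pred)] += 1
    # Filter out one-off errors; keep only recurring category pairs.
    return {pair: n for pair, n in bank.items() if n >= min_count}

# Toy example: "cat" is repeatedly mistaken for "lynx".
true_y = ["cat", "cat", "dog", "cat", "dog"]
pred_y = ["lynx", "lynx", "dog", "lynx", "wolf"]
bank = build_confusion_bank(true_y, pred_y)
```

Here the recurring ("cat", "lynx") pair would enter the bank, while the single ("dog", "wolf") error would be discarded as noise; the paper's actual mechanism operates on model confusion patterns rather than this simple counting.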
#vision-language-models #clip #prompt-tuning #machine-learning #computer-vision #nlp #model-alignment #arxiv #research #multimodal
Read Original via arXiv (cs.AI)