Faithful-First Reasoning, Planning, and Acting for Multimodal LLMs
Researchers propose Faithful-First RPA, a framework that improves multimodal AI reasoning by prioritizing faithfulness to visual evidence. The method uses FaithEvi for supervision and FaithAct for execution, achieving up to 24% improvement in perceptual faithfulness without sacrificing task accuracy.
Multimodal large language models demonstrate impressive capabilities but suffer from a critical flaw: they generate reasoning that contradicts visual inputs or diverges from their own conclusions, a phenomenon known as unfaithfulness. This paper addresses a fundamental challenge in AI reliability by introducing a framework that treats faithfulness as a primary objective rather than an afterthought.

The Faithful-First RPA approach uses two complementary components: FaithEvi evaluates whether intermediate reasoning steps align with visual evidence at both the step and chain levels, while FaithAct leverages these faithfulness signals to guide action planning during inference. The technical innovation lies in providing explicit supervision signals that hold models accountable to their perceptual inputs throughout reasoning chains.

This work addresses a persistent concern in deployed AI systems, where models hallucinate or contradict themselves and undermine user trust. The up-to-24% improvement in perceptual faithfulness without accuracy degradation suggests the approach successfully decouples hallucination reduction from task performance. For developers and researchers, this represents progress toward interpretable, trustworthy multimodal AI systems. The unified evaluation and enforcement framework provides both diagnostic tools and corrective mechanisms, enabling more principled development of multimodal models. As AI systems increasingly influence real-world decisions, ensuring their reasoning aligns with actual evidence becomes commercially and ethically critical. The code release facilitates adoption across research and industry applications.
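To make the evaluate-then-enforce idea concrete, here is a minimal toy sketch of a faithfulness-gated reasoning loop in the spirit described above. Everything in it is an illustrative assumption, not the paper's actual method: `faith_evi_step`, `faith_evi_chain`, `faith_act`, the word-overlap scoring, and the 0.5 threshold are all hypothetical stand-ins for the learned FaithEvi scorer and the FaithAct planner.

```python
# Hypothetical sketch only: the scoring heuristic, function names, and
# threshold are assumptions for illustration, not the Faithful-First RPA
# implementation.

def faith_evi_step(step: str, evidence: set[str]) -> float:
    """Toy step-level score: the fraction of the step's claimed objects
    that actually appear in the visual evidence (a stand-in for a
    learned step-wise faithfulness scorer)."""
    claims = set(step.lower().split())
    if not claims:
        return 1.0
    return len(claims & evidence) / len(claims)

def faith_evi_chain(steps: list[str], evidence: set[str]) -> float:
    """Toy chain-level score: a chain is only as faithful as its
    weakest step."""
    return min((faith_evi_step(s, evidence) for s in steps), default=1.0)

def faith_act(steps: list[str], evidence: set[str],
              threshold: float = 0.5) -> list[str]:
    """Execute reasoning steps only while they stay grounded; stop at
    the first step whose faithfulness score drops below the threshold,
    so ungrounded steps never drive downstream actions."""
    accepted = []
    for step in steps:
        if faith_evi_step(step, evidence) < threshold:
            break  # reasoning has drifted from the visual evidence
        accepted.append(step)
    return accepted

# Example: the third step mentions objects absent from the scene,
# so the gated loop halts before acting on it.
evidence = {"dog", "ball", "grass"}
steps = ["dog grass", "dog ball", "cat tree"]
print(faith_act(steps, evidence))
```

The design point the sketch captures is the ordering: faithfulness is checked before each step is allowed to influence action planning, rather than audited after the full chain is produced.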
- Faithful-First RPA framework improves multimodal AI faithfulness to visual evidence by up to 24%
- FaithEvi provides step-wise and chain-level supervision to catch reasoning drift from visual inputs
- Method maintains task accuracy while reducing hallucination, avoiding common accuracy-faithfulness tradeoffs
- Unified framework enables both evaluation and enforcement of faithfulness in multimodal reasoning systems
- Code availability supports adoption in research and production AI applications