OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving
OmniDrive-R1 is a new Vision-Language Model (VLM) framework that addresses critical reliability failures in autonomous driving by combining perception and reasoning through an interleaved multi-modal chain-of-thought mechanism, raising final-answer accuracy from 37.81% to 73.62% without requiring dense localization labels.
OmniDrive-R1 represents a meaningful advance in making Vision-Language Models safer for autonomous driving applications. The core problem it targets is object hallucination, where VLMs report objects that are not actually present, and it attacks this by moving from purely text-based reasoning toward an interleaved analysis of visual and textual evidence. Traditional multi-modal approaches separate perception from reasoning, creating optimization gaps and requiring expensive labeled datasets. OmniDrive-R1 sidesteps these bottlenecks through reinforcement-driven visual grounding that lets the model autonomously focus on critical regions of a scene, much as a human driver concentrates on relevant hazards; a sketch of this interleaved loop follows below.
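To make the mechanism concrete, here is a minimal sketch of such an interleaved loop in Python. The `vlm.generate_step` call, its `step.text`/`step.is_final` fields, and the `<box>…</box>` region tag are illustrative assumptions rather than the OmniDrive-R1 API; the point is only that grounding (cropping a model-selected region) happens inside the reasoning loop instead of in a separate perception stage.

```python
import re
from PIL import Image

def parse_region(text: str):
    """Extract a predicted bounding box like <box>x1,y1,x2,y2</box>
    from a reasoning segment; returns pixel coords or None.
    The tag format is a hypothetical convention for this sketch."""
    m = re.search(r"<box>(\d+),\s*(\d+),\s*(\d+),\s*(\d+)</box>", text)
    return tuple(map(int, m.groups())) if m else None

def interleaved_cot(vlm, image: Image.Image, question: str, max_steps: int = 4) -> str:
    """Alternate text reasoning with visual grounding: whenever the model
    emits a region token, crop that region and append it to the context
    as a new visual observation before the next reasoning step."""
    context = [image, question]
    answer = ""
    for _ in range(max_steps):
        step = vlm.generate_step(context)  # hypothetical API: one chain-of-thought segment
        context.append(step.text)
        answer = step.text
        region = parse_region(step.text)
        if region is not None:
            context.append(image.crop(region))  # zoom into the model-selected region
        if step.is_final:  # model signals it has reached an answer
            break
    return answer
```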
The technical innovation centers on Clip-GRPO, a reinforcement learning algorithm that derives annotation-free reward signals from cross-modal consistency. This eliminates the need for dense ground-truth labels while avoiding the instability of external tool dependencies. A two-stage reinforcement learning pipeline trains the model to iteratively improve both what it looks at and how it reasons about the visual scene. On the DriveLMM-o1 benchmark, reasoning scores rose from 51.77% to 80.35%, while final-answer accuracy nearly doubled, from 37.81% to 73.62%.
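The paper's exact reward is not reproduced here; the sketch below shows one plausible instantiation of an annotation-free cross-modal consistency reward, scoring how well a grounded image region matches the model's textual claim with an off-the-shelf CLIP model (via the `open_clip` package), combined with GRPO's standard group-relative advantage. The backbone choice and the precise reward form are assumptions.

```python
import torch
import open_clip

# Off-the-shelf CLIP backbone; an assumption, not necessarily the paper's choice.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

@torch.no_grad()
def consistency_reward(image_patch, claim_text: str) -> float:
    """Annotation-free reward: cosine similarity between the region the
    model grounded on and the textual claim it made about that region.
    No ground-truth boxes or labels are required."""
    img = preprocess(image_patch).unsqueeze(0)   # PIL patch -> [1, 3, H, W]
    txt = tokenizer([claim_text])
    img_f = model.encode_image(img)
    txt_f = model.encode_text(txt)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    return (img_f @ txt_f.T).item()

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO's group-relative advantage: normalize each sampled rollout's
    reward against its group's mean and std, removing the need for a
    learned value critic. `rewards` holds one group of rollout rewards."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```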
For the autonomous driving industry, this work signals progress toward more reliable AI systems in safety-critical contexts. Reduced hallucination and improved reasoning translate directly into safer perception pipelines, and the annotation-free reward mechanism lowers deployment costs, making the approach economically viable for real-world systems. Future work should test performance across diverse driving scenarios, edge cases, and real-world conditions beyond benchmark datasets. The framework's scalability to larger models and its integration with actual autonomous vehicle stacks remain critical for commercial viability.
- OmniDrive-R1 reaches 73.62% answer accuracy on autonomous driving tasks, up from a 37.81% baseline.
- Interleaved multi-modal chain-of-thought reasoning mitigates object hallucination in Vision-Language Models.
- Annotation-free reinforcement learning rewards cut labeling costs while improving model reliability.
- Perception and reasoning are integrated end-to-end rather than in separate pipeline stages.
- Safety-critical autonomous driving applications may benefit from improved VLM consistency and fewer false detections.