Drive-P2D: A Progressive Perception-to-Decision Benchmark for VLMs in Autonomous Driving
Researchers introduce Drive-P2D, a comprehensive benchmark for evaluating vision-language models in autonomous driving that tests perception and decision-making across progressive complexity levels. The benchmark addresses gaps in existing evaluation methods by separating reasoning analysis from objective answer scoring and identifying specific failure modes that could improve VLM safety for real-world deployment.
Drive-P2D represents a methodological advance in how the AI research community assesses autonomous driving capabilities. Rather than treating perception and decision-making as isolated tasks, the benchmark creates a progressive chain that mirrors how autonomous systems must actually operate—gathering visual information, interpreting scenes, and making safety-critical decisions. This integrated approach exposes brittleness that isolated testing would miss.
The benchmark's design choices address genuine limitations in current evaluation practices. By separating reasoning analysis from objective scoring of final answers, researchers can study failure modes without LLM scoring bias contaminating the results. The 6,650 questions spanning object-level, scene-level, and decision-level tasks provide granular insight into where VLMs break down. Tests on high-risk scenarios, together with robustness checks across similar scenes, reveal whether models achieve genuine understanding or merely pattern-match.
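The separated protocol described above can be illustrated with a minimal sketch. This is not the benchmark's actual implementation: the `Response` record, its field names, the exact-match scoring rule, and the "every question in a similar-scene group must be correct" consistency rule are all assumptions made for illustration. The idea is only that final answers are scored objectively, while reasoning traces are set aside for separate failure-mode analysis.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Response:
    """Hypothetical record for one benchmark question (field names are assumptions)."""
    level: str        # "object", "scene", or "decision"
    scene_group: str  # id shared by questions built on similar scenes
    reasoning: str    # free-text rationale, never used for scoring
    answer: str       # model's final choice, e.g. "B"
    gold: str         # reference answer


def is_correct(r):
    # Objective scoring: case-insensitive exact match on the final answer only.
    return r.answer.strip().upper() == r.gold.strip().upper()


def score_by_level(responses):
    """Per-level accuracy, computed without looking at the reasoning text."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in responses:
        total[r.level] += 1
        correct[r.level] += is_correct(r)
    return {level: correct[level] / total[level] for level in total}


def group_consistency(responses):
    """Fraction of similar-scene groups in which every answer is correct."""
    groups = defaultdict(list)
    for r in responses:
        groups[r.scene_group].append(is_correct(r))
    if not groups:
        return 0.0
    return sum(all(flags) for flags in groups.values()) / len(groups)


def failed_reasoning(responses):
    """Reasoning traces of wrong answers, kept for separate error-mode analysis."""
    return [r.reasoning for r in responses if not is_correct(r)]


responses = [
    Response("object", "g1", "The sign is octagonal and red.", "A", "A"),
    Response("object", "g1", "Same sign, different lighting.", "b", "B"),
    Response("decision", "g2", "The lane ahead looks clear.", "C", "D"),
]
print(score_by_level(responses))
print(group_consistency(responses))
print(failed_reasoning(responses))
```

Keeping the scorer blind to the reasoning text is the design choice that matters here: accuracy stays reproducible, and the reasoning of failed cases can be annotated separately, by hand or by an analyzer model.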
For the autonomous driving industry, this work matters because it provides a more realistic assessment of VLM reliability before deployment. Identifying specific error modes—logical reasoning failures and semantic feature omissions—creates actionable insights for model improvement rather than abstract performance metrics. The development of a lightweight analyzer model for automating error annotation suggests a scalable path toward continuous safety assessment.
As VLMs increasingly enter safety-critical applications, benchmarks like Drive-P2D become essential infrastructure. The framework could inform which VLMs are genuinely suitable for autonomous systems versus those requiring significant refinement. Ongoing research in this space will likely focus on whether identified failure modes can be systematically addressed through training approaches or architectural changes.
- Drive-P2D benchmark evaluates VLMs on integrated perception-to-decision tasks rather than isolated perception or decision-making components.
- Separated reasoning-and-answer protocol prevents LLM scoring bias while enabling detailed analysis of failure modes like logical reasoning errors and semantic omissions.
- Testing across high-risk scenarios and similar-scene robustness reveals whether VLMs achieve genuine understanding or superficial pattern matching.
- Automated error-mode annotation through lightweight analyzer models enables scalable assessment of VLM reliability for autonomous driving.
- Benchmark identifies specific capability boundaries that inform which VLMs are suitable for safety-critical autonomous driving applications.