CLEAR: Cognition and Latent Evaluation for Adaptive Routing in End-to-End Autonomous Driving
Researchers introduce CLEAR, a new framework for autonomous driving that combines fast generative planning with semantic reasoning to address the latency problems of diffusion models. By replacing iterative denoising with single-step conditional drift in VAE latent space and fine-tuning language models for scene understanding, the system achieves state-of-the-art performance on the NAVSIM benchmark without sacrificing multi-modal trajectory generation.
CLEAR targets a central bottleneck in autonomous driving development: the tension between model expressiveness and real-time safety requirements. Diffusion models excel at capturing diverse, realistic driving behaviors, but their iterative denoising process adds latency that is hard to reconcile with safety-critical systems that must decide immediately. CLEAR addresses that gap by collapsing multi-step denoising into a single-step conditional drift within a learned VAE latent space, so that generating a trajectory no longer requires a chain of sequential denoising passes.
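The latency argument can be made concrete with a toy comparison. The sketch below is not CLEAR's model: `toy_network`, `CountingNetwork`, and both samplers are hypothetical stand-ins, and the "network" is a trivial function that pulls a latent vector toward a conditioning vector. It only illustrates that an iterative sampler pays one sequential network call per step, while a single-step map pays one call in total.

```python
from typing import Callable, List

Vector = List[float]


def toy_network(z: Vector, cond: Vector) -> Vector:
    """Hypothetical stand-in for a learned network: direction from z to cond."""
    return [c - x for x, c in zip(z, cond)]


class CountingNetwork:
    """Wraps a network function and counts how often it is evaluated."""

    def __init__(self, fn: Callable[[Vector, Vector], Vector]) -> None:
        self.fn = fn
        self.calls = 0

    def __call__(self, z: Vector, cond: Vector) -> Vector:
        self.calls += 1
        return self.fn(z, cond)


def iterative_sample(net: CountingNetwork, z0: Vector, cond: Vector, steps: int) -> Vector:
    """Multi-step refinement: each step needs the previous step's output."""
    z = list(z0)
    for _ in range(steps):
        update = net(z, cond)
        z = [x + u / steps for x, u in zip(z, update)]
    return z


def single_step_sample(net: CountingNetwork, z0: Vector, cond: Vector) -> Vector:
    """Single-step map: one network evaluation moves z0 to the output."""
    update = net(z0, cond)
    return [x + u for x, u in zip(z0, update)]


if __name__ == "__main__":
    cond = [1.0, -2.0]
    multi = CountingNetwork(toy_network)
    single = CountingNetwork(toy_network)
    iterative_sample(multi, [0.0, 0.0], cond, steps=20)
    single_step_sample(single, [0.0, 0.0], cond)
    print("iterative calls:", multi.calls, "single-step calls:", single.calls)
```

Because the iterative calls depend on one another, they cannot be parallelized across steps, which is why step count translates fairly directly into latency.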
The framework's core is a hybrid architecture that combines Drive-JEPA for visual encoding with fine-tuned Qwen language models for scene comprehension. By extracting scene-aware hidden states, CLEAR can make routing decisions that depend on context: an Adaptive Scheduler dynamically selects conditioning coefficients and sample counts, while a cross-attention scorer selects the best trajectory from the generated candidates. This conditional approach preserves behavioral diversity where it is needed while keeping precision in safety-critical decisions.
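The routing idea described above can be sketched in a few lines. Everything here is an illustrative assumption rather than CLEAR's implementation: the scalar `ambiguity` input, the linear rule in `adaptive_schedule` (more ambiguous scenes get more samples and weaker conditioning), the noise-based `sample_candidates`, and the use of plain scaled dot-product attention weights as a stand-in for the cross-attention scorer.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vector = List[float]


@dataclass
class Schedule:
    num_samples: int   # how many candidate trajectories to draw
    cond_coeff: float  # how strongly candidates are pulled toward the scene


def adaptive_schedule(ambiguity: float) -> Schedule:
    """Hypothetical rule mapping scene ambiguity in [0, 1] to a schedule."""
    a = min(max(ambiguity, 0.0), 1.0)
    return Schedule(num_samples=1 + round(7 * a), cond_coeff=1.0 - 0.5 * a)


def sample_candidates(scene: Vector, sched: Schedule, rng: random.Random) -> List[Vector]:
    """Draw candidates in one shot each: noise moved toward the scene vector."""
    candidates = []
    for _ in range(sched.num_samples):
        noise = [rng.gauss(0.0, 1.0) for _ in scene]
        candidates.append(
            [n + sched.cond_coeff * (s - n) for n, s in zip(noise, scene)]
        )
    return candidates


def attention_scores(query: Vector, keys: Sequence[Vector]) -> List[float]:
    """Softmax of scaled dot products between one query and each candidate."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(scene: Vector, ambiguity: float, seed: int = 0) -> Tuple[Vector, Schedule]:
    """Pick a schedule, sample candidates, and return the top-scoring one."""
    rng = random.Random(seed)
    sched = adaptive_schedule(ambiguity)
    candidates = sample_candidates(scene, sched, rng)
    weights = attention_scores(scene, candidates)
    best = max(range(len(candidates)), key=weights.__getitem__)
    return candidates[best], sched
```

Calling `route([1.0, 2.0], ambiguity=0.0)` draws a single, fully conditioned candidate, while `ambiguity=1.0` draws eight looser candidates and lets the scorer choose among them. In this sketch, the schedule is a runtime input, so behavior can be tuned per scene without touching model weights.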
The 93.7 PDMS score on NAVSIM v1 demonstrates competitive performance without relying on dense geometric annotations or extensive sampling. This efficiency matters for deployment because it reduces computational overhead on in-vehicle edge hardware. The result suggests that end-to-end planning need not incur the computational expense previously assumed necessary.
For the autonomous vehicle industry, this suggests a path toward production-ready systems that maintain behavioral flexibility without unacceptable inference costs. The approach could accelerate adoption timelines for Level 4-5 autonomy, though real-world validation beyond benchmark environments remains essential before it can be trusted for broader deployment.
- CLEAR achieves 93.7 PDMS on NAVSIM v1 by replacing iterative diffusion denoising with single-step latent space operations
- Integration of fine-tuned language models enables scene-aware decision-making that dynamically balances diversity and precision
- The framework eliminates dependency on dense geometric annotations, reducing annotation overhead for training data
- Single-step inference substantially reduces latency while maintaining multi-modal trajectory generation capabilities
- Adaptive scheduling mechanism allows context-dependent performance tuning without model retraining