CrossVL: Complexity-Aware Feature Routing and Paired Curriculum for Cross-View Vision-Language Detection
CrossVL is a framework that combines Complexity-Aware Pathway Aggregation with Paired Curriculum Learning to improve vision-language model performance in cross-view object detection. It targets the accuracy loss that occurs when a model operates across ground and aerial viewpoints, and reports higher aerial detection accuracy, a smaller ground-aerial gap, and lower run-to-run variance on the MAVREC dataset.
CrossVL addresses a meaningful technical challenge in computer vision: vision-language models struggle when visual perspectives shift dramatically between ground-level and aerial viewpoints due to differences in scale, occlusion patterns, and spatial organization. The framework recognizes that fixed fusion mechanisms cannot adapt to these geometric variations, proposing instead a dynamic routing system that estimates scene complexity and directs visual features through appropriate processing pathways.
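The routing idea can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the complexity score (visual variance plus mean absolute text activation), the per-pathway `centers`, the softmax gate, and the function names are hypothetical and are not taken from the CrossVL paper, which this summary does not describe at that level of detail.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def complexity_score(visual_feats, text_feats):
    """Hypothetical scene-complexity estimate from multimodal statistics:
    variance of the visual features plus mean absolute text activation."""
    n = len(visual_feats)
    mean_v = sum(visual_feats) / n
    var_v = sum((x - mean_v) ** 2 for x in visual_feats) / n
    mean_abs_t = sum(abs(t) for t in text_feats) / len(text_feats)
    return var_v + mean_abs_t

def route(visual_feats, text_feats, pathways, centers, temperature=1.0):
    """Soft-route features through several pathways.

    Each pathway has an assumed complexity 'center'; pathways whose center
    is closer to the estimated complexity receive a larger gate weight.
    Returns the weighted sum of pathway outputs and the gate weights.
    """
    c = complexity_score(visual_feats, text_feats)
    logits = [-abs(c - mu) / temperature for mu in centers]
    weights = softmax(logits)
    outputs = [p(visual_feats) for p in pathways]
    fused = [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(len(visual_feats))
    ]
    return fused, weights

# Two toy pathways: a light one (identity) and a heavier one (rescaling).
pathways = [lambda v: list(v), lambda v: [2.0 * x for x in v]]
centers = [0.0, 5.0]  # assumed complexity levels each pathway is suited to

fused_simple, w_simple = route([1.0, 1.0, 1.0, 1.0], [0.0, 0.0], pathways, centers)
fused_complex, w_complex = route([-3.0, 3.0, -3.0, 3.0], [1.0, 1.0], pathways, centers)
```

In this toy setup the uniform scene sends most of its weight to the first pathway and the high-variance scene to the second. A soft gate is used here rather than a hard switch so that the routing stays differentiable, which is a design choice of the sketch, not a claim about the original method.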
The technical innovation combines two complementary strategies. Complexity-Aware Pathway Aggregation analyzes multimodal statistics to gauge scene difficulty and route features accordingly, while Paired Curriculum Learning leverages the semantic consistency between synchronized ground-aerial image pairs during training, gradually transitioning to randomized sampling. This dual approach targets both architectural limitations and optimization dynamics.
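The paired-to-randomized training schedule described above might look roughly like the following sampler. This is a sketch under stated assumptions: the linear decay, the `warmup_frac` parameter, and the function names are hypothetical, since the summary does not say how the transition is scheduled.

```python
import random

def paired_probability(epoch, total_epochs, warmup_frac=0.5):
    """Probability of drawing a synchronized ground-aerial pair.

    Assumed schedule: decays linearly from 1.0 to 0.0 over the first
    `warmup_frac` of training, then stays at 0.0 (fully randomized).
    """
    warmup = max(1, int(total_epochs * warmup_frac))
    return max(0.0, 1.0 - epoch / warmup)

def sample_batch(pairs, epoch, total_epochs, batch_size, rng):
    """Build a batch of alternating ground/aerial samples.

    `pairs` is a list of (ground, aerial) items captured in sync.
    Early epochs mostly keep the two views of a pair together; later
    epochs draw the ground and aerial views independently.
    """
    p = paired_probability(epoch, total_epochs)
    ground = [g for g, _ in pairs]
    aerial = [a for _, a in pairs]
    batch = []
    while len(batch) < batch_size:
        if rng.random() < p:
            g, a = rng.choice(pairs)  # synchronized pair
            batch.extend([g, a])
        else:
            # independent draws: the two views need not match
            batch.extend([rng.choice(ground), rng.choice(aerial)])
    return batch[:batch_size]

# Toy data: identifiers stand in for images.
pairs = [(f"g{i}", f"a{i}") for i in range(8)]
rng = random.Random(0)
early_batch = sample_batch(pairs, epoch=0, total_epochs=10, batch_size=6, rng=rng)
late_batch = sample_batch(pairs, epoch=9, total_epochs=10, batch_size=6, rng=rng)
```

At epoch 0 every ground sample in the batch is followed by its own aerial counterpart; by the final epochs the two views are drawn independently, which corresponds to the randomized sampling the curriculum transitions to.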
The empirical results show progress on both accuracy and stability: Florence-2's aerial detection performance improved from 58.66% to 61.03% mAP, a gain of 2.37 percentage points, while the performance gap between ground and aerial views narrowed from 8.63 to 6.65 percentage points. More significantly, variance across random seeds fell by a factor of 3.3, indicating that the framework produces more stable and reproducible results.
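The reported figures can be cross-checked with a few lines of arithmetic. The relative gap reduction below is derived here from the numbers quoted above and is not a figure stated by the source; the variable names are illustrative.

```python
# Figures quoted in the summary (mAP in %, gaps in percentage points).
aerial_before, aerial_after = 58.66, 61.03
gap_before, gap_after = 8.63, 6.65

aerial_gain = round(aerial_after - aerial_before, 2)        # 2.37 pp
gap_reduction = round(gap_before - gap_after, 2)            # 1.98 pp
relative_gap_reduction = round(100 * gap_reduction / gap_before, 1)  # 22.9 %

print(aerial_gain, gap_reduction, relative_gap_reduction)
```

So the ground-aerial gap shrinks by 1.98 percentage points, roughly 23% of its original size.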
For the AI/computer vision community, this work illustrates how awareness of domain-specific geometric challenges can drive architectural innovations beyond standard transfer learning approaches. The reduction in cross-view performance degradation has practical implications for autonomous systems, surveillance, and mapping applications that must operate across multiple altitudes and perspectives. The curriculum learning component also demonstrates how training methodology can be as important as architectural design in handling distributional shifts inherent to real-world deployment scenarios.
- CrossVL improves aerial view object detection by 2.37 mAP percentage points through complexity-aware feature routing and curriculum learning.
- The framework reduces performance variance across random seeds by 3.3x, indicating more stable and reproducible model behavior.
- Paired Curriculum Learning leverages ground-aerial image pair consistency as stable early supervision before transitioning to randomized sampling.
- Cross-view performance gap narrowed from 8.63pp to 6.65pp, demonstrating meaningful progress toward view-invariant detection.
- The approach combines architectural innovations with training methodology improvements rather than relying solely on model scaling.