ACTIVE-o3: Empowering MLLMs with Active Perception via Pure Reinforcement Learning
Researchers introduce ACTIVE-o3, a reinforcement learning framework that enables Multimodal Large Language Models (MLLMs) to actively perceive and intelligently select regions of interest for visual analysis. The system outperforms GPT-o3's zoom strategy while maintaining general understanding capabilities, with applications spanning robotics, autonomous driving, and remote sensing.
ACTIVE-o3 addresses a fundamental limitation in how MLLMs process visual information. Current systems like GPT-o3 rely on coarse zoom strategies that are neither efficient nor precise, so they struggle with tasks that require focused attention on specific image regions. The research tackles this by building active perception, a core capability in human vision and embodied AI, directly into MLLMs through reinforcement learning.
The technical approach builds on GRPO (Group Relative Policy Optimization, a reinforcement learning algorithm), combined with a modular sensing-action design and dual-form rewards. Rather than requiring explicit supervision for region selection, the system learns on its own where to focus, which makes it more scalable and generalizable. This marks a shift in how foundation models handle multimodal reasoning: the model decides where to look instead of processing a fixed view.
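To make the GRPO step concrete, here is a minimal sketch of the group-relative advantage it is named for: several candidate outputs are sampled for the same prompt, each is scored, and every score is normalized against the mean and standard deviation of its own group rather than against a learned value function. The `combined_reward` helper, its `format_weight` parameter, and the sample scores are illustrative assumptions, not the paper's actual dual-form reward.

```python
from statistics import mean, pstdev


def combined_reward(task_score, format_ok, format_weight=0.1):
    """Hypothetical two-part reward (an assumption, not the paper's definition):
    a task-quality score plus a small bonus for well-formed output."""
    return task_score + (format_weight if format_ok else 0.0)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group.

    GRPO uses the group's own mean and standard deviation as the baseline,
    so no separate value network is needed. Population standard deviation
    is used here as a simple choice; eps avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled region proposals.
# Each tuple is (task_score, format_ok); the values are made up for illustration.
samples = [(0.8, True), (0.2, True), (0.5, False), (0.5, True)]
rewards = [combined_reward(score, ok) for score, ok in samples]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.3f}")
```

Proposals that score above their group's average receive a positive advantage and are reinforced, while those below average are discouraged. This is how a policy can improve its region selection from reward signals alone, without labeled target regions.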
The benchmark suite spans diverse domains: open-world tasks such as small-object detection and dense-object grounding, plus domain-specific challenges in remote sensing, autonomous driving, and interactive segmentation. The reported results show significant improvements over baselines while preserving the model's general knowledge, an important check that specialized perception capabilities do not degrade broader reasoning.
For the AI industry, this work bridges a gap between perception and planning in embodied systems. As MLLMs increasingly serve as central decision-makers in robotics and autonomous systems, active perception becomes essential infrastructure. Because the framework preserves general abilities while adding specialized perception, it suggests a reusable pattern for other capability enhancements. The dual benefit of improved perception and usefulness as a pretraining proxy points to potential productivity gains in perception-heavy training pipelines.
- ACTIVE-o3 uses reinforcement learning to learn efficient region-selection strategies on its own, without explicit supervision
- The framework significantly outperforms GPT-o3's zoom strategy across multiple vision tasks, including object grounding and autonomous driving
- Active perception enhancement preserves general understanding capabilities and improves performance on broader benchmarks like RealWorldQA
- The modular design enables application across diverse domains, from robotics to remote sensing, without task-specific retraining
- Perception-focused training serves as a proxy task that improves downstream performance on unrelated multimodal benchmarks