PANDO: Efficient Multimodal AI Agents via Online Skill Distillation
PANDO introduces an efficient multimodal AI agent framework that improves performance while reducing computational costs through online skill distillation, achieving 58.3% success on VisualWebArena tasks with 58-61% fewer tokens than competing approaches. The system addresses inefficiencies in web agent design by maintaining a skill library and employing hierarchical routing, visual compression, and cache-aware prompting without requiring expensive pre-evaluation.
PANDO represents a meaningful shift in how multimodal AI agents optimize performance-to-cost ratios. Rather than pursuing brute-force approaches that increase inference-time computation through rollout search and specialist model stacks, this framework demonstrates that agents can become more efficient as they learn from experience. The research identifies three concrete inefficiency sources—repeat-action loops, hidden discovery costs, and poor prompt-cache reuse—that plague current web agents, then systematically addresses each through targeted architectural choices.
The broader context shows the AI research community increasingly treating computational efficiency as being as critical as raw capability. Previous approaches like SGV and WALT prioritized performance metrics while accumulating substantial token overhead, a trajectory that becomes unsustainable as agent complexity grows. PANDO's single-rollout design with online skill distillation suggests that structured learning mechanisms can substitute for brute-force inference.
For developers and researchers building autonomous agents, this work provides immediate practical value. The introduction of trajectory-level efficiency metrics—Action Repetition Rate, Step Overhead Ratio, and Prompt Cache Utilization—enables more nuanced evaluation beyond binary success rates. This methodological contribution may reshape how the community benchmarks agent quality.
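The summary names these three metrics but does not give their formulas. The following is only a minimal sketch of how such trajectory-level metrics could be computed from a logged episode. The `Step` record, the consecutive-repeat definition of Action Repetition Rate, the reference-length definition of Step Overhead Ratio, and the token-weighted definition of Prompt Cache Utilization are all assumptions made for illustration, not PANDO's published definitions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One logged agent step (hypothetical record layout)."""
    action: str          # serialized action, e.g. "click(#submit)"
    prompt_tokens: int   # prompt tokens sent to the model at this step
    cached_tokens: int   # prompt tokens served from the prompt cache


def action_repetition_rate(steps: List[Step]) -> float:
    """Assumed definition: share of steps that repeat the previous action."""
    if len(steps) < 2:
        return 0.0
    repeats = sum(
        1 for prev, cur in zip(steps, steps[1:]) if prev.action == cur.action
    )
    return repeats / (len(steps) - 1)


def step_overhead_ratio(steps: List[Step], reference_steps: int) -> float:
    """Assumed definition: steps taken divided by a reference step count."""
    if reference_steps <= 0:
        raise ValueError("reference_steps must be positive")
    return len(steps) / reference_steps


def prompt_cache_utilization(steps: List[Step]) -> float:
    """Assumed definition: cached prompt tokens over all prompt tokens."""
    total = sum(s.prompt_tokens for s in steps)
    if total == 0:
        return 0.0
    return sum(s.cached_tokens for s in steps) / total
```

Under these assumed definitions, a lower Action Repetition Rate and Step Overhead Ratio and a higher Prompt Cache Utilization would indicate a more efficient trajectory, independent of whether the task ultimately succeeded.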
The implications extend beyond web agents to any multimodal AI system facing real-time performance constraints. As large language models become commodity infrastructure, competitive advantage increasingly derives from efficient orchestration rather than model scale. Future development will likely focus on whether PANDO's skill-library approach generalizes beyond web automation and how it performs on emerging, more complex benchmark tasks.
- PANDO achieves 58.3% success rate on VisualWebArena while using 58-61% fewer tokens than comparable systems, demonstrating efficiency gains through experience rather than increased compute.
- Online skill distillation with a structured skill library enables agents to improve efficiency incrementally without expensive offline discovery phases.
- Three newly introduced trajectory-level metrics (Action Repetition Rate, Step Overhead Ratio, Prompt Cache Utilization) provide more granular evaluation of agent efficiency beyond terminal success.
- The framework combines progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, and cache-aware prompting to address identified inefficiency sources.
- Results suggest architectural optimization and learned routing can outperform multi-rollout search approaches, shifting focus from brute-force inference to structured learning.
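
To make the idea of a skill library with confidence-based demotion more concrete, here is a small illustrative sketch. It is not PANDO's implementation: the `Skill` and `SkillLibrary` classes, the Laplace-smoothed success rate used as the confidence score, and the `demote_below` threshold of 0.3 are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Skill:
    """A reusable skill with running outcome counts (hypothetical layout)."""
    name: str
    successes: int = 0
    failures: int = 0
    active: bool = True

    @property
    def confidence(self) -> float:
        # Laplace-smoothed success rate (an assumed scoring rule);
        # a skill with no history starts at 0.5.
        return (self.successes + 1) / (self.successes + self.failures + 2)


class SkillLibrary:
    def __init__(self, demote_below: float = 0.3) -> None:
        self.demote_below = demote_below  # assumed demotion threshold
        self.skills: Dict[str, Skill] = {}

    def add(self, name: str) -> None:
        """Register a skill distilled from a trajectory, if not already known."""
        self.skills.setdefault(name, Skill(name))

    def record(self, name: str, success: bool) -> None:
        """Update a skill's counts online and demote it if confidence drops."""
        skill = self.skills[name]
        if success:
            skill.successes += 1
        else:
            skill.failures += 1
        if skill.confidence < self.demote_below:
            skill.active = False

    def route(self, candidates: List[str]) -> Optional[str]:
        """Return the most confident active candidate, or None to signal
        that the agent should fall back to primitive actions."""
        active = [
            self.skills[n]
            for n in candidates
            if n in self.skills and self.skills[n].active
        ]
        if not active:
            return None
        return max(active, key=lambda s: s.confidence).name
```

In this sketch a skill that keeps failing stops being offered to the router, so the agent would no longer spend tokens retrying it. How the actual system scores, demotes, or reinstates skills is not stated in this summary.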