
Bring My Cup! Personalizing Vision-Language-Action Models with Visual Attentive Prompting

arXiv – CS AI|Sangoh Lee, Sangwoo Mo, Wook-Shin Han|June 19, 2026 at 04:00 AM

🤖AI Summary

Researchers introduce Visual Attentive Prompting (VAP), a training-free method that enables Vision-Language-Action models to perform personalized object manipulation tasks by using reference images to identify specific instances of objects. The approach bridges the gap between semantic understanding and instance-level control, allowing robots to execute commands like 'bring my cup' by distinguishing target objects from visually similar alternatives without requiring model retraining.

Analysis

The paper addresses a critical limitation in current Vision-Language-Action models: their inability to distinguish between specific instances of visually similar objects when executing personalized commands. Traditional VLA systems excel at understanding general instructions but fail when tasked with identifying user-specific objects unseen during training. This represents a fundamental gap between how humans interact with robots—through personalized references—and how current AI systems operate.

VAP follows the broader trend of adapting foundation models without expensive retraining cycles. Rather than fine-tuning the VLA model, VAP leaves it frozen and acts as a perceptual adapter that treats reference images as non-parametric visual memory. The system uses open-vocabulary detection and embedding-based matching to ground personal objects in the scene, then feeds that grounding back to the model through visual prompting and instruction rewriting. The approach reflects the growing emphasis on prompt-based adaptation in AI systems.
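The matching and rewriting steps described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `Detection` type, the max-over-references cosine score, the `threshold` value, and the rewritten phrase "the highlighted cup" are all assumptions made for illustration, and the embeddings are assumed to come from some upstream detector and image encoder that is not shown.

```python
import math
from dataclasses import dataclass


@dataclass
class Detection:
    """A candidate object from an open-vocabulary detector (hypothetical type)."""
    box: tuple        # (x0, y0, x1, y1) in pixels
    embedding: list   # feature vector of the cropped region


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_personal_object(reference_embeddings, detections, threshold=0.5):
    """Return the detection most similar to any reference image, or None.

    Assumption: each detection is scored by its best cosine similarity
    against the reference set, and scores at or below `threshold` are rejected.
    """
    if not reference_embeddings:
        return None
    best, best_score = None, threshold
    for det in detections:
        score = max(cosine(det.embedding, ref) for ref in reference_embeddings)
        if score > best_score:
            best, best_score = det, score
    return best


def rewrite_instruction(instruction, personal_phrase, generic_phrase):
    """Swap the personal reference for a phrase a generic policy can follow."""
    return instruction.replace(personal_phrase, generic_phrase)


# Toy usage: two candidate cups, one reference image of the user's cup.
candidates = [
    Detection(box=(10, 10, 50, 50), embedding=[1.0, 0.0]),
    Detection(box=(60, 10, 100, 50), embedding=[0.0, 1.0]),
]
target = match_personal_object([[0.9, 0.1]], candidates)
prompt = rewrite_instruction("bring my cup", "my cup", "the highlighted cup")
```

In a full pipeline, the box of the matched detection would then be used to mark the target in the camera image before the frozen VLA model sees it; how that visual prompt is drawn is left out here because the summary does not specify it.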

For robotics developers and deployment scenarios, this work enables practical personalization of robotic systems without model modification, reducing computational overhead and implementation complexity. The construction of multiple benchmarks—Personalized-SIMPLER, Personalized-VLABench, and real-world tabletop evaluations—demonstrates experimental rigor and provides evaluation frameworks for future work. The consistent outperformance over baseline methods suggests viable pathways for commercial robotic applications requiring user-specific object manipulation.

Future development hinges on scaling VAP to more complex visual scenes, handling occlusion robustly, and validating performance across diverse real-world environments beyond tabletop settings. The work establishes foundational methodology that could accelerate practical deployment of personalized robotic assistants.

Key Takeaways

→VAP enables frozen Vision-Language-Action models to identify and manipulate user-specific objects using only reference images, with no retraining.
→The training-free approach uses reference images as visual memory combined with open-vocabulary detection and embedding matching for instance-level grounding.
→Two simulation benchmarks and real-world evaluation show VAP outperforms generic policies and token-learning baselines in personalized manipulation tasks.
→The method addresses the critical gap between semantic understanding and instance-level control in robotic object manipulation.
→This work demonstrates practical personalization of foundation models through prompt-based adaptation, without the computational overhead of fine-tuning.

