AI · Bullish · Importance: 6/10

VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning

arXiv – CS AI | Chaoyang Wang, Wenrui Bao, Sicheng Gao, Bingxin Xu, Yu Tian, Yogesh S. Rawat, Yunhao Ge, Yuzhang Shang
AI Summary

Researchers introduce VLA-Thinker, a framework that enhances Vision-Language-Action (VLA) models by enabling dynamic visual reasoning during robotic tasks. The system achieved a 97.5% success rate on the LIBERO benchmark using a two-stage training pipeline that combines supervised fine-tuning with reinforcement learning.
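The takeaways below describe the reinforcement-learning stage as GRPO-based. As a rough illustration of that family of methods, here is a minimal sketch of group-relative advantage normalization, the core idea behind GRPO (Group Relative Policy Optimization); all names are illustrative and this is not the paper's actual training code:

```python
def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages in the style of GRPO.

    Each rollout's reward (e.g. a task-success score for one candidate
    action sequence) is normalized against the mean and standard
    deviation of its group of rollouts sampled from the same prompt.
    """
    g = len(group_rewards)
    mean = sum(group_rewards) / g
    var = sum((r - mean) ** 2 for r in group_rewards) / g
    std = var ** 0.5
    # eps guards against division by zero when all rewards are equal
    return [(r - mean) / (std + eps) for r in group_rewards]
```

Because advantages are computed relative to sampled peers rather than a learned value function, no critic network is needed, which is one reason GRPO is popular for fine-tuning large models.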

Key Takeaways
  • VLA-Thinker treats visual perception as a dynamically invocable reasoning action rather than static context.
  • The framework uses a two-stage training pipeline with supervised fine-tuning followed by GRPO-based reinforcement learning.
  • Achieved a 97.5% success rate on the LIBERO benchmark and strong performance on RoboTwin 2.0 long-horizon tasks.
  • Addresses limitations of existing text-based chain-of-thought reasoning in embodied AI systems.
  • The thinking-with-image approach allows models to actively revisit environments and resolve ambiguities during complex tasks.
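The idea of treating perception as a dynamically invocable action can be sketched as a control loop in which the policy may emit either a motor command or a "look again" action that re-queries the camera before committing. The class and method names here are illustrative assumptions, not the paper's API:

```python
# Hedged sketch: "thinking-with-image" as an invocable reasoning action.
# The policy chooses between re-observing the scene (LOOK) and issuing
# a motor command (ACT_PREFIX); anything else ends the episode.
LOOK = "look"          # re-observe to resolve ambiguity mid-task
ACT_PREFIX = "act:"    # motor commands, e.g. "act:grasp"

def run_episode(policy, env, max_steps=20):
    """Interleave visual re-observation with motor actions."""
    obs = env.observe()
    trace = []
    for _ in range(max_steps):
        decision = policy(obs)
        if decision == LOOK:
            obs = env.observe()  # fresh image: perception as an action
            trace.append(LOOK)
        elif decision.startswith(ACT_PREFIX):
            obs = env.step(decision[len(ACT_PREFIX):])
            trace.append(decision)
        else:  # e.g. "done"
            break
    return trace
```

The contrast with static text-based chain-of-thought is that the observation is not fixed at the start of reasoning: the model can actively revisit the environment whenever its plan becomes ambiguous.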