y0news
#vision-language-action · 5 articles
AI · Bearish · arXiv – CS AI · 6h ago
🧠

LangGap: Diagnosing and Closing the Language Gap in Vision-Language-Action Models

Researchers reveal that state-of-the-art Vision-Language-Action (VLA) models largely ignore language instructions despite achieving 95% success on standard benchmarks. The new LangGap benchmark exposes significant language-understanding deficits, and targeted data augmentation only partially closes them, leaving diverse instruction comprehension a fundamental open challenge.
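For readers who want the intuition behind this kind of diagnosis, here is a minimal sketch (not LangGap's actual protocol): compare success rates under the true instruction, a mismatched one, and an empty one. If the three rates are close, the policy is ignoring language. The `policy`, `tasks`, and `run_episode` interfaces are hypothetical stand-ins for an evaluation harness.

```python
import random

def language_sensitivity(policy, tasks, run_episode, n_trials=50):
    """Success rates under true / mismatched / empty instructions."""
    pool = [t["instruction"] for t in tasks]
    rates = {}
    for condition in ("true", "mismatched", "empty"):
        successes, total = 0, 0
        for task in tasks:
            for _ in range(n_trials):
                if condition == "true":
                    instr = task["instruction"]
                elif condition == "mismatched":
                    instr = random.choice(pool)  # instruction from another task
                else:
                    instr = ""                   # no language signal at all
                successes += int(run_episode(policy, task["env"], instr))
                total += 1
        rates[condition] = successes / total
    return rates
```

A policy that truly follows language should see its "mismatched" and "empty" rates collapse well below the "true" rate.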

AI · Bullish · arXiv – CS AI · 6h ago
🧠

Mean-Flow based One-Step Vision-Language-Action

Researchers developed a Mean-Flow based One-Step Vision-Language-Action (VLA) approach that dramatically improves robotic manipulation efficiency by eliminating iterative sampling. In real-world robotic experiments, the method generates actions 8.7x faster than SmolVLA and 83.9x faster than Diffusion Policy.
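The efficiency claim comes from replacing an iterative sampler with a single network call. Below is a minimal sketch of that contrast, assuming a learned average-velocity field `u_theta` and a standard flow/diffusion velocity field `v_theta`; both signatures are illustrative assumptions, not the paper's interface.

```python
import torch

@torch.no_grad()
def one_step_actions(u_theta, obs_emb, action_dim):
    """Mean-flow style: one network call maps noise to an action chunk."""
    batch = obs_emb.shape[0]
    noise = torch.randn(batch, action_dim)
    t0, t1 = torch.zeros(batch), torch.ones(batch)
    # The learned average velocity over [0, 1] replaces the whole ODE solve.
    return noise - u_theta(noise, obs_emb, t0, t1)

@torch.no_grad()
def iterative_actions(v_theta, obs_emb, action_dim, steps=50):
    """Diffusion/flow baseline: `steps` sequential network calls (Euler)."""
    batch = obs_emb.shape[0]
    x = torch.randn(batch, action_dim)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((batch,), i * dt)
        x = x + dt * v_theta(x, obs_emb, t)
    return x
```

At equal network size, the speedup is roughly the step count the baseline needs, which is where figures like 83.9x over a 50-100 step diffusion policy come from.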

AI · Bullish · arXiv – CS AI · 6h ago
🧠

ATA: Bridging Implicit Reasoning with Attention-Guided and Action-Guided Inference for Vision-Language Action Models

Researchers propose ATA, a training-free framework that improves Vision-Language-Action (VLA) models through implicit reasoning without requiring additional data or annotations. The approach uses attention-guided and action-guided strategies to enhance visual inputs, achieving better task performance while maintaining inference efficiency.
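As a rough illustration of what "attention-guided" input enhancement can mean in a training-free setting, the sketch below crops the image around the most-attended patches so the policy re-reads a zoomed-in view of the task-relevant region. The `patch_attention` map is assumed to come from a hypothetical attention hook on the model; ATA's actual strategies are more involved than this.

```python
import numpy as np

def attention_crop(image, patch_attention, patch=14, keep=0.25, margin=16):
    """image: (H, W, 3) array; patch_attention: (H//patch, W//patch) scores."""
    h, w = patch_attention.shape
    flat = patch_attention.flatten()
    k = max(1, int(keep * flat.size))
    top = np.argsort(flat)[-k:]                    # most-attended patches
    rows, cols = np.unravel_index(top, (h, w))
    y0 = max(0, rows.min() * patch - margin)
    y1 = min(image.shape[0], (rows.max() + 1) * patch + margin)
    x0 = max(0, cols.min() * patch - margin)
    x1 = min(image.shape[1], (cols.max() + 1) * patch + margin)
    return image[y0:y1, x0:x1]                     # zoomed-in task region
```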

AI · Bullish · arXiv – CS AI · 6h ago
🧠

Pri4R: Learning World Dynamics for Vision-Language-Action Models with Privileged 4D Representation

Researchers introduce Pri4R, a new approach that enhances Vision-Language-Action (VLA) models by incorporating 4D spatiotemporal understanding during training. The method adds a lightweight point-track head that predicts 3D point trajectories as an auxiliary training signal; the head is not used at inference, so the original architecture runs unchanged with no added computational overhead while retaining the improved physical-world understanding.
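The training-only auxiliary-head pattern the summary describes can be sketched as follows; the shapes, names, and loss weighting are assumptions for illustration, not Pri4R's actual architecture.

```python
import torch
import torch.nn as nn

class VLAWithTrackHead(nn.Module):
    def __init__(self, backbone, feat_dim, action_dim, n_points, horizon):
        super().__init__()
        self.backbone = backbone              # the VLA trunk, unchanged
        self.action_head = nn.Linear(feat_dim, action_dim)
        # Lightweight auxiliary head: (x, y, z) for n_points over the horizon.
        self.track_head = nn.Linear(feat_dim, n_points * horizon * 3)
        self.n_points, self.horizon = n_points, horizon

    def forward(self, obs, predict_tracks=False):
        feat = self.backbone(obs)
        action = self.action_head(feat)
        if not predict_tracks:                # inference path: zero extra cost
            return action
        tracks = self.track_head(feat).view(-1, self.n_points, self.horizon, 3)
        return action, tracks

def training_loss(model, obs, action_gt, tracks_gt, aux_weight=0.1):
    """Action loss plus a weighted 3D point-track auxiliary loss."""
    action, tracks = model(obs, predict_tracks=True)
    return (nn.functional.mse_loss(action, action_gt)
            + aux_weight * nn.functional.mse_loss(tracks, tracks_gt))
```

The key design choice is that the track head only shapes the backbone's representation through gradients; deployment simply never calls it.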

AI · Bullish · arXiv – CS AI · 6h ago
🧠

Non-Markovian Long-Horizon Robot Manipulation via Keyframe Chaining

Researchers introduce Keyframe-Chaining VLA, a new AI framework that improves robot manipulation on long-horizon tasks by extracting and linking key historical frames to model temporal dependencies. The method addresses a limitation of current Vision-Language-Action models, which struggle with non-Markovian dependencies where the optimal action depends on specific past states rather than the current observation alone.
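One hedged way to picture keyframe chaining: keep a compact chain of past frames whose embeddings differ enough from the last kept keyframe, and condition the policy on that chain plus the current observation. The cosine-distance selection rule and the `embed` interface below are illustrative assumptions, not the paper's method.

```python
import torch
import torch.nn.functional as F

class KeyframeChain:
    def __init__(self, embed, threshold=0.3, max_keyframes=8):
        self.embed = embed                   # frame -> (D,) embedding
        self.threshold = threshold
        self.max_keyframes = max_keyframes
        self.frames, self.embs = [], []

    def update(self, frame):
        e = self.embed(frame)
        if not self.embs or 1 - F.cosine_similarity(
                e, self.embs[-1], dim=0) > self.threshold:
            self.frames.append(frame)        # visually novel: keep as keyframe
            self.embs.append(e)
            if len(self.frames) > self.max_keyframes:
                self.frames.pop(0)
                self.embs.pop(0)

    def context(self, current_frame):
        """Chain of keyframes plus the current frame, fed to the policy."""
        return torch.stack(self.frames + [current_frame])
```

The chain gives the policy access to the specific past states it needs (e.g., which drawer was opened earlier) without storing the full observation history.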